Power is the most underestimated resource required to operate a datacenter. Many believe the UPS or diesel generator will do the job when utility power fails.
Wrong! There are many documented failures in which a datacenter went ‘all lights out’ because the power feed failed.
UPS failure remains the leading cause of data center downtime. Failures of UPS systems and their batteries caused 25 percent of the outages reported by survey participants in 2015, up from 24 percent in 2013 but down from 29 percent in 2010. (source)
This post provides a summary of some known failures. Many other failures were never made public.
On July 24, 2007, at 1:47 p.m., 365 Main’s San Francisco data center was hit by a power surge caused when transformer breakers at a local PG&E power station unexpectedly opened; PG&E never determined what caused the breakers to open. Normally, a utility outage triggers 365 Main’s rigorously maintained and tested back-up diesel generators to start up and take over the power supply to customers. The San Francisco facility has ten 2.1 MW back-up generators: eight primary generators can power the entire building, with two on stand-by in case any of the primary eight fail. However, following the power outage, three of 365 Main’s ten back-up generators, manufactured by Hitec, failed to complete their start sequence.
After days of round-the-clock testing, the team discovered a weakness in an essential component of the back-up generator system known as the DDEC (Detroit Diesel Electronic Controller).
The team found a setting in the DDEC that prevented the component from correctly resetting its memory. Erroneous data left in the DDEC’s memory subsequently caused misfiring or engine start failures when the generators were called on to start during the July 24 power outage.
These Hitec units *could* have been the same type as the ones British Airways used.
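The 365 Main plant is an N+2 arrangement: eight of the ten 2.1 MW units are enough to carry the building, so the site survives any two simultaneous generator failures, but not three. A minimal sketch of that capacity check (illustrative only; the function name and kW figures are assumptions, not 365 Main’s actual control logic):

```python
# Illustrative capacity check for an N+2 generator plant.
# Not real switchgear logic -- just the arithmetic behind the redundancy claim.

def can_carry_load(units_kw, load_kw, failed):
    """Return True if the surviving generators can still supply load_kw.

    units_kw: rated output of each generator in kW
    failed:   set of indices of generators that failed to start
    """
    available = sum(kw for i, kw in enumerate(units_kw) if i not in failed)
    return available >= load_kw

# Ten 2.1 MW (2100 kW) units; eight are needed to power the building,
# so the critical load is at most 8 * 2100 = 16800 kW.
plant = [2100] * 10
load = 8 * 2100

print(can_carry_load(plant, load, failed={0, 1}))     # two failures: still covered -> True
print(can_carry_load(plant, load, failed={0, 1, 2}))  # three failures, as on July 24 -> False
```

With three units out, only 14.7 MW of the required 16.8 MW remains, which is exactly why three failed start sequences took the site down despite the two stand-by units.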
In June 2008 a “The Planet” datacenter in Houston hosting 9000 servers was down for many days. Electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding the electrical equipment. The fire brigade did not allow the company to switch over to secondary power. (source)
On Sunday, October 11, 2009, at 9:30 AM, a generator failed during planned maintenance at IBM’s commercial data center in Newton, outside Auckland, New Zealand. Local media reported that a failed oil pressure sensor on a backup generator was the likely cause. (source)
The outage severely impacted Air New Zealand, which had outsourced its mainframe and mid-range systems to IBM. The shutdown hit airport check-in systems, online bookings and call center systems, affecting over 10,000 passengers and throwing airports into disarray, according to local media reports.
On January 19, 2010 a NaviSite data center in Silicon Valley was without power for an hour after severe storms knocked out the facility’s utility feed. NaviSite’s San Jose data center lost utility power from PG&E at 4:45 a.m. Pacific time, and backup power systems failed to operate as designed. (source)
In July 2010 a datacenter hosting Wikipedia was down for one hour due to a power failure. (source)
On January 20, 2012 a power outage in an Equinix data center in California caused problems for a number of customers, most notably Zoho, which experienced hours of downtime for several of its web-based office applications. Equinix acknowledged the incident, but did not provide details on the cause of the outage at its SV4 facility in Silicon Valley.
In June 2012 an AWS datacenter was without power because of a faulty breaker. (source)
On July 10, 2012 Salesforce was down due to power issues in an Equinix data center in Silicon Valley. (source)
On June 29, 2012 a power outage at an Amazon Web Services data center in Virginia caused service outages for many online businesses, including popular ones like Netflix and Pinterest. The outage was caused by a back-up generator failure, the company said in a summary of the incident posted on the website of Amazon’s cloud-services business. (source)
Hurricane Sandy in October 2012 caused many issues for datacenters because of flooding. (source)
On January 3, 2013 Sears suffered a power outage in one of its datacenters. The problems began with the failure of one of four uninterruptible power supplies that keep juice flowing to the data center while protecting the equipment from electrical sags and spikes. All four power supplies subsequently failed, followed by the failure of a bypass power setup, shutting down Sears’ computer systems, including its website, before the company could restore power with generators and bring the computers back up.
The five-hour failure, in the rush just after the holidays, cost Sears $1.58 million in profit, according to the lawsuit. (Sears did $12.3 billion in sales during the fourth quarter but lost $489 million.) The server farm ran on generators for eight days, burning through $189,000 in diesel fuel. (source)
On January 24, 2013 Sears had more problems. Three of the four power supplies failed again, shutting down the data center until Sears could fire up its generators and get its computer systems back online. Sears didn’t specify how long the system was down, but says it lost $630,000 in profit in the meantime. The generator ultimately failed, and Sears had to rent one for $13,500 per week.
Not a datacenter this time, but in February 2013 the Super Bowl went dark when power failed. The cause was a switchgear issue. (source) Delta Airlines suffered from the same issue in 2016.
In February 2015 something went wrong in a BT datacenter in Nieuwegein, the Netherlands, which housed critical computers controlling Dutch railway junctions and signage. After maintenance on the power feed, things went wrong: the power supply of the storage system failed, as did other power supplies. More here.
In August 2015 Fujitsu acknowledged that a major power transformer failure took down its Sunnyvale data center on August 22nd. The failure was not within the data center itself but in the substation that feeds the facility, which is under the control of the energy provider, not Fujitsu. The external failure appears to have caused cascading failures within the data center proper. (source)
In mid-November 2015 a Telecity datacenter in London was down for several hours. The Register reported:
Telecity has suffered a major outage at one of its London data centres this afternoon, which knocked out a whole host of VoIP firms’ services, made Amazon wobble and borked its Direct Connect service. A source told The Register that the outage, which happened at around 2pm, knocked out four floors at Telecity’s Sovereign House
Someone wrote in the comments section:
Both UPS channels went offline in a cascade failure due to loading. Then the transfer to mains was disruptive and the transfer back to UPS failed.
And the second attempt to switch back has also failed and that involved switching it off and on again so it was a proper IT fix, not some bodge.
Currently it’s running on utility power. It’s not the first time the UPS systems at Sovereign House have gone out like this either…
Some more details here.
In January 2016 a Verizon datacenter which hosts US airline JetBlue was down for several hours because of a power failure. The failure caused flight delays and shut down the airline’s website, along with its online booking and check-in systems. (source)
In June 2016 an AWS datacenter in Sydney, Australia went all black. In a post-mortem of the event AWS revealed that a diesel rotary uninterruptible power supply (DRUPS) had failed to properly switch to its reserve power when the utility power fell over. The post-mortem is here. More news here.
In July 2016 the same Telecity datacenter in London had power issues for a short time (link)
In July 2016 a datacenter hosting Comcast was down due to power issues (source)
On July 21, 2016 a National Science Foundation datacenter was down (source)
In August 2016 a Delta Airlines datacenter in Atlanta failed because of power issues. Described here.
On September 10, 2016 a Global Switch data center outage in the London Docklands was traced to a cable end box on a UPS system. (Datacenterknowledge)
On May 24, 2017 a Capita datacenter in the UK went all black. Described here and here.
On May 27, 2017 a British Airways datacenter failed because of UPS issues. Described here.
On March 31, 2017 a Microsoft Azure datacenter had issues because a UPS failed. The UPS powered cooling equipment; with cooling out of action, heat increased and servers powered down. (source)
Some more datacenter power failures here.