A reconstruction of the Delta Airlines datacenter outage

Delta Airlines suffered a major datacenter outage at the airline's Technology Command Center in Atlanta on August 8, 2016. The datacenter lost power at about 2:30 a.m. ET, which caused Delta to implement a ground stop: all departing flights were held worldwide. Delta canceled roughly 1,000 flights on Monday and another 775 on Tuesday as it worked to get its systems fully operational again. It also canceled more than 300 flights on Wednesday and a handful on Thursday. Delta operates 800+ aircraft.

Delta even used its private jet fleet on 7 flights to fly 40 passengers to their destinations.

At 8:40 a.m. ET on August 8, Delta said the ground stop had been lifted but that only “limited” departures had resumed.

So how could this happen? A reconstruction

This article tries to explain what *could* have happened. I have no insight into what actually happened or into the root cause.

Delta uses many computer systems for its operation.

The Travelport system that Delta Airlines uses for flight reservations was not affected. The software runs on a Delta-managed z/TPF mainframe. At some airports, staff familiar with the Travelport system were able to manually check in passengers. The most likely reason this system was not affected by the power outage is that it was located in a different area of the Atlanta datacenter, or in a different datacenter altogether. Before being acquired by Delta, Travelport was a separately managed organization.

The Delta Airlines computer systems responsible for online check-in, kiosks, flight dispatching, crew scheduling, airport-departure information displays, ticket sales, frequent-flier programs and flight info displays are all located in a single datacenter in Atlanta, Georgia. Most likely for cost reasons, Delta Airlines decided not to operate a twin-datacenter concept. Atlanta has not been hit by any serious earthquakes or floods in the past. The Atlanta area has had hurricanes and tornadoes (like in 2008), but not at a scale that can damage a datacenter.

So the financially responsible management at Delta probably believed a single datacenter was the best option.

The Delta datacenter is most likely located at 760 Doug Davis Drive, Hapeville, next door to Hartsfield-Atlanta airport. The datacenter was sold by Delta in 2012 to Digital Realty Trust and leased back by Delta.

The information below was copied from cloudrfp.com

760 Doug Davis is three stories and was constructed in 1991. The property is located in Hapeville, GA, a suburb of Atlanta, and is directly adjacent to the Atlanta airport. The property measures 334,306 rentable square feet, with 188,568 square feet of raised floor, on 9.50 acres of land. The single building is leased to three customers in the transportation and communication industries. (note: Delta Airlines, Travelport and an unknown customer)

The facility’s electrical infrastructure is supported by an on-site utility owned substation. The substation is fed by two diverse 20 kV feeders, providing 10 MVA of utility power at 2N. The property was designed to be scalable as additional electrical and mechanical capacity is required.

Situated in one of the major emerging markets, Atlanta offers electric rates that are significantly lower than the national average, and a favorable tax structure catered towards Data Centers.

The datacenter has 2 utility feeds from separate utility grids. On the Google Earth satellite image, 8 generator exhausts can be spotted on the roof. Four of these generators should be able to power the datacenter; the other four are there for redundancy.

So the facility had 4 independent power sources: utility 1, utility 2, generator bank 1 and generator bank 2.
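To make the generator redundancy concrete, here is a minimal sketch. The numbers are assumptions inferred from the satellite image (8 generators visible, 4 presumed sufficient for the full load), not Delta's actual design:

```python
# Rough N+N redundancy sketch (illustrative only, not Delta's actual design).
# With 8 generators of which only 4 are needed to carry the full load,
# the site can lose up to 4 generators and still stay up, assuming the
# failures are independent and the switchgear feeding them keeps working.

TOTAL_GENERATORS = 8     # counted from the roof exhausts (assumption)
REQUIRED_FOR_LOAD = 4    # assumed number needed to carry the full IT load

def site_survives(failed_generators: int) -> bool:
    """True if enough generators remain to carry the load."""
    return TOTAL_GENERATORS - failed_generators >= REQUIRED_FOR_LOAD

for failed in range(TOTAL_GENERATORS + 1):
    print(f"{failed} generator(s) down -> datacenter stays up: {site_survives(failed)}")
```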

Timeline of events

On August 8 at around 2:30 a.m., Delta IT staff performed a routine scheduled switch to the backup generator. This in itself is good practice: Delta IT staff regularly tests the backup power. This time, however, the test resulted in a spike which caused a fire in an Automatic Transfer Switch (ATS). The fire brigade was called, and it took the firefighters a while to extinguish the fire. About 500 servers were shut down because they no longer had power.

Georgia Power stated other Georgia Power customers were not affected because it was an issue with Delta equipment, and said Georgia Power crews were on site working with Delta to repair the equipment.

Delta was able to restore power quickly; it is unknown how long this took. But when this happened, critical systems and network equipment didn't switch over to backup power. Around 300 of about 7,000 datacenter components were discovered not to have been configured appropriately for the available backup power.

Datacenter power architecture

Any professional datacenter has two independent power sources: first the utility power, and second backup power supplied by diesel generators. The power feed into the datacenter is controlled by what is called an automatic transfer switch, or ATS. The primary purpose of an ATS is to divert the path of electricity from one source to another.

The image below (credit Sun) shows the power flow in a datacenter. In this image, two ATSes are shown.

The UPS makes sure that components in the datacenter receive power while the ATS switches from utility power to generator power. This is needed because generators require some time to start up after utility power fails.

[Image: datacenter power flow with two ATSes (credit Sun)]
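To illustrate the ATS/UPS interplay described above, here is a minimal, purely illustrative sketch of the failover sequence. The timings and names are assumptions for illustration, not Delta's actual configuration:

```python
# Minimal model of the failover sequence: utility power fails, the UPS
# batteries carry the load while the generators spin up, then the ATS
# transfers the feed to generator power. Timings are assumptions.

GENERATOR_START_SECONDS = 15   # assumed diesel generator start-up time
UPS_RUNTIME_SECONDS = 600      # assumed battery runtime at full load

def surviving_source(utility_ok: bool, generator_ready_after: float,
                     ups_runtime: float) -> str:
    """Return which source ends up feeding the load."""
    if utility_ok:
        return "utility"
    # Utility is down: the UPS bridges the gap while the generators start.
    if generator_ready_after <= ups_runtime:
        return "generator (ATS transferred the feed while the UPS bridged the gap)"
    # Generators did not come up in time: batteries drain and the load drops.
    return "none: UPS batteries exhausted before the generators were ready"

print(surviving_source(utility_ok=False,
                       generator_ready_after=GENERATOR_START_SECONDS,
                       ups_runtime=UPS_RUNTIME_SECONDS))
```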


As the Delta datacenter had no power after the fire in the ATS, it seems Delta did not have a redundant power architecture. And if it did, the backup ATS certainly did not activate as it should have. So when the sole/primary ATS failed, power from neither the utility nor the generators could feed the servers, network gear, storage, etc.

Another possible cause is an issue with a switchboard. To perform maintenance on the ATS, the ATS can be bypassed using a switchboard. It could be that the switchboard was in the same room as the ATS and was damaged by the fire.


Taken from Reddit

Power flows like this on a datacenter bus:

Utility -> Switch board -> ATS -> Switchboard -> UPS -> Switch board -> Floor PDU -> Rack PDU

Each switchboard allows you to maintenance-bypass the downstream device. E.g., I could replace the ATS by maintenance bypassing on the Utility/ATS switchboard. This is great…

…except when I have a fire and they’re in the same room. The switchboard was likely near the ATS and they couldn’t bypass. “Did not have a redundant power architecture” means there wasn’t a full separate bus. They had an ATS fire, no power was getting to the UPS room, and eventually the batteries died.

They should have had a system like this:

Utility -> Switch board -> ATS -> Switchboard -> UPS -> Switch board -> Floor PDU -> Rack PDU A-Side

Utility -> Switch board -> ATS -> Switchboard -> UPS -> Switch board -> Floor PDU -> Rack PDU B-Side

Then each server has dual PSUs, one on each of the A/B buses.
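A minimal sketch of the difference this makes: a server with dual PSUs, one on each bus, survives the loss of an entire bus (for example an ATS fire on one side), while a server fed from a single bus does not. The component names and structure below are assumptions for illustration only:

```python
# Illustrative comparison of a single power bus versus the A/B bus layout
# described above. Names and structure are assumptions.

from dataclasses import dataclass

@dataclass
class Bus:
    name: str
    healthy: bool = True   # goes False if any upstream device (ATS, UPS, ...) fails

@dataclass
class Server:
    name: str
    feeds: list            # the buses its power supplies are plugged into

    def has_power(self) -> bool:
        return any(bus.healthy for bus in self.feeds)

bus_a = Bus("A-side")
bus_b = Bus("B-side")

single_fed = Server("legacy-router", feeds=[bus_a])
dual_fed = Server("core-router", feeds=[bus_a, bus_b])

# Simulate the A-side ATS failing (e.g. the fire described in the timeline).
bus_a.healthy = False

print(single_fed.name, "has power:", single_fed.has_power())  # False
print(dual_fed.name, "has power:", dual_fed.has_power())      # True
```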

ATSes are very reliable. If well maintained, the mean time between failures is in the range of 400,000 to 1,000,000 hours (45 to 114 years).
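As a quick check of those numbers, and a reminder that a long MTBF is not the same as "never fails", here is the conversion plus the annual failure probability it implies, assuming a constant failure rate:

```python
# Quick check of the MTBF figures above, plus the annual failure
# probability they imply under a simple constant-failure-rate assumption.
import math

HOURS_PER_YEAR = 8766  # average year, including leap years

for mtbf_hours in (400_000, 1_000_000):
    years = mtbf_hours / HOURS_PER_YEAR
    # With a constant failure rate, P(failure within a year) = 1 - exp(-t / MTBF)
    p_fail_year = 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)
    print(f"MTBF {mtbf_hours:>9,} h = {years:5.1f} years, "
          f"~{p_fail_year:.1%} chance of failing in any given year")
```

So even at the high end of that range, a single ATS has roughly a one percent chance of failing in any given year, which is exactly why a second, independent bus matters.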

So Delta Airlines or the owner of the datacenter likely decided, because of costs, to use a single ATS. An ATS for an 800 kW datacenter costs around $250,000. Add installation costs and the total would still be less than $500,000, far less than the costs Delta Airlines now faces because of the datacenter outage.

There was a misconfiguration of about 300 components as well, contributing to the roughly 8-hour disruption. These did not receive power once the power feed into the datacenter was restored. It is not clear what the faulty configuration was. Maybe the components were connected to a power feed which was not delivering power. Maybe fuses in the power distribution units failed because of a high load. We do not know the details, and I guess they will never be made public.

Someone on Reddit stated:

It's my understanding that some of the core routers in ATL had both power supplies on the same UPS. Also, other routers were not at the correct patch level and failed to pick up.

Costs

While at the time of writing it is not clear what the costs of this outage are, the recent Southwest Airlines outage gives an indication. Linkis.com reported here:

Three weeks ago, Southwest canceled 2,300 flights following a “system outage” that lasted about an hour. The outage resulted in at least $54 million in lost revenue, according to figures that the company released on Wednesday. That figure could rise to as much as $82 million when all costs are tallied.

The revenue hit from refunded tickets, missed bookings, canceled flights, vouchers and more came out to at least $25.7 million. Meanwhile, increased costs from “staff overtime, transportation, hotel and meal accommodations for stranded travelers and crew, and other expenses” could result in charges between $28 million and $57 million.
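For what it's worth, the quoted figures add up: the roughly $54 million floor and $82 million ceiling are the revenue hit plus the low and high ends of the extra-cost range.

```python
# Sanity check of the quoted Southwest figures: the total range is roughly
# the revenue hit plus the low/high ends of the extra-cost estimate.
revenue_hit = 25.7          # $ million: refunds, missed bookings, vouchers, ...
extra_costs = (28.0, 57.0)  # $ million: overtime, hotels, meals, ...

low = revenue_hit + extra_costs[0]   # ~53.7 -> reported "at least $54 million"
high = revenue_hit + extra_costs[1]  # ~82.7 -> reported "as much as $82 million"
print(f"estimated total: ${low:.1f}M to ${high:.1f}M")
```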
