What went wrong in the British Airways datacenter in May 2017?

On Saturday May 27, 2017 all flight operations of British Airways at both London Heathrow and Gatwick airports halted because of an IT problem, ultimately stranding 75,000 passengers at 170 airports across 70 countries.

This resulted in a lot of attention from the British press. The Times reported on June 2 that a contractor switched off the UPS by mistake. However, the contracting company denied this in the Guardian.

However, on June 5 Willie Walsh, boss of IAG (parent company of BA), said that an electrical engineer disconnected the uninterruptible power supply, which shut down BA’s data centre (BBC). IAG has commissioned an “independent company to conduct a full investigation” into the IT crash and is “happy to disclose details” of its findings, Mr Walsh said.

At least 1,000 flights were cancelled at Heathrow and Gatwick over the weekend.

BA’s operations were returning to normal on Monday and it said it would fly more than 95 percent of its normal flight schedule.

About £180 million was knocked off IAG’s value on Monday. At one point shares in the company fell by more than 4 per cent before recovering towards the end of the day.

Compensation claims alone could cost BA over £100 million.

On June 3, the Guardian wrote a summary of what happened.

On June 15, IAG told investors that the power outage is going to cost British Airways £80m. IAG did not specify whether this covered only passenger compensation (The Register).

Issues with redundant power feeds in datacenters are not that uncommon. A few examples are provided in this article.

Dutch trains were not able to operate in 2015 when a ProRail datacenter was without power despite redundant power feeds. Important IT equipment like storage and servers was damaged because of a power surge, possibly because of a UPS bypass which does not filter spikes. However, this is guessing, as it is uncommon for organizations to make the root cause of IT disasters public.

Everything written in this post was taken from public sources and based on my knowledge as an IT consultant.

TL;DR

Combining all the different sources shown later in this post, a fairly accurate reconstruction of what went wrong can be provided.

The issue BA suffered is very similar to the total failure Delta Airlines experienced in August 2016, as described here.

There have been many more complete power failures in datacenters, some running mission-critical applications. This post has an overview. In most cases the IT equipment is not damaged. BA was very unlucky and probably made wrong decisions about power in the past.

On Saturday May 27, shortly after 09:30, there was a failure in the uninterruptible power supply (UPS) of one of the two BA datacenters in London, called ‘Boadicea House’. Most likely the UPS was not a device equipped with lots of batteries that supplies power until the emergency diesel-powered generator takes over.

In the BA datacenter, probably a DRUPS was used. Instead of batteries, a flywheel provides power between the failure of the mains and the start of the diesel generator.

The Guardian reported on June 1:

A contractor with knowledge of Boadicea House said: “It’s a very old facility, there are lots and lots of problems with it. We weren’t particularly surprised, knowing the set-up there.” He added that a number of senior managers at the data centre have retired or left in the past three years.

BA could be using rotary diesel UPS systems (aka dynamic UPS or DRUPS), as this website of a fire suppression vendor states. This is crucial in understanding what went wrong.

The rotary UPS could be from the Dutch vendor Hitec. Hitec shows on its website that British Airways is a customer.

International Business Times states in an article dated June 2, 2017 that a Hitec UPS is indeed used by BA. However, we are not sure whether this has been confirmed by BA.

 

Contrary to a static UPS, which uses batteries, a dynamic UPS stores energy as kinetic energy. This is guessing, but the Dutch company Hitec (formerly Holec) could be the supplier.

Below this post you will find a very detailed comment about what very likely went wrong. Recommended reading.

In normal operation, power is fed through the green line. For maintenance and redundancy, the auto-bypass can be used. This black circuit feeds IT equipment directly from utility power.

What likely happened is a failure in a component of the green line. The flywheel can typically supply only 8 to 10 seconds of backup power. It is unknown whether the UPS also had batteries for temporary power.
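To put that 8 to 10 second figure in perspective, here is a back-of-the-envelope sketch in Python with assumed numbers (the load and the diesel start time are illustrative guesses, not BA’s actual specifications). It shows how much energy the flywheel must deliver and how little margin there is if the diesel engine is slow to start.

```python
# Rough, illustrative numbers only - not BA's actual installation.
load_kw = 750           # assumed critical IT load on the DRUPS (kW)
holdover_s = 10         # flywheel ride-through time typically quoted by DRUPS vendors (s)
diesel_start_s = 12     # assumed time for the diesel engine to take over the load (s)

energy_kj = load_kw * holdover_s   # kinetic energy the flywheel must deliver
print(f"Energy needed for ride-through: {energy_kj:.0f} kJ (~{energy_kj / 3600:.2f} kWh)")

gap_s = diesel_start_s - holdover_s
if gap_s > 0:
    print(f"Diesel is {gap_s} s too slow: the IT load sees a power interruption.")
else:
    print("Diesel picks up the load before the flywheel runs out.")
```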

It could be that the switchgear firmware somehow prevented the alternative power feed from becoming active. AWS’s James Hamilton describes how AWS modified the switchgear firmware used in its datacenters.

See the presentation by Hamilton on the Delta Airlines power failure.

 

The auto-bypass did not work. As a result, power failed for some minutes. Someone in the datacenter had to switch the bypass manually. Or some other UPS kicked in. Or the diesel engine kicked in, but too late because the flywheel no longer provided power.

For unknown reasons this resulted in a power surge which damaged the power supplies of IT equipment.

This book provides some additional interesting background on the UPS in BA datacenters (although it may be outdated).

 

Photo of Boadicea House by Google Streetview

Minutes later, power was resumed in what one source described as an “uncontrolled fashion.” Instead of a gradual restore, all power was restored at once, resulting in a power surge. BA CEO Cruz told BBC Radio that this power surge caused network hardware to fail. Server hardware was also damaged by the power surge.

It seems as if the UPS was the single point of failure for the power feed of the IT equipment in Boadicea House. The Times reported that the same UPS was powering both Heathrow-based datacenters, which would be a double single point of failure if true (I doubt it is).

 

The broken network stopped the exchange of messages between different BA systems and applications. Without messaging, there is no exchange of information between the various applications. BA is using Progress Software’s Sonic ESB as its enterprise service bus (the “highway” for the exchange of messages).

The enterprise service bus is the heart of a complex enterprise IT architecture. One of its functions is to allow messages to be exchanged between various systems, for example passing a booking made on the BA.com website to the system which stores all details of the booking.
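To illustrate the role of an ESB, here is a generic, simplified sketch (not BA’s actual Sonic ESB code or configuration): a booking made on the website is published as a message, and downstream systems such as reservations and baggage each receive a copy. When the bus is broken, every consumer stops receiving updates at the same time.

```python
import queue

# Hypothetical, minimal stand-in for an enterprise service bus:
# one publisher, several subscribing systems, each with its own queue.
class TinyBus:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, name):
        q = queue.Queue()
        self.subscribers.append((name, q))
        return q

    def publish(self, message):
        # Every subscriber gets its own copy of the message.
        for _, q in self.subscribers:
            q.put(message)

bus = TinyBus()
reservations = bus.subscribe("reservations")
baggage = bus.subscribe("baggage")

# A booking made on the website becomes a message on the bus.
bus.publish({"type": "booking.created", "pnr": "ABC123", "flight": "BA117"})

print(reservations.get())   # the reservation system stores the booking
print(baggage.get())        # the baggage system prepares bag tags
```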

A power surge can occur for several reasons. For example, high-power electrical devices can create a spike in the electrical current when they’re switched on or when their motors kick on. Refrigerators, air conditioners and even space heaters can cause a power surge strong enough to damage electrical systems.

Power surges in datacenters can have various causes, for example human error: an engineer makes a mistake and puts 380V on a 220V power group. A similar failure happened in a Dutch datacenter in 2015. This datacenter was used by ProRail, which is responsible for train traffic control. The computer responsible for a railway junction failed because of earlier maintenance on the power (source).

The BA.com website was reported to present 404 errors. This indicates there were more issues than just the enterprise service bus. Without the ESB operational, the website should look normal until a ticket is booked.

So it is very well possible that there was a power fault in the datacenter hitting all systems.

Press coverage of this failure has been of poor quality. Newspapers quote so-called experts saying UPS issues are very rare. Well, read the blog of AWS’s James Hamilton.

 

Questions which need answers

There are many questions that need to be answered.

For example, in a normal datacenter configuration, each component (servers, storage, networking) has two independent power feeds. One feed is directly from the national grid, the other is from a UPS. The UPS is backed by a diesel generator.

It seems these two circuits were somehow connected, resulting in a failure of both power feeds.
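A simple probability sketch (with assumed failure rates, purely for illustration) shows why truly independent feeds matter and why a shared component between the two circuits undoes the redundancy.

```python
# Illustrative availability math with assumed failure probabilities,
# not measured figures for any real datacenter.
p_feed_fails = 0.001   # assumed chance a single power feed fails in a given period

# Two truly independent feeds: both must fail before the equipment loses power.
p_both_independent_fail = p_feed_fails ** 2

# Two feeds that share a common component (for example one UPS feeding both):
# a single failure of that component takes out both feeds at once.
p_shared_component_fails = p_feed_fails

print(f"Both independent feeds fail: {p_both_independent_fail:.6%}")
print(f"Shared component fails:      {p_shared_component_fails:.4%}")
```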

The other question is about the failover to the secondary datacenter. This most likely did not happen. Why not, as there surely was a good reason to fail over? Was it because IT staff did not dare to fail over? Was it because it was never tested? Was it because a failover would have meant working with old data?

Or was it because the other datacenter did not have enough spare capacity available? When two datacenters are used, it is quite common for resources in the secondary datacenter to be used for production purposes as well. Normally, resource consumption should never exceed 50% of the combined capacity of both datacenters.

If true, this is simply bad resource management by the BA IT department.
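The 50% rule mentioned above can be illustrated with a simple headroom check (the capacity numbers are made up): if the surviving datacenter has to carry the full production load, each site must normally run at no more than half of the combined capacity.

```python
# Hypothetical capacity figures for a two-datacenter setup.
capacity_per_site = 1000          # capacity units per datacenter
total_capacity = 2 * capacity_per_site

for current_load in (900, 1200):
    survives_failover = current_load <= capacity_per_site
    print(f"Load {current_load}/{total_capacity}: "
          f"{'survives' if survives_failover else 'does NOT survive'} the loss of one site")
```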

Introduction

Let’s get into the details of this mega IT failure.

Experts claim the damage will be very high. The Sun reports:

The airline’s check-in and operational systems crash yesterday – which saw at least 200,000 passengers trying to travel on Bank Holiday weekend left stranded – is set to cost the company £300million in compensation.

BA CEO Cruz said on Saturday in a video message posted on YouTube that a power supply issue caused the global IT failure (Reuters).

“We believe the root cause was a power supply issue and we have no evidence of any cyber attack,” Alex Cruz, Chairman and CEO of British Airways, said in a video message on Twitter.

A power supply issue sounds weird. What could be the case is a power failure at one of BA’s datacenters resulting in systems going down. That alone is a very unlikely event: when grid power is not available, a UPS supplies power until the diesel generators take over. However, things can go badly wrong, as Delta Airlines experienced in 2016.

On Saturday BA did not reveal more than “power supply”. No inside information was posted on internet forums either.

BA has six datacenters in the UK, two of them in London. 500 data cabinets spread across six halls are located at two sites near BA’s Waterside HQ at Heathrow. This site provides some more info on the cooling.

BA certainly has a datacenter in Boadicea House (Boadicea stands for British Overseas Airways Digital Information Computer for Electronic Automation). The other confirmed location is Cranebank, near London Heathrow. Statements in the media, like those by the Daily Mail, about a datacenter in Comet House are incorrect: Comet House was demolished in the late 90s!


Both BA datacenters are located less than 1 km from each other.

Cranebank is also the location for BA Flight Training.

British Airways new computer data centre at Heathrow (Cranebank)

While BA did not explain which type of UPS failed, it seems UPS systems from Socomec were installed in the Boadicea House datacenter. Taken from here:

The Socomec UK team has just completed the final phase installation of a custom designed SMART PowerPort providing a flexible site power extension for British Airways at Boadicea House Computing Centre supporting the airline’s accounting, engineering, crew management and personnel functions.

On Monday May 29 British Airways made a new statement (The Sun):

British Airways said a power surge that collapsed its IT systems, leading to travel chaos for thousands of passengers over the weekend, was so strong that it also knocked out its back-up systems, making them ineffective.

Also on Monday, CEO Cruz told the BBC there were local power problems at Heathrow which only lasted a few minutes.

A spokeswoman for British Airways elaborated.

“It was a power supply issue at one of our U.K. data centres. An exceptional power surge caused physical damage to our infrastructure and as a result many of our hugely complex operational IT systems failed,” she said.

On Monday afternoon Cruz gave some more details (The Guardian):

The worldwide IT meltdown was caused by a power surge at about 9.30am on Saturday that had a “catastrophic effect” on the airline’s communication hardware, “which eventually affected all the messaging across our systems,” Cruz said.

He rebuffed a claim from the GMB that the situation had been worsened by the outsourcing of IT jobs to India. “I can confirm that all the parties involved around this particular event have not been involved in any type of outsourcing in any foreign country. They have all been local issues around a local data centre who has been managed and fixed by local resources,” he said.

Cruz talked to the BBC transport correspondent on BBC radio here. He reveals some more details about what went wrong around 12 minutes into the programme:

“on Saturday morning we did have a power surge in one of our datacenters which affected the networking hardware which stopped messaging. Millions and millions of messages that come between different systems and applications within the BA network and affected all the operational systems, baggage system, passenger processing.”

Cruz then said there is a backup system, but for unknown reasons it did not kick in.

Cruz told Sky News the same.

 

In the same interview Cruz denied that outsourcing IT to India was part of the problem. The issue occurred in a local datacenter and was resolved by local staff.

The Daily Mail, a not so reliable newspaper in the UK, reported on Tuesday May 30:

 

Energy firm Scottish and Southern Electricity Networks (SSE), which supplies power to the company’s headquarters in Harmondsworth, said there was no recorded power surge on its side of the meter.

UK Power Networks, which supplies energy to Heathrow, also said it had seen no electrical issues.

On May 30, three days after the failure, BA stated:

“Our IT systems are now back up and running and we will be operating a full flight schedule at Heathrow and Gatwick on Tuesday 30 May.”

On Tuesday May 30, the Daily Telegraph revealed some more details on the root cause:

For the first time, The Telegraph can reveal exactly what caused the airline’s operational and booking systems to collapse on Saturday.

It is understood that the investigation, being led by Mr Cruz himself with the help of external power supply specialists, is focusing on the uninterruptible power supply (UPS) to Boadicea House, one of two data centres in the environs of Heathrow airport.

 In BA’s case, the UPS in question delivers power through the mains, diesel and batteries.

On Saturday morning, shortly after 8.30am, power to Boadicea House through its UPS was shut down – the reasons for which are not yet known.

Under normal circumstances, power would have been returned to the servers in Boadicea House slowly, allowing the airline’s other Heathrow data centre, at Comet House, to take up some of the slack.

But, on Saturday morning, just minutes after the UPS went down, power was resumed in what one source described as “uncontrolled fashion.” “It should have been gradual,” the source went on.

This caused “catastrophic physical damage” to BA’s servers, which contain everything from customer and crew information to operational details and flight paths. No data is however understood to have been lost or compromised as a result of the incident.

BA’s technology team spent the weekend rebuilding the servers, allowing the airline to return to normal operations as of today.

Sources close to the airline indicated that had the power been restored more gradually, BA would have been able to cope with the outage, and return services far more quickly than was the case.

The Times reported:

The uninterruptible power supply is supposed to use multiple power sources, including a stand-by generator, to maintain a constant supply to BA’s two main IT centres near Heathrow.

Computer Weekly reported on June 9 that human error is not the main cause. BA probably did not spend enough money on datacenter resiliency.

IT issues in the past

The issue is reminiscent of an earlier IT issue at British Airways. On April 11, 2017 the BA website was down for around eight and a half hours, leaving customers unable to check in online.

According to British Airways it was “caused by a technical fault, believed to be related to an IT system upgrade to some databases we carried out tonight” (source: Twitter).

On September 6, 2016 there was another serious issue with BA’s IT systems. It took several hours to restore check-in (source).

 

Power feed in datacenters

The image below shows a well-designed, redundant power feed for a datacenter.

Two different utility power feeds, as shown at the top of the image (10kV 1MW).

Two ATS switches. Two diesel generators.

The IT equipment has two different power feeds. In this situation the two feeds are powered by two separate UPS systems.

If one UPS fails, the other will continue to provide power. As the IT equipment has two power supplies, the feed from the UPS that is still operational takes over.

In the case of British Airways, it is likely that all power supplies shared the same single UPS feed.
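A minimal simulation of the dual-feed idea (purely illustrative, not based on BA’s actual topology): dual-corded equipment stays up as long as at least one of its feeds is alive, while a design in which both power supplies hang off the same UPS loses everything when that UPS fails.

```python
# Illustrative model of dual-corded IT equipment, not BA's real power topology.
def equipment_up(feeds_alive):
    """A dual-corded server stays up if at least one of its power feeds is live."""
    return any(feeds_alive)

# Scenario 1: two independent UPS systems, UPS 2 has failed.
redundant_design = {"feed A (UPS 1)": True, "feed B (UPS 2)": False}

# Scenario 2: both cords fed by the same UPS, which has failed.
single_ups_design = {"feed A (UPS 1)": False, "feed B (UPS 1)": False}

print("Redundant design up: ", equipment_up(redundant_design.values()))
print("Single-UPS design up:", equipment_up(single_ups_design.values()))
```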

 

Rotary diesel UPS systems

Inside the BA datacenter most likely a rotary diesel UPS system was installed.

Taken from here:

Rotary diesel UPS systems, also referred to as dynamic UPS or no-break systems, provide uninterrupted, continuous and conditioned power supply. They protect your critical processes against power failures and actively filter out any impurities from the power supplied by the network (power surges or dips, harmonics, transients, brownouts, etc.)

Unlike static UPS systems with batteries, a dynamic UPS consists of (1) a diesel engine, (2) a freewheel, (3) inductive coupling and (4) a generator, where the entire unit is mounted on a frame.

Under normal conditions – i.e. when there is power from the network – the dynamic UPS acts as an active filter against impurities from the network. If the normal power supply fails or is disturbed, the dynamic UPS takes over the task of supplying power – with no interruption, without you noticing any difference, for varying periods of time, and for as long as there is enough fuel.
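The operating modes described above can be sketched as a small state function (hypothetical code, only a rough model of how a DRUPS behaves, not of BA’s actual equipment): with mains present the unit conditions the power, on a mains failure the flywheel carries the load, and once the diesel engine is running the generator takes over.

```python
from enum import Enum

# Rough, hypothetical model of DRUPS operating modes.
class Source(Enum):
    CONDITIONING = "mains present: DRUPS filters surges and dips, flywheel kept spinning"
    FLYWHEEL = "mains lost: the flywheel's kinetic energy feeds the load"
    DIESEL = "diesel engine clutched in: the generator feeds the load"

def source(mains_ok: bool, diesel_running: bool) -> Source:
    """Which part of the DRUPS carries the load in a given situation."""
    if mains_ok:
        return Source.CONDITIONING
    return Source.DIESEL if diesel_running else Source.FLYWHEEL

# A mains failure followed by a successful diesel start:
for mains_ok, diesel_running in [(True, False), (False, False), (False, True)]:
    print(source(mains_ok, diesel_running).value)
```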

The image below was taken from this Schneider document.

What systems went down?

Some names of systems in use by BA are: SIP / CAP / CAP2 / Flight Information and Control of Operations (FICO) / FLY

What we know is:

  • British Airways outsourced its IT to a couple of IT companies in India; Tata Consultancy Services is one of them. In 2016 BA made hundreds of expert IT staff in the UK redundant and outsourced the work to India.
  • The problem affected multiple parts of the business which are not only customer-facing but also operational-facing, and without which the airline could not do many tasks, for example completing load sheets [which are needed for fuel calculations] for aircraft.
  • Systems of Aer Lingus, Iberia and Vueling, all owned by IAG just like BA, were operating normally.
  • The issue started on Saturday, the start of a Bank Holiday weekend in the UK.
  • The BA website, when accessed from Europe, presented a 404 error. This lasted a couple of hours.
  • The BA website, when accessed from the US, behaved normally until a flight was booked.
  • “The systems that are used to send emails and texts to individual customers about their flights have also been affected by the IT problems, so BA hasn’t been able to communicate with customers in our usual ways.”
  • BA contact centers were affected by the IT failure.
  • The BA Twitter account was the only way for BA to communicate with its customers.
  • The BA media website was not functioning.
  • BA chief executive Alex Cruz said the ‘power supply issue’ had affected all check-in and operational systems.
  • The baggage system had broken down as well.
  • Online check-in machines failed to work, with the screen saying it was ‘temporarily out of service’.
  • One of the things reported on PM on Radio 4 was that, early on, people scanning boarding passes were getting incorrect destinations on the screen. They reported that someone flying to Sweden got three different incorrect destinations when their card was scanned.
  • It was also reported that, at least initially, BA’s phones weren’t working at Heathrow.
  • “there has been no corruption or any compromise of any customer data.”

The problem seems to have started on the morning of May 27. This is the first tweet by Heathrow Airport mentioning the issue; it was posted at 12:48 PM.

Issues on Friday?

Someone posted a comment on the Register stating:

That is NOT the clusterf*ck they experienced though because their messaging and transaction was half-knackered on Friday. My boarding pass emails were delayed 8 hours, check-in email announcement by 10 hours. So while it most likely was the messaging component, it was not knackered by a surge, it was half-dead 24h before that and the “surge” was probably someone hired on too little money (or someone hired on too much money giving the idiotic order) trying to reset it via power-cycle on Sat.

This indicates the Progress Enterprise Service Bus used by BA already had issues on Friday.

Possible causes

In this case the downtime of the IT systems was a combination of events: first of all the root cause, then it seems the redundancy did not work, nor did the contingency. BA staff had to use Twitter and whiteboards to communicate. Even megaphones were not available at the airports.

Possible root causes for the IT failure are:

  • major issue with diesel generator
  • cyber attack
  • power failure outside control of British Airways (national power grid)
  • power failure at one of the BA datacenters
  • failed software update to a major component
  • a security update affecting all servers in the datacenter instead of a few

An issue with the backup power feed seems the most likely cause. For some reason the main power feed at the primary datacenter failed and the UPS took over. However, the diesel generators for some reason failed to supply power. It could be that there was not enough diesel in the tanks while the gauge indicated there was (faulty indicator). It could be the diesel was old and the filters were dirty.

It could be that the air conditioning system of the datacenter was not fed by the UPS and diesel generator, and that because of this, systems overheated. This seems a bit unlikely as many different systems all failed at the same time.

Likely a failover to the secondary datacenter either failed or engineers did not dare to fail over.

At this moment this is all guessing.

A cyber attack seems very unlikely. BA denied this being the cause.

There have been no reports of a grid power failure in the Heathrow area.

Reuters reported on May 30:

Scottish and Southern Electricity Networks (SEN), which manage the electricity distribution network in the area north of Heathrow where British Airways’ headquarters are located, said its services were running as normal on Saturday morning.

“The power surge that BA are referring to could have taken place at the customer side of the meter. SEN wouldn’t have visibility of that,” a spokesman said.

A power failure inside the datacenter which had a massive impact is possible, despite redundancy. See the Delta Airlines situation in August 2016.

A failed update to a major component seems unlikely as well. First, a Bank Holiday weekend is not a good moment to roll out a major update which could potentially cause serious issues. Secondly, such an update is unlikely to cause many different systems to go down.

A security update affecting all systems seems unlikely. The report that “people scanning boarding passes were getting incorrect destinations on the screen” is a bit weird. Logically you would expect software to present a consistent result or an error.

Someone at the Register posted this comment. The comment was initially posted on a forum of the Times. The original article is behind a paywall.

From the IT rumour mill

Allegedly, the staff at the Indian data centre were told to apply some security fixes to the computers in the data centre. The BA IT systems have two, parallel systems to cope with updates. What was supposed to happen was that they apply the fixes to the computers of the secondary system, and when all is working, apply to the computers of the primary system. In this way, the programs all keep running without any interruption.

What they actually did was apply the patches to _all_ the computers. Then they shutdown and restarted the entire data centre. Unfortunately, computers in these data centres are used to being up and running for lengthy periods of time. That means, when you restart them, components like memory chips and network cards fail. Compounding this, if you start all the systems at once, the power drain is immense and you may end up with not enough power going to the computers – this can also cause components to fail. It takes quite a long time to identify all the hardware that failed and replace it.
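The procedure described in the rumour (patch the secondary system first, verify, only then touch the primary) can be sketched as a simple rolling update. The host names and checks below are made up for illustration; this is not BA’s actual tooling.

```python
# Hypothetical rolling-update sketch; host names and checks are invented.
def apply_security_fixes(host):
    print(f"applying security fixes to {host}")

def health_check(host):
    """Placeholder for a real post-patch verification (host up, application responding)."""
    return True

def rolling_update(secondary_pool, primary_pool):
    # Patch the secondary pool first, one host at a time, and stop on any failure,
    # so the primary system keeps running untouched until the secondary is proven good.
    for pool_name, pool in (("secondary", secondary_pool), ("primary", primary_pool)):
        for host in pool:
            apply_security_fixes(host)
            if not health_check(host):
                raise RuntimeError(f"{host} unhealthy after patching - halting rollout")
        print(f"{pool_name} pool patched and verified")

rolling_update(["app-2a", "app-2b"], ["app-1a", "app-1b"])
```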

Press coverage

There has been a lot of press coverage of this failure. However, the media did not report any useful details. They spread doubt by stating that the mains power did not suffer a failure. They quoted experts who were not experts on the UPS systems used in datacenters. They suggested outsourcing was the cause of the failure.

Hardly any newspaper mentioned the IT failure of Delta Airlines in August 2016, which was caused by UPS issues.

The Financial Times, for example, quoted some experts who state that “it is extremely rare for UPS systems to fail”.

However, with one minute of Google searching, you find this article titled “UPS failures continue to be the top cause of data center downtime”. The article provides several recent examples of UPS failures.
