Real life disasters or how a small fire and aerosol can bring down a hospital IT-room
November 1, 2012
Today I attended an interesting presentation by someone working for a large hospital. He described a small fire in the hospital's IT-room and the consequences of using aerosol as the fire extinguishing agent.
The hospital had one IT-room onsite and a recovery location about 10 kilometers away. Data was replicated to this recovery location.
One night the fire alarm went off in the IT-room. The cause is still unknown, but it was most likely a short circuit in the power supply of a server. At that time the hospital was mainly using physical servers. Some smoke came into the room and was quickly circulated by the air conditioning, and soon after the aerosol-based fire extinguishing system started to do its job. The aerosol is a fire extinguishing agent consisting of very fine solid particles and gaseous matter.
The substances used in fire suppression are gas, water and aerosol. Check this site for more info.
Obviously water is not suited for IT-rooms and datacenters. See this example of a datacenter run by Shaw Communications, where an explosion and fire knocked out both the primary and backup systems. Both were located in the same room and protected by a sprinkler system.
Gas needs piping and is expensive. Aerosol is relatively cheap and does not use piping. However, gas does not need cleanup while aerosol does!
While this aerosol does its work well and extinguishes the fire, it has a nasty side effect on computing equipment. The agent, which is a kind of dust, is sucked into all servers by the cooling fans. Cooling fans do not like dust: they will jam and stop after a while. And servers do not like heat; they need to be cooled when running. The aerosol will also start some chemical reaction with internal parts such as the motherboard.
This all results in servers which will stop running sooner or later. Not a nice situation. Besides that, the airco was switched off to make the aerosol fire extinguishing more effective.
During the small fire the fire brigade did not allow anybody to enter the hospital. Around 2 hours after the fire was detected, the IT-room was declared safe. By that time many servers were unusable because of the damage done by the aerosol. Others were still running but were not reliable and could fail at any moment. So the hospital decided to start using VMware vSphere as soon as possible and to P2V (convert from physical to virtual) the unreliable servers to virtual machines before they failed.
So while the damage from the fire was minimal, the damage done by the aerosol was huge!
So the hospital switched over to the recovery site for most of its applications. They used servers from the test environment for production workloads, quickly ordered new servers, and shut down less critical applications so the capacity could be used for critical ones. No data was lost.
The networking equipment was located in another room, which was not affected by the fire or the aerosol damage. That was very lucky, as all client-server connections were routed over this equipment to the recovery site.
The fire broke out at around 00:00. The next morning all critical IT services were up and running and business could go on. When no IT services are available, a lot of financial damage results. The hospital estimated a daily loss of 1 million euros.
While IT was running in the recovery site, there was no redundancy anymore. All clustered applications were set up such that one node was running in the primary site and the other node in the secondary site. Imagine the recovery site failing as well. The IT-room was so badly damaged by smoke that all equipment, including storage, had to be taken out and replaced. Once the IT-room was empty it first had to be cleaned and then rebuilt. This took weeks. In the meantime a temporary datacenter was rented as a standby.
To avoid running without redundancy during a failover, the hospital decided to have two IT-rooms in the hospital and to use a third location for recovery, 10 kilometers away from the main building.
The new IT-room does not use aerosol. The insurance company even forbids the use of aerosol.
-be aware of the damage when IT-services are not available. Invest in procedures, software and hardware. This is the insurance premium.
-make sure there is a third IT-room available to cover the recovery site during recovery.
-do not use aerosol for fire extinguishing in datacenters
-frequently test your recovery procedures
-separate networking equipment from compute and storage equipment
-always have two IT-rooms onsite which are able to deliver over 100% capacity
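The redundancy lessons above can be turned into a quick sanity check. The sketch below is my own illustration, not something from the presentation: the function name, the room names and the capacity figures are all hypothetical, and capacity is expressed as a percentage of the production load. It counts how many rooms can be lost, largest first, before the remaining rooms can no longer carry the load.

```python
def tolerated_room_failures(room_capacity, required_load):
    """Count how many rooms can fail (worst case: largest first)
    while the remaining rooms still cover required_load."""
    remaining = sorted(room_capacity.values())  # ascending order
    failures = 0
    while remaining:
        remaining.pop()  # lose the largest remaining room
        if sum(remaining) < required_load:
            break  # the rooms that are left cannot carry the load
        failures += 1
    return failures


# Hypothetical figures, as a percentage of the production load.
old_setup = {"it_room": 100, "recovery_site": 100}
new_setup = {"it_room_a": 100, "it_room_b": 100, "recovery_site": 100}

print(tolerated_room_failures(old_setup, 100))
print(tolerated_room_failures(new_setup, 100))
```

Under these assumed numbers the old layout tolerates exactly one room failure, which is the situation the hospital was in after the fire: still running, but with nothing left in reserve. The three-location layout tolerates a second failure.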
The Dutch society for safety (Nederlandse Vereniging voor Veiligheidskunde, NVVK) published an article in its NVVK magazine which describes the danger of aerosol. Download the article here or here.
Some more photos showing the state of servers after the aerosol did its work