Business Continuity Case Study: Data Centre Incident


It’s not often that someone will candidly share the details of a real incident. This rare and detailed account of how a data centre incident unfolded underlines the critical need for close collaboration between Facilities and IT Infrastructure Management during data centre build and operational PPM (planned preventative maintenance) activities. We hope to publish the “post mortem” soon.


This report is provided by Steve Dance of RiskCentric, a business continuity plan consultancy and solutions provider.


It was 3 a.m. and my smartphone gave me an alert. I had been getting alarms ten times a night since we installed the new data center infrastructure management (DCIM) system, but none had proven serious. This time, the temperature in our main data center was within the American Society of Heating, Refrigerating and Air-Conditioning Engineers’ (ASHRAE) allowable temperature range, but over the company’s operating limit, and rising.

I called security. They were getting the same alarm, but no one was available to check it out. After awakening a facilities manager who said he would get someone there, I got dressed and headed to our building.

Powerless and under pressure

An hour later, I walked into a data center that felt like the Sahara. Lights were flashing everywhere, server fans were at full speed and all but two of our 10 air conditioners were dead. Some servers were already shutting themselves down. Suddenly, the disaster recovery policies I thought we had put in place were beginning to crumble.

The DCIM display was confusing, and the graphical user interface made little sense once I got past the first menu. A table of numbers showed that the temperature had been climbing for several hours. Why hadn’t I received an alert earlier? I found an electrical diagram that looked like hieroglyphics, but I could tell it was for our UPS systems. I knew where to find the panels for our server cabinets, but had no idea about the mechanical controls. There were some electrical panels on the walls, but the labels made no sense. “LBTA-3” could have been anything, and the panel doors were locked anyway.
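In hindsight, the problem was not the static alarm threshold but the missing trend alert: the temperature had been climbing for hours before it crossed the operating limit. The sketch below is purely illustrative (the thresholds, reading format and function names are assumptions, not the actual DCIM configuration); it shows how a rule that also watches the rate of rise reports a steady climb long before the static limit is breached.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    minutes: float  # minutes since the start of monitoring
    temp_c: float   # room temperature in degrees Celsius

# Illustrative values only; real limits come from the site's operating
# policy and the ASHRAE envelope for the installed equipment class.
OPERATING_LIMIT_C = 27.0        # company operating limit (static threshold)
RATE_LIMIT_C_PER_HOUR = 1.5     # sustained rise that suggests failing cooling

def evaluate(readings: list[Reading]) -> list[str]:
    """Return alert messages for the most recent reading in the series."""
    alerts = []
    latest = readings[-1]

    # Static threshold: only fires once the limit has already been breached.
    if latest.temp_c > OPERATING_LIMIT_C:
        alerts.append(f"Over operating limit: {latest.temp_c:.1f} C")

    # Trend check: compare against a reading from at least an hour ago,
    # so a steady climb is reported before the limit is reached.
    older = [r for r in readings if latest.minutes - r.minutes >= 60]
    if older:
        span = latest.minutes - older[-1].minutes
        rise_per_hour = (latest.temp_c - older[-1].temp_c) * 60 / span
        if rise_per_hour >= RATE_LIMIT_C_PER_HOUR:
            alerts.append(f"Rising {rise_per_hour:.1f} C per hour; cooling may be failing")

    return alerts

# Example: a slow, steady climb trips the trend alert while the room
# is still below the static operating limit.
history = [Reading(m, 22.0 + 0.03 * m) for m in range(0, 121, 10)]
print(evaluate(history))  # ['Rising 1.8 C per hour; cooling may be failing']
```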

Once the facilities worker arrived, he confirmed what I already knew: There was no power to most of our cooling units. He checked the breakers he could locate and found nothing wrong, but we couldn’t go any further without an electrician. This required another call to the facilities manager, and another wait for the electrician to arrive.

One by one, I shut down servers to avoid catastrophic crashes. Soon the electrician arrived, and he knew where the electrical panels were: in a room behind locked doors that we weren’t able to access without his special key. He opened the door, and it was cool inside. This was also the UPS room, and its dedicated air conditioner was running. A single air conditioner meant that our redundant UPS systems depended on non-redundant cooling.

Things heat up

Once the electrician reset the tripped main breaker, the air conditioners started coming back to life, but not for long. Flames crept through the small cracks around the panels of the electrical box. Our aspirating smoke detection system was supposed to alert us before anything got serious, so we could intervene before the main fire protection system dumped gas. It had quickly picked up the smoke drifting into the data center, and ear-splitting alarms were going off, but instead of an early warning, the main system was already starting its countdown to gas release. There was no fire in the data center itself, so I hit the override button, but that only restarted the countdown. Firemen appeared at the door. It was the air conditioner power that had the problem, not the UPS or server power, but they immediately reached for the big red emergency power off (EPO) button. I yelled, but they hit it anyway. A few seconds later the gas dumped. The electrician headed for the basement to cut main power to the room, and the firemen poured foam on the burning electrical box.

A cold reception at the DR site

When our overseas offices called me wondering why they couldn’t access the office phones, I assured them that, based on our disaster recovery policies, they would be routed to our DR site. However, although we had contracted the site, we hadn’t actually done a transfer of operations, meaning we hadn’t moved our IT infrastructure, either physically or virtually, to the DR site. When I called the DR provider to declare an emergency, they informed me that the site wasn’t maintained hot and ready to go. We had been doing daily data backups to the DR data center, but it was going to take time to get our user operations transferred. And we were going to need our own staff there to do it.
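The painful distinction here is between replicating data and being ready to run from it. As a hypothetical illustration (the policy values, record names and function below are assumptions, not anything from the actual DR contract), a scheduled readiness check along these lines would have flagged, months earlier, that daily backups were arriving but a transfer of operations had never been exercised.

```python
from datetime import datetime, timedelta

# Illustrative policy values; a real plan would derive these from the
# recovery point / recovery time objectives agreed with the business.
MAX_BACKUP_AGE = timedelta(hours=24)           # how stale replicated data may be
MAX_TIME_SINCE_EXERCISE = timedelta(days=90)   # how long since a failover rehearsal

def dr_readiness(last_backup: datetime,
                 last_failover_exercise: datetime | None,
                 now: datetime) -> list[str]:
    """Return readiness warnings; an empty list means no issues were found."""
    warnings = []

    if now - last_backup > MAX_BACKUP_AGE:
        warnings.append("Replicated backup is older than the recovery point objective")

    if last_failover_exercise is None:
        warnings.append("A transfer of operations has never been exercised")
    elif now - last_failover_exercise > MAX_TIME_SINCE_EXERCISE:
        warnings.append("Failover rehearsal is overdue")

    return warnings

# Example: daily backups are arriving on schedule, but operations have
# never actually been moved to the DR site.
print(dr_readiness(last_backup=datetime(2024, 5, 1, 2, 0),
                   last_failover_exercise=None,
                   now=datetime(2024, 5, 1, 9, 0)))
# ['A transfer of operations has never been exercised']
```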

In the electrical room, the fire was out, the power was shut down and we were working under emergency lighting. As the electrician removed panels from the switchboard, he discovered the bus had exploded and taken out the second bus, too. I knew my only option was to get our IT services back in business at the DR site, and that our disaster recovery plan needed a thorough reevaluation.