Business Continuity Case Study: Lessons learned from a data centre failure


The causes of a major data centre failure – in detail

This report, detailing the root causes of a “failed failover”, demonstrates just how important the nitty-gritty of the physical data centre infrastructure can be in ensuring a smooth failover. Failures at this level affect everything, and obtaining assurance that the assumed levels of resilience are in place and operating is critical to the reliability of business continuity and recovery arrangements.


This report is provided by Steve Dance from business continuity plan consultants and Business Continuity Management Framework provider, RiskCentric.



It took a full day and half the night to get IT operations back in business at our DR site, and that was only for the highest priority systems. With a portable air conditioner, a temporary line and a small uninterruptible power supply, we were able to restore the phones. It would take weeks to repair the damage to the main switchboard, but we also had to know what went wrong so it wouldn’t happen again.

Below are the six failure points we discovered and then noted in our disaster recovery report.

#1: Air conditioners

While extra air conditioners were available, most of them were powered from one switchboard. Only the two redundant units and the uninterruptible power supply (UPS) room unit were on a different power source — an arrangement the designer thought was logical, but one that negated the redundancy we had paid for. The trip current on the main circuit breaker hadn’t been set correctly, and the engineers and contractors had not coordinated the breakers. So when one air conditioner developed a problem, the main breaker tripped instead of the single branch breaker, and we lost 80% of the cooling. An infrared scan had been done on the switchboard, but with only some of the air conditioners running. Without a full load, the bus didn’t seriously overheat, so the loose connection that ultimately exploded wasn’t revealed in testing.

The second switchboard was in the same electrical cabinet as the first one — another decision that was made to meet the budget — so the two power buses were right next to each other. When one exploded, it destroyed the one next to it, and we lost everything.
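
The breaker coordination point generalises: the upstream main breaker needs a higher pickup current and a longer delay than any branch breaker, otherwise a single branch fault can take out the whole board, as it did here. The toy check below illustrates that principle with made-up settings; real coordination studies work from full time-current curves, so treat this purely as a sketch of the idea.

```python
# Toy selectivity check, illustrative only: for a fault on one branch, the
# branch breaker should operate before the main breaker does. Real
# coordination studies use full time-current curves; this compares a single
# assumed pickup setting and delay per breaker.
from dataclasses import dataclass


@dataclass
class Breaker:
    name: str
    pickup_amps: float    # current above which the breaker will trip
    delay_seconds: float  # time it waits at that current before opening


def branch_clears_first(branch: Breaker, main: Breaker, fault_amps: float) -> bool:
    """True if only the branch breaker should open for this fault current."""
    branch_trips = fault_amps > branch.pickup_amps
    main_races_branch = fault_amps > main.pickup_amps and main.delay_seconds <= branch.delay_seconds
    return branch_trips and not main_races_branch


if __name__ == "__main__":
    branch = Breaker("CRAC branch", pickup_amps=100, delay_seconds=0.1)     # assumed values
    badly_set_main = Breaker("Main", pickup_amps=150, delay_seconds=0.05)   # pickup and delay set too low
    # Prints False: the main breaker beats the branch breaker, so one faulty
    # air conditioner drops most of the cooling, as in the incident.
    print(branch_clears_first(branch, badly_set_main, fault_amps=400))
```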

#2: Data center design

Another item examined in our disaster recovery report was data center design. Since our generator was for the whole building, the transfer switch was in the basement, ahead of the switchboard. Because the fault was downstream of it, the switch never saw a loss of incoming power, although the destroyed switchboard would have stopped us anyway. With a shared generator, we should have had multiple automatic transfer switches, with the data center on its own switch. That way, if the power went out in the data center while the rest of the building was unaffected, the generator would start and the data center would receive emergency power.

We objected to the electrical room being accessible from the data center because we didn’t want electricians coming through our compute area. We were ignored. With the electrical room air conditioner still running, and with the data center units shut down, the electrical room was at positive pressure. When the door opened, heat and smoke from the explosion poured out.

#3: Smoke detector issues

The early warning smoke detector picked it up instantly, but it also controlled the gas fire suppression system, which wasn’t set correctly. So instead of just sounding an alert, it triggered a gas dump as soon as it sensed smoke. The smoke particles also contaminated the filters of all the equipment that was still running. The only good news was that the air conditioner in the electrical room was on the same circuit as the two redundant units, so it kept running. Without cooling, the UPS would have quickly overheated and shut down before the computer room did. The UPS should have gone into bypass and maintained street power to the computers, but testing found the bypass wasn’t wired correctly. With only one air conditioner, we were vulnerable in two ways.

#4: Prioritisation

Our UPS could do an orderly server shutdown through the network, but we never hooked that up because of other priorities. We also learned we didn’t really need that Emergency Power Off button, since we had no raised floor and weren’t using containment. The engineers specified the most dangerous button in the industry “because every data center has one,” but didn’t include any cover or protection to prevent accidental use.
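
The unwired network shutdown is worth a closer look. As a rough illustration only, the sketch below shows what a minimal UPS-to-server shutdown hook might look like: a script that polls the UPS over the network and triggers an orderly shutdown once the UPS is on battery with little runtime left. The polling interface, service names and thresholds are assumptions for the sake of the example, not details from this incident.

```python
# Illustrative sketch only: an orderly-shutdown monitor driven by UPS status.
# The UPS query is stubbed out; a real deployment would use the UPS vendor's
# SNMP interface or management API in its place.
import os
import subprocess
import time

LOW_RUNTIME_SECONDS = 300    # assumed threshold: begin shutdown with about 5 minutes of battery left
POLL_INTERVAL_SECONDS = 30


def poll_ups_status():
    """Stub for the real UPS query. Returns (on_battery, runtime_seconds).
    Reads environment variables so the sketch can be exercised without hardware."""
    on_battery = os.environ.get("UPS_ON_BATTERY", "0") == "1"
    runtime_seconds = int(os.environ.get("UPS_RUNTIME_SECONDS", "3600"))
    return on_battery, runtime_seconds


def orderly_shutdown():
    """Stop application services first, then halt the host cleanly."""
    subprocess.run(["systemctl", "stop", "app.target"], check=False)  # hypothetical service group
    subprocess.run(["shutdown", "-h", "now"], check=False)


if __name__ == "__main__":
    while True:
        on_battery, runtime_seconds = poll_ups_status()
        if on_battery and runtime_seconds < LOW_RUNTIME_SECONDS:
            orderly_shutdown()
            break
        time.sleep(POLL_INTERVAL_SECONDS)
```

Even a simple hook like this, exercised during failover tests, turns a hard power loss into a controlled shutdown rather than a crash.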


#5: DCIM alerts

The data center infrastructure management (DCIM) tool was configured to alert only on major alarms, and the alert limit was based on ASHRAE’s allowable temperature, which was higher than our data center’s actual parameters for cooling temperature. Since cooling was set at the previously recommended temperature — much colder than it should have been — the failure came well before the alert, costing valuable disaster mitigation time.
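
To make the threshold mismatch concrete, the sketch below compares an assumed supply temperature near the older recommended setpoint with an alert limit set at the ASHRAE allowable maximum. The temperatures and the rate of rise are illustrative assumptions, not figures from the incident; the point is simply how much warning time a badly chosen alert threshold gives away.

```python
# Illustrative numbers only: how an alert threshold set at the ASHRAE allowable
# limit, rather than the room's own operating limit, delays the warning.
SETPOINT_C = 20.0          # assumed supply temperature, close to the old recommended figure
OPERATING_LIMIT_C = 27.0   # assumed in-house limit (ASHRAE recommended upper bound)
ALERT_THRESHOLD_C = 32.0   # alert configured at the ASHRAE allowable upper bound
RISE_C_PER_MIN = 1.0       # assumed rate of temperature rise after losing most of the cooling


def minutes_until(threshold_c: float) -> float:
    """Minutes from the cooling failure until the room reaches threshold_c,
    assuming a constant rise from the setpoint."""
    return (threshold_c - SETPOINT_C) / RISE_C_PER_MIN


if __name__ == "__main__":
    to_limit = minutes_until(OPERATING_LIMIT_C)
    to_alert = minutes_until(ALERT_THRESHOLD_C)
    print(f"Room passes its own operating limit after ~{to_limit:.0f} minutes")
    print(f"DCIM alert at the allowable limit fires after ~{to_alert:.0f} minutes")
    print(f"Warning time given away: ~{to_alert - to_limit:.0f} minutes")
```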

DCIM also should have shown that eight of the 10 air conditioners had failed and what had caused the failures, but we didn’t purchase the mechanical equipment module for the DCIM system, and therefore weren’t alerted about cooling unit failures. This was also noted in our disaster recovery report.

#6: Lack of training and certification

We certainly needed more DCIM training, and the GUI was complex and provided so much detailed data that it was difficult to navigate. We tried to revise the GUI so we could see the big picture more easily, but it was not that configurable.

Response

IT should have been included in the selection of this important system, and should have tested it in the same way we benchmark other software before it is purchased.

We were definitely not Tier III, and a real certification would have revealed all these vulnerabilities. Our company cut too many corners in contracting our backup and DR site, but the failure to develop and test a real plan was down to us. We now test twice a year by actually transferring operations.

What Business Continuity Managers can learn from this incident

There are many take-aways from this incident, the requirement for greater realism in business continuity and failover testing probably being the main one. Many of the weaknesses were introduced during the original commissioning of the data centre, some for budgetary reasons, others because poor decisions were made by contractors. But perhaps the most important take-away is this: “don’t rely on desktop walk-throughs to prove your IT failover – actually do the failover in all its aspects”.