Improve Business Resilience – let the monkeys loose!


What if someone at your place of work deployed a service that deliberately kills random servers and processes in your server farm? Would you question the sanity of your IT manager, or would you admire their dedication to ensuring service reliability? Many would think it utter madness to introduce a tool that deliberately tries to disrupt your IT infrastructure. Yet the true path to resilience could be to put yourself under constant attack.

This post was created by Steve Dance, Managing Partner of RiskCentric

Take Netflix, for instance, who have actually introduced software called “Chaos Monkey” that constantly tries to fail servers at their cloud provider. This blog post from Netflix provides a great insight into the concept:

“We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach.

What is Chaos Monkey?

Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support. The service has a configurable schedule that, by default, runs on non-holiday weekdays between 9am and 3pm. In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don’t, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.

Why Run Chaos Monkey?

Failures happen and they inevitably happen when least desired or expected. If your application can’t tolerate an instance failure would you rather find out by being paged at 3am or when you’re in the office and have had your morning coffee? Even if you are confident that your architecture can tolerate an instance failure, are you sure it will still be able to next week? How about next month? Software is complex and dynamic and that “simple fix” you put in place last week could have undesired consequences. Do your traffic load balancers correctly detect and route requests around instances that go offline? Can you reliably rebuild your instances? Perhaps an engineer “quick patched” an instance last week and forgot to commit the changes to your source repository?
There are many failure scenarios that Chaos Monkey helps us detect. Over the last year Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don’t happen again.”
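
To make the mechanics in the quoted announcement a little more concrete, here is a minimal Python sketch of the two rules it describes: only run on non-holiday weekdays between 9am and 3pm, and pick instances to terminate from each group. This is not Netflix's code. The names `in_run_window` and `pick_victims`, the holiday set, the group dictionary and the choice of one random instance per group are all assumptions made for illustration, and the sketch only selects instances rather than calling any cloud provider.

```python
import random
from datetime import date, datetime, time

# Illustrative sketch only: every name here is hypothetical, not Netflix's API.
WINDOW_START = time(9, 0)   # 9am, per the default schedule in the quote
WINDOW_END = time(15, 0)    # 3pm


def in_run_window(now, holidays=frozenset()):
    """Return True on non-holiday weekdays between 09:00 and 15:00."""
    if now.weekday() >= 5:          # Saturday is 5, Sunday is 6
        return False
    if now.date() in holidays:      # the holiday calendar is an assumed input
        return False
    return WINDOW_START <= now.time() < WINDOW_END


def pick_victims(groups, now, holidays=frozenset(), rng=None):
    """Choose one instance per non-empty group (an assumption), or nothing
    at all when we are outside the run window."""
    if not in_run_window(now, holidays):
        return {}
    rng = rng or random.Random()
    return {
        name: rng.choice(instances)
        for name, instances in groups.items()
        if instances                # skip groups with nothing running
    }


if __name__ == "__main__":
    groups = {"web": ["i-1", "i-2"], "api": ["i-3"]}
    print(pick_victims(groups, datetime.now()))
```

A real deployment would replace the dictionary with a lookup of the actual instance groups and hand the selected instances to the provider's terminate call. The point of the sketch is simply that the "only when engineers are around" rule is a small, explicit check that runs before anything gets killed.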

This approach gives a real insight into how complex, high-availability IT infrastructure needs a different approach to resilience management and assurance. Periodic testing of recovery and failover arrangements may not be enough; systems need to be constantly under duress to ensure their resilience and stability. Does this approach work? It seems to have helped Netflix during the highly publicised outage at Amazon Web Services (which Netflix uses to host its video on demand service). The outage took down or severely hampered a number of popular websites that depend on AWS for hosting. Netflix, however, came through relatively unscathed because the constant testing performed by Chaos Monkey helped them engineer their systems and services for failure.

It is interesting to note that when we talk about the “Nirvana” of having Business Continuity and Resilience embedded in the organisation, it does not get more embedded than this approach to “always on exercising”.