Think of a vaccine: you inject a tiny amount of a potentially harmful microorganism into your body in order to gain immunity against it. Chaos Engineering is a methodology we use to build similar immunity in our distributed systems by injecting harm (such as memory exhaustion, host failures, or network attacks) in order to detect and mitigate potential vulnerabilities.[1]
System failures can cause huge monetary losses for companies. Even short downtimes can hurt a company's bottom line, so the cost of downtime has become a KPI for many software engineering teams. In 2016, an ITIC survey found that 98% of organizations said a single hour of downtime would cost their business over $100,000.
Chaos Engineering is an approach developed at Amazon and Netflix that aims to improve the resilience of a system. By designing and executing Chaos Engineering experiments, you can learn about weaknesses in your system and then address those weaknesses proactively.[2]
Building confidence in a system through testing is a general practice. But is it enough? Functional tests are simple to perform and automate at the unit and integration levels, but what about the non-functional behavior of distributed systems? That behavior can only be properly tested in production.
“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Principles of Chaos Engineering
Before adopting Chaos Engineering, you need to answer this one question: Is your system resilient to real-world events such as service failures and network latency spikes? If you know that the answer is "no", then you have some work to do before applying the Chaos Engineering principles.[2:1]
Phases of Chaos Engineering
Build a Hypothesis around Steady State Behavior
First, let's define what steady state is. Steady state is the "normal" behavior of your system, represented by performance metrics that usually correlate with customer success, such as the system's overall throughput, latency percentiles, etc.[3]
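To make this concrete, here is a minimal sketch (not from the article) of turning a window of raw request latencies into steady-state metrics; the sample values, the 60-second window, and the metric names are assumptions made for illustration.

```python
# Minimal sketch: summarize a window of request measurements into
# steady-state metrics (throughput and latency percentiles).
# The sample latencies and the window size are hypothetical.
from statistics import quantiles

def steady_state(latencies_ms, window_seconds):
    """Return throughput (req/s) and p50/p99 latency for one window."""
    cuts = quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "throughput_rps": len(latencies_ms) / window_seconds,
        "latency_p50_ms": cuts[49],
        "latency_p99_ms": cuts[98],
    }

# Example: a one-minute window of latencies pulled from your monitoring system.
print(steady_state([12, 15, 14, 13, 200, 16, 15, 14], window_seconds=60))
```

In practice these numbers come from your monitoring stack rather than an in-process list; the point is that "steady state" should be a handful of measurable values you can compare before, during, and after an experiment.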
Then, let's build a hypothesis.[4] What if...?
- "What if the CPU becomes overloaded?"
- "What if a host goes away?"
- "What if the database becomes slow?"
- "What if latency increases by 400ms?"
- "What if Redis stops?"
Plan & Run Experiments
Now it's time to plan the experiment! Pick a hypothesis and define the scope of the experiment. For instance, the experiment can be scoped to affect only a small number of users.
Monitor your system and identify metrics. Without visibility into your system’s behavior, you won’t be able to draw conclusions from your experiments. And of course, notify the organization about the experiment.[2:2][3:1]
As a rule of thumb, you should start with small, narrowly scoped experiments. The experiment should run in an environment as close as possible to production, such as staging, before running in production.[4:1]
Use tools that help you stay in control and minimize the “blast radius” of the failure injection. And most important of all, have an emergency STOP, i.e., a rollback plan.[1:1][4:2]
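Here is a minimal sketch of what such an emergency STOP can look like, assuming hypothetical hooks `start_injection`, `stop_injection`, and `current_error_rate` into your own tooling and monitoring; the 2% error bound and the five-minute duration are made-up values.

```python
# Minimal sketch of an emergency STOP: abort the experiment as soon as a
# steady-state metric leaves its acceptable range, and always roll back.
import time

MAX_ERROR_RATE = 0.02     # assumed steady-state bound: at most 2% errors
EXPERIMENT_SECONDS = 300  # small, time-boxed experiment

def run_experiment(start_injection, stop_injection, current_error_rate):
    start_injection()
    try:
        deadline = time.time() + EXPERIMENT_SECONDS
        while time.time() < deadline:
            if current_error_rate() > MAX_ERROR_RATE:
                print("Steady state violated -- aborting experiment")
                break
            time.sleep(5)  # re-check the metric every few seconds
    finally:
        stop_injection()  # the rollback runs no matter what happens above
```

The `try/finally` is the rollback plan in miniature: whatever happens during the experiment, the injection is switched off.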
Remember: It isn't failure! It's data!
Check the results and learn
After running the experiment, it's time to evaluate its results.
Some common questions to address are [4:3] (a sketch for turning them into numbers follows the list):
- How long to detect the failure?
- How long for the degradation to begin?
- How long for self-recovery to begin?
- How long to fully recover?
- How long to go back to a steady state?
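As a small, hypothetical sketch of how these questions become numbers, the timestamps below are example values you would record during the experiment (from logs, dashboards, or the incident timeline); none of them come from a real incident.

```python
# Hypothetical sketch: derive the timing questions above from timestamps
# recorded during the experiment. All values are example data.
from datetime import datetime

events = {
    "injection_started": datetime(2023, 5, 4, 10, 0, 0),
    "degradation_began": datetime(2023, 5, 4, 10, 0, 40),
    "failure_detected":  datetime(2023, 5, 4, 10, 2, 10),
    "recovery_began":    datetime(2023, 5, 4, 10, 3, 0),
    "steady_state_back": datetime(2023, 5, 4, 10, 9, 30),
}

def minutes_between(start_key, end_key):
    return (events[end_key] - events[start_key]).total_seconds() / 60

print("Time to detect:      %.1f min" % minutes_between("injection_started", "failure_detected"))
print("Time to degradation: %.1f min" % minutes_between("injection_started", "degradation_began"))
print("Time to full recovery: %.1f min" % minutes_between("failure_detected", "steady_state_back"))
```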
Important to remember: Failures are caused by multiple faults! So, don't blame that one person... :)
Fix
After checking the results, either you’ve verified that your system is resilient to the failure you injected, or you’ve found an issue you need to fix. Both of these are good results. Is there some fix to do? What are you waiting for? Do it now! ;)
Automate!
The process of running experiments can be onerous and demand a lot of manual work. You can use automation tools to run them without taking up too much of your time.
What are some good Chaos Engineering tools?
Big challenges to Chaos Engineering
Chaos Engineering requires a big cultural change. Some thoughts and behaviors that hinder its adoption include [4:4]:
- Teams have no time or flexibility to simulate failures
- Teams are already investing all of their energy in fixing things
- It might lead to profound discussions
- Chaos engineering experiments can show that the architecture is not as resilient to failures as originally predicted
That cultural change can come once the benefits are perceived.
What are the customer, business, and technical benefits of Chaos Engineering?
Customer Benefits
Increased availability and durability of the service mean fewer outages interfering with customers' daily lives.[1:2]
Business Benefits
Chaos Engineering can help prevent tremendous losses in income and maintenance costs, create more joyful and engaged engineers, enhance on-call training for engineering teams, and strengthen the incident management process for the whole organization.[1:3]
Technical Benefits
The insights from chaos experiments can lead to fewer incidents, a reduced on-call burden, a better understanding of system failure modes, improved system design, a faster mean time to detection for incidents, and fewer repeated incidents.[1:4]
Find out more!
Pavlos Ratis developed a GitHub repo named "Awesome Chaos Engineering", a curated collection of Chaos Engineering resources. There you can find tools, books, papers, conferences, meet-ups, blogs, newsletters, forums and more.[1:5]
So, would you break it?
I hope I gave you some good reasons why you should break your systems on purpose.
We all need to be proactive in our efforts to figure out the weaknesses of our systems and fix them before they break when least expected.
“Chaos doesn't cause problems. It reveals them.”
Nora Jones, Senior Chaos Engineer, Netflix [5]