Getting Started with Chaos Engineering

May 15, 2021

In this blog, we will talk about Chaos Engineering. To understand what Chaos Engineering is, we first need to understand resilience.


Resilience is a system’s ability to recover from a fault and keep its service dependable when a fault occurs. Anyone who is looking at resilience will usually have a steady-state hypothesis for their system, that is, a measurable description of how the system behaves when it is healthy. If that steady state is regained after a fault occurs, then the system is said to be resilient against that fault.
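To make the idea of a steady state concrete, here is a minimal Python sketch. The `Metrics` fields and the thresholds (1% errors, 300 ms p99 latency) are assumptions chosen only for illustration; a real system would define its own measurable signals.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical health signals sampled from a system."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile latency in milliseconds


def in_steady_state(m: Metrics,
                    max_error_rate: float = 0.01,
                    max_p99_ms: float = 300.0) -> bool:
    """Steady-state hypothesis: errors and latency stay under the limits."""
    return m.error_rate <= max_error_rate and m.p99_latency_ms <= max_p99_ms


def is_resilient(before_fault: Metrics, after_recovery: Metrics) -> bool:
    """Resilient against a fault if the steady state held before the fault
    and is regained afterwards."""
    return in_steady_state(before_fault) and in_steady_state(after_recovery)
```

Calling `is_resilient` with one sample taken before the fault and one taken after recovery gives a yes or no answer for that particular fault.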

Chaos Engineering:

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. In other words, it is a discipline that directly addresses system availability, and it applies to the application, platform, and infrastructure layers.

Chaos Engineering is fundamental to increasing the resilience of today’s cloud-native, highly dynamic applications and infrastructure.

It is different from software testing. Chaos Engineering is used for all sorts of requirements and unpredictable situations. With Chaos Engineering, we try to see how resilient our entire system is, which means how the system reacts when an individual component fails.

We always have trust issues with the production environment:

What happens when a service is not accessible? What happens when an application receives too much traffic? What happens when our application goes down? What happens when there is something wrong with the networking? And many more.
Chaos Engineering helps us answer all these questions.

Benefits of Chaos Engineering:

It helps us to reduce failures and outages while improving our understanding of our system design.

It helps customers, who are less disrupted by outages because it improves service availability and durability.

At the business level, it helps to prevent revenue losses and lower maintenance costs.

Chaos Engineering offers more benefits than other forms of software testing, such as failure testing. A failure test examines a single condition and determines whether a property is true or false. Such a test breaks a system in a preconceived way. The results are usually binary, and they can’t test a system under unpredictable or unexpected stresses.
Chaos Engineering, on the other hand, can account for complex, diverse, and real-world issues or outages. With Chaos Engineering, we can fix issues and gain new insights about an application for future improvements.

Principles of Chaos Engineering:

  • Build a hypothesis around steady-state behavior
  • Vary real-world events
  • Run experiments in production
  • Automate experiments to run continuously
  • Minimize blast radius
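
As an illustration of the last principle, one simple way to limit blast radius is to inject a fault into only a small fraction of instances. The sketch below is a minimal Python example; the function name `pick_targets`, the instance names, and the default fraction are assumptions made up for illustration.

```python
import math
import random


def pick_targets(instances, fraction=0.1, seed=None):
    """Choose a small random subset of instances to receive the fault.

    At least one instance is chosen whenever any exist, so the
    experiment still runs on very small deployments.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    instances = list(instances)
    if not instances:
        return []
    count = max(1, math.floor(len(instances) * fraction))
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(instances, count)


# Example: target a quarter of 20 hypothetical pods.
pods = [f"pod-{i}" for i in range(20)]
print(pick_targets(pods, fraction=0.25, seed=1))
```

Starting with a small fraction and widening it only after the hypothesis holds keeps the impact on real users low.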

How Chaos Engineering Works:

All testing in Chaos Engineering happens through chaos experiments. Each experiment starts by injecting a specific fault into a system, such as a CPU failure or added latency.
The basic step-by-step process of Chaos Engineering is:

  • Define a steady-state hypothesis: Here we analyze the system and choose what failure to cause. The core step of Chaos Engineering is to predict how the system will behave once it encounters a particular fault.
  • Trigger real-world events: Here we perform tests using real-world scenarios to check how our system behaves under particular circumstances.
  • Collect metrics and verify the hypothesis: Here we measure our system’s durability and availability. We measure the failure against our hypothesis by looking at factors like the impact on latency or requests per second, so that we can verify the resilience of the system.
  • Fix issues: After running an experiment, we should have a good idea of what is working and what needs to be altered. Now we can identify what will lead to an outage, and we know exactly what breaks the system. So we fix it and try again with a new experiment.
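
The steps above can be sketched as a small Python loop. This is only a simulation under stated assumptions: `call_service` is a stand-in for a real request, the injected fault is extra latency, and the 100 ms limit is an arbitrary threshold picked for the example.

```python
import random
import statistics


def call_service(rng, injected_latency_ms=0.0):
    """Simulated request: returns a response time in milliseconds."""
    return rng.uniform(20, 60) + injected_latency_ms


def measure(rng, n=200, injected_latency_ms=0.0):
    """Collect metrics: mean response time over n simulated requests."""
    samples = [call_service(rng, injected_latency_ms) for _ in range(n)]
    return statistics.mean(samples)


def run_experiment(injected_latency_ms, max_mean_ms=100.0, seed=42):
    rng = random.Random(seed)

    # 1. Steady-state hypothesis: mean response time stays under the limit.
    baseline = measure(rng)
    if baseline > max_mean_ms:
        raise RuntimeError("system is not in steady state; aborting")

    # 2. Trigger the event: inject latency into every request.
    during_fault = measure(rng, injected_latency_ms=injected_latency_ms)

    # 3. Verify the hypothesis against the collected metrics.
    return {
        "baseline_ms": baseline,
        "during_fault_ms": during_fault,
        "hypothesis_holds": during_fault <= max_mean_ms,
    }


# 4. A failed hypothesis points at what to fix before the next run.
for latency in (10, 200):
    print(latency, run_experiment(latency)["hypothesis_holds"])
```

In a real experiment the measurement would come from monitoring, and the fault would be injected by a chaos tool instead of a function argument.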

Tool for Chaos Engineering


Litmus is a chaos engineering framework for Kubernetes. It provides a complete set of tools required by Kubernetes developers and SREs to carry out chaos tests easily and in a Kubernetes-native way. Litmus can be used to run chaos experiments initially in the staging environment and eventually in production to find bugs and vulnerabilities. Fixing these weaknesses leads to increased resilience of the system. Litmus adopts a “Kubernetes-native” approach to define chaos intent in a declarative manner via custom resources.

It is highly extensible and integrates with other tools to enable the creation of custom experiments. Kubernetes developers & SREs use Litmus to manage chaos in a declarative manner and find weaknesses in their applications and infrastructure.
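
To give a feel for what declaring chaos intent as a custom resource might look like, here is a hedged Python sketch that builds a ChaosEngine manifest as a dictionary and prints it as JSON. The field names follow the ChaosEngine layout as best understood here and should be checked against the Litmus version you run. The resource name `demo-chaos`, the service account `pod-delete-sa`, and the label `app=nginx` are hypothetical.

```python
import json


def pod_delete_engine(app_namespace, app_label, duration_s=30):
    """Build a ChaosEngine-style manifest for a pod-delete experiment.

    Assumption: field names mirror the Litmus ChaosEngine custom
    resource; verify them against the installed CRD before use.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "demo-chaos", "namespace": app_namespace},
        "spec": {
            "engineState": "active",
            # Which application the chaos is aimed at.
            "appinfo": {
                "appns": app_namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": "pod-delete-sa",  # hypothetical name
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                {
                                    "name": "TOTAL_CHAOS_DURATION",
                                    "value": str(duration_s),
                                }
                            ]
                        }
                    },
                }
            ],
        },
    }


print(json.dumps(pod_delete_engine("default", "app=nginx"), indent=2))
```

Because the intent is plain data, it can be reviewed and versioned like any other Kubernetes manifest.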

To learn more about Litmus, read this blog by Uma Mukkara sir:

Want to stay in touch with what is happening in Chaos Engineering on Kubernetes? Join the #litmus channel on Kubernetes Slack.