
Chaos engineering in Kubernetes using Chaos Mesh

Aug 11, 2022 | Announcements, Migration, MSP

Site reliability engineering (SRE) teams often use chaos engineering to proactively test and improve resilience under fault conditions. At nClouds, we use chaos engineering to experiment with our infrastructure: we look for weak points in our systems and applications by intentionally injecting failures in controlled environments. This shows us how our systems behave under pressure, and it helps us find serious bugs and apply fixes that make our systems and applications more resilient. However, we cannot talk about chaos engineering in Kubernetes without explaining Chaos Mesh.

Chaos Mesh is a powerful platform for chaos engineering in Kubernetes. It is a cloud-native open-source project that is easily deployed on Kubernetes clusters. Chaos Mesh is built on Kubernetes custom resource definitions (CRDs) and provides many types of faults, which lets us define and run chaos experiments. By running these experiments, we observe abnormalities that might occur in real production environments.

Chaos Mesh has three main components: Chaos Dashboard, Chaos Daemon, and Chaos Controller Manager.

Chaos Dashboard is a user interface where we create, run, query, and monitor all of the experiments in our Kubernetes clusters. Chaos Daemon runs on each node. It has privileged permissions to access each node’s network, storage, and file systems, and it can also interact with the kernel. The Chaos Controller Manager is responsible for scheduling and managing the experiments. It manages the CRDs (Custom Resource Definitions) and controllers that are required to run the chaos experiments.

Because Chaos Mesh assumes that it is running on a cloud platform, the Chaosd component is not installed by default. So, if our Kubernetes cluster is running on physical or virtual machines, Chaosd must be installed separately as a service to simulate chaos experiments. We can then also inject faults on these hosts based on the CRDs.

Chaos experiments

We can use Chaos Mesh to simulate three types of faults that might occur in our systems or applications in production: Basic Resource Faults, Platform Faults, and Application Faults. The Basic Resource Faults enable us to simulate faults in Pods, the network, DNS, resource usage, and more. For example, with PodChaos we can kill Pods or specific containers in a Pod. Similarly, StressChaos enables us to generate high CPU and memory stress on Pods, and with DNSChaos we can make DNS requests fail or return wrong responses. Many more fault types are available under the Basic Resource Faults.
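To make the basic resource faults more concrete, here is a minimal sketch that builds PodChaos, StressChaos, and DNSChaos manifests as plain Python dictionaries and renders them as JSON. The namespace `demo`, the label `app: web`, the container name `web`, the experiment names, and the domain `example.com` are all hypothetical placeholders, and the field values are illustrative only; check them against the Chaos Mesh version you run.

```python
import json

# API group/version used by Chaos Mesh custom resources.
API_VERSION = "chaos-mesh.org/v1alpha1"


def chaos_manifest(kind, name, namespace, spec):
    """Wrap a spec in the standard Kubernetes object envelope."""
    return {
        "apiVersion": API_VERSION,
        "kind": kind,
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


# Hypothetical target: Pods labeled app=web in the "demo" namespace.
target = {"namespaces": ["demo"], "labelSelectors": {"app": "web"}}

# PodChaos: kill one randomly chosen matching Pod.
pod_kill = chaos_manifest("PodChaos", "pod-kill-demo", "demo", {
    "action": "pod-kill",
    "mode": "one",
    "selector": target,
})

# PodChaos: kill only the (hypothetical) "web" container in one Pod.
container_kill = chaos_manifest("PodChaos", "container-kill-demo", "demo", {
    "action": "container-kill",
    "mode": "one",
    "containerNames": ["web"],
    "selector": target,
})

# StressChaos: CPU and memory pressure on one Pod for 60 seconds.
stress = chaos_manifest("StressChaos", "stress-demo", "demo", {
    "mode": "one",
    "selector": target,
    "stressors": {
        "cpu": {"workers": 1, "load": 80},
        "memory": {"workers": 1, "size": "256MB"},
    },
    "duration": "60s",
})

# DNSChaos: DNS lookups for example.com return an error in all matching Pods.
dns_error = chaos_manifest("DNSChaos", "dns-error-demo", "demo", {
    "action": "error",
    "mode": "all",
    "patterns": ["example.com"],
    "selector": target,
    "duration": "60s",
})


def render(manifest):
    """Serialize a manifest to JSON, which kubectl accepts like YAML."""
    return json.dumps(manifest, indent=2)


print(render(pod_kill))
```

The rendered JSON can be saved to a file and applied with `kubectl apply -f`, in the same way as a YAML manifest. Building manifests in code is just one option; writing the YAML by hand or using the Chaos Dashboard works equally well.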

For Platform Faults, Chaos Mesh currently supports three fault types: AWSChaos, GCPChaos, and AzureChaos. For Application Faults, it provides JVMChaos, with which we can simulate faults in JVM applications.
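As a rough illustration of a platform fault and an application fault, the sketch below builds an AWSChaos manifest that stops an EC2 instance and a JVMChaos manifest that adds latency to a Java method. Everything specific here is an assumption made for the example: the Secret name `cloud-key-secret`, the region, the instance ID placeholder, the namespace `demo`, and the class and method names `Main` and `sayhello`.

```python
import json

API_VERSION = "chaos-mesh.org/v1alpha1"


def chaos_manifest(kind, name, namespace, spec):
    """Wrap a spec in the standard Kubernetes object envelope."""
    return {
        "apiVersion": API_VERSION,
        "kind": kind,
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


# AWSChaos: stop one EC2 instance for five minutes.
# "cloud-key-secret" is a hypothetical Kubernetes Secret holding AWS credentials.
ec2_stop = chaos_manifest("AWSChaos", "ec2-stop-demo", "chaos-mesh", {
    "action": "ec2-stop",
    "secretName": "cloud-key-secret",
    "awsRegion": "us-east-2",
    "ec2Instance": "your-ec2-instance-id",  # placeholder, not a real ID
    "duration": "5m",
})

# JVMChaos: add 10 seconds (10000 ms) of latency to a hypothetical
# Main.sayhello method in every Pod of the "demo" namespace.
jvm_latency = chaos_manifest("JVMChaos", "jvm-latency-demo", "demo", {
    "action": "latency",
    "class": "Main",
    "method": "sayhello",
    "latency": 10000,
    "mode": "all",
    "selector": {"namespaces": ["demo"]},
})

for manifest in (ec2_stop, jvm_latency):
    print(json.dumps(manifest, indent=2))
```

Note that the platform fault targets a cloud resource by ID rather than selecting Pods, which is why the AWSChaos spec has no `selector` or `mode` field in this sketch.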

Benefits of chaos engineering:

Business benefits

  • Prevents lengthy outages and data loss.
  • Helps avert significant losses in revenue due to downtime.
  • Enables companies to scale quickly without losing the reliability of their services.
  • Improves user experience with less interruption and high service availability.

Technical benefits

  • Provides insights from chaos experiments to reduce incidents, MTTD, and MTTR.
  • Gives the team an increased understanding of system failure modes and dependencies.
  • Enables building a more robust system design.
  • Serves as excellent on-call training for the engineering team.

While Chaos Mesh currently does not fully support Microsoft Windows, it does allow us to run PodChaos on Windows nodes, so we can inject Pod faults there, such as killing the Pods that run on those nodes.
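One way such an experiment could be scoped to Windows nodes is with a node selector on the well-known `kubernetes.io/os` node label. The sketch below assumes that label is set on the nodes (Kubernetes sets it by default) and uses a hypothetical namespace `demo` and experiment name; it is an illustration, not a tested recipe for Windows clusters.

```python
import json

# PodChaos that kills one Pod, restricted to Pods scheduled on Windows nodes.
windows_pod_kill = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "PodChaos",
    "metadata": {"name": "windows-pod-kill-demo", "namespace": "demo"},
    "spec": {
        "action": "pod-kill",
        "mode": "one",
        "selector": {
            "namespaces": ["demo"],  # hypothetical namespace
            # Only consider Pods on nodes carrying this label.
            "nodeSelectors": {"kubernetes.io/os": "windows"},
        },
    },
}

print(json.dumps(windows_pod_kill, indent=2))
```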

Now, let’s install Chaos Mesh in our Kubernetes cluster and go deeper into its capabilities:

Check out our on-demand webinar and blogs for additional insights on Site Reliability Engineering:

On-demand webinar:

How DevOps Teams Use SRE to Innovate Faster with Reliability

Blog Posts:

Getting started with site reliability engineering (SRE)
A real-world example: How to set Service-Level Objectives (SLOs)
Five fundamental best practices for your SRE team
Tutorial: How to automate a runbook to reduce MTTR

We’d love to help you to get started with Site Reliability Engineering. Learn more about nClouds’ SRE Services here, or Contact Us today.

