Site reliability engineering (SRE) teams often use chaos engineering to proactively prove and improve resilience during fault conditions. At nClouds, we use chaos engineering to experiment with our infrastructure. We check for weak points in our systems and applications by intentionally injecting failure events within controlled environments. In this way, we evaluate how our systems behave under pressure. This method also enables us to identify significant bugs and apply corrections that make our systems and applications more resilient. However, we cannot talk about chaos engineering in Kubernetes without explaining Chaos Mesh.
Chaos Mesh is a powerful platform for chaos engineering in Kubernetes. It is a cloud-native open-source project that is easily deployed on Kubernetes clusters. Chaos Mesh is built on Kubernetes custom resource definitions (CRDs) and provides various types of faults that we can combine into chaos experiments. By running these experiments, we can reproduce and observe abnormalities that might occur in production environments.
Chaos Mesh has three main components: Chaos Dashboard, Chaos Daemon, and Chaos Controller Manager.
Chaos Dashboard is a user interface where we create, run, query, and monitor all of the experiments in our Kubernetes clusters. Chaos Daemon runs on each node. It has privileged permissions to access each node’s network devices and file systems, and it can also interact with the kernel. The Chaos Controller Manager is responsible for scheduling and managing the experiments. It does so through CRDs (Custom Resource Definitions) and their controllers, both of which are required to run chaos experiments.
Because Chaos Mesh assumes that it is running on a cloud platform, the component ChaosD is not installed by default. So, if our Kubernetes cluster is running on physical or virtual machines, then ChaosD must be installed separately as a service to simulate chaos experiments on them. We can then also inject faults on these hosts based on the CRDs.
We can use Chaos Mesh to simulate three types of faults that might occur in our systems or applications in production: Basic Resource Faults, Platform Faults, and Application Faults. The Basic Resource Faults enable us to simulate faults involving Pods, network, stress, DNS, etc. For example, with PodChaos we can kill Pods or specific containers in a Pod. Similarly, StressChaos enables us to generate high CPU and memory stress on Pods, and with DNSChaos we can simulate DNS faults, such as failed lookups or incorrect responses. Many more tests are available under the Basic Resource Faults.
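To make the Basic Resource Faults more concrete, here is a minimal sketch of how a PodChaos and a StressChaos experiment could be described. It builds the manifests as Python dictionaries and prints them as JSON, which kubectl accepts just like YAML. The experiment names, the chaos-mesh and demo namespaces, and the app: web label are hypothetical placeholders, and the field names follow our understanding of the chaos-mesh.org/v1alpha1 API, so check them against the Chaos Mesh version you have installed.

```python
import json


def pod_kill_experiment(name, namespace, target_namespace, labels):
    """Build a PodChaos manifest that kills one Pod matching the selector."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # pick a single Pod among those that match
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": labels,
            },
        },
    }


def cpu_stress_experiment(name, namespace, target_namespace, labels,
                          workers=1, load=80, duration="60s"):
    """Build a StressChaos manifest that applies CPU load to one matching Pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "StressChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "mode": "one",
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": labels,
            },
            # workers = number of stress threads, load = percent CPU per worker
            "stressors": {"cpu": {"workers": workers, "load": load}},
            "duration": duration,
        },
    }


# Hypothetical targets: Pods labeled app=web in the "demo" namespace.
pod_kill = pod_kill_experiment("pod-kill-example", "chaos-mesh", "demo", {"app": "web"})
cpu_stress = cpu_stress_experiment("cpu-stress-example", "chaos-mesh", "demo", {"app": "web"})

print(json.dumps(pod_kill, indent=2))
print(json.dumps(cpu_stress, indent=2))
```

Saving either JSON document to a file and running kubectl apply -f on it would submit the experiment, assuming Chaos Mesh and its CRDs are already installed in the cluster.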
For Platform Faults, Chaos Mesh currently supports three test types: AWSChaos, GCPChaos, and AzureChaos. The Application Faults category provides JVMChaos, with which we can simulate faults in JVM applications.
While Chaos Mesh currently does not fully support Microsoft Windows, it does allow us to run PodChaos on Windows nodes. This means we can inject Pod-level faults there, such as killing the Pods that run on the Windows nodes.
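As an illustration of the Windows case, the sketch below describes a PodChaos experiment that only targets Pods scheduled on Windows nodes. It assumes that the Chaos Mesh selector supports a nodeSelectors field and that the nodes carry the standard kubernetes.io/os: windows label. The experiment name and the chaos-mesh and demo namespaces are hypothetical placeholders, so verify the fields against your Chaos Mesh version before using it.

```python
import json


def windows_pod_kill_experiment(name, namespace, target_namespace):
    """Build a PodChaos manifest limited to Pods running on Windows nodes."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # kill a single matching Pod
            "selector": {
                "namespaces": [target_namespace],
                # Assumption: nodes are labeled with the well-known OS label.
                "nodeSelectors": {"kubernetes.io/os": "windows"},
            },
        },
    }


# Hypothetical names and namespaces, for illustration only.
windows_pod_kill = windows_pod_kill_experiment(
    "windows-pod-kill-example", "chaos-mesh", "demo"
)

print(json.dumps(windows_pod_kill, indent=2))
```

Restricting the selector by node label rather than by Pod label keeps the experiment away from Linux workloads even when both kinds of Pods share a namespace.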
Now, let’s install Chaos Mesh in our Kubernetes cluster and go deeper into its capabilities:
Check out our on-demand webinar and blogs for additional insights on Site Reliability Engineering:
Getting started with site reliability engineering (SRE)
A real-world example: How to set Service-Level Objectives (SLOs)
Five fundamental best practices for your SRE team
Tutorial: How to automate a runbook to reduce MTTR