nSights Talks

Chaos Engineering in Kubernetes

Tutorial Highlights & Transcript

00:00 - Introduction

Hi, my name is Jasmeet Singh and I’m from On-Call Support Services. My topic is Chaos Engineering in Kubernetes. Let’s start.

00:16 - What is Chaos Engineering?

What is Chaos Engineering? Chaos Engineering is a method in which we experiment on our infrastructure by intentionally injecting failure events to find the weak points of our systems or applications. In this way, we can check how our systems behave under pressure, identify major bugs, and make our systems or applications highly resilient.

00:38 - What is Chaos Mesh?

When we talk about Chaos Engineering in Kubernetes, we end up at Chaos Mesh. It is a powerful platform for Chaos Engineering in Kubernetes, and it is a cloud-native open source project that we can easily deploy on our Kubernetes clusters. Chaos Mesh is built on Kubernetes custom resource definitions (CRDs) and provides various types of faults that we can simulate on our systems. These are called Chaos experiments. By running these experiments, we can observe various abnormalities that might occur in real production environments.

01:13 - What are the components of Chaos Mesh?

So what are the components of Chaos Mesh? Chaos Mesh has three main components. The first one is Chaos Dashboard. It provides a user interface where we can create, run, query, and monitor all the experiments that we run in our Kubernetes clusters. The second component is Chaos Daemon. Chaos Daemon runs on each node with privileged permissions, so it can access each node’s network, storage, and file systems, and it can also interact with the kernel. The third component is Chaos Controller Manager. It is responsible for scheduling and managing the experiments, and it manages the CRDs and the controllers that are required to run the Chaos experiments.

There is one more tool, ChaosD, which is not installed by default when we install Chaos Mesh. It is a service that we have to install separately. For example, if our Kubernetes cluster is not running on a cloud platform but on physical or virtual machines, then to simulate Chaos experiments on those hosts, we need to install ChaosD as a service. We can then inject faults on the physical host based on the CRDs.

02:29 - Chaos Experiments

Let’s talk about the Chaos experiments. Using Chaos Mesh, we can simulate different types of faults that might occur in our systems or applications. These are categorized into three types: Basic Resource Faults, Platform Faults, and Application Faults. Under the Basic Resource Faults, we can simulate faults on pods: we can kill the pods, or we can kill specific containers in a pod. Similar to this, we have StressChaos, with which we can drive CPU and memory to high utilization, and HTTPChaos, with which we can simulate bad requests or failed responses. There are many more under the Basic Resource Faults. For the Platform Faults, Chaos Mesh currently supports three: AWSChaos, GCPChaos, and AzureChaos. With these faults, we can restart or stop a specific node. And for the Application Faults, we have JVMChaos, with which we can simulate faults in JVM applications. Now let us install Chaos Mesh in our Kubernetes cluster.
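As an illustration of what one of these experiments looks like as a Kubernetes object, here is a minimal Python sketch that builds a StressChaos manifest and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The experiment name, namespace, label, and load values are assumptions chosen for the example and do not come from the talk; the field names follow the StressChaos CRD as I understand it, so check them against the Chaos Mesh version you run.

```python
import json


def stress_chaos(name, namespace, labels, cpu_load, memory_size, duration):
    """Build a StressChaos manifest as a plain dict (illustrative sketch)."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "StressChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # "one" targets a single randomly chosen pod that matches the selector.
            "mode": "one",
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": labels,
            },
            "stressors": {
                # One worker loading a CPU core to roughly cpu_load percent.
                "cpu": {"workers": 1, "load": cpu_load},
                # One worker allocating memory_size of memory.
                "memory": {"workers": 1, "size": memory_size},
            },
            "duration": duration,
        },
    }


# All values below are hypothetical and only serve the example.
manifest = stress_chaos(
    name="stress-example",
    namespace="default",
    labels={"app": "example-app"},
    cpu_load=80,
    memory_size="256MB",
    duration="60s",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied to a test cluster. Building the manifest in code rather than by hand makes it easier to vary the load or the selector between runs.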

03:40 - Demo - Installing Chaos Mesh in Kubernetes Cluster

For this demo, I have an EKS cluster running with two nodes registered, and the two nodes run different operating systems. As you can see here, my first node runs Windows and the second runs Linux. One thing I would like to tell you is that Chaos Mesh currently does not support Windows, with one exception: it allows us to run PodChaos on Windows nodes. So on Windows nodes we can only run the pod-related faults, such as killing pods.

To install Chaos Mesh, we have two options. The first is to execute a simple script, which is what I’m going to do now. The other is to use Helm to install Chaos Mesh in our Kubernetes clusters. As you can see here, different custom resource definitions are being created, and the faults we simulate will correspond to these custom resource definitions. You can see here there is a CRD called AWSChaos, with which we can restart or stop an EC2 instance that serves as a node. Various other CRDs are being created here as well. We will have to wait a few seconds for the installation to complete. Okay, now it is creating the services and the deployments for Chaos Mesh, and we will be good to go in a few seconds. Now the pods are created. Okay, Chaos Mesh is installed in this EKS cluster. Let us first verify that all the pods related to Chaos Mesh are running fine. Yes, all the pods related to Chaos Mesh are running fine.
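The Helm route mentioned above can be sketched as a short list of commands. The Python snippet below only assembles and prints them as a dry run; it does not execute anything. The namespace `chaos-mesh` and the release name are assumptions, not values shown in the demo, and depending on the container runtime of your nodes the chart may need extra values, so consult the chart documentation before running these for real.

```python
import shlex

# Public Helm chart repository of the Chaos Mesh project.
CHART_REPO = "https://charts.chaos-mesh.org"
# Assumed namespace; the demo does not say which one it uses.
NAMESPACE = "chaos-mesh"


def helm_install_commands(namespace=NAMESPACE):
    """Return the commands for a basic Helm installation, as argv lists."""
    return [
        ["helm", "repo", "add", "chaos-mesh", CHART_REPO],
        ["helm", "repo", "update"],
        [
            "helm", "install", "chaos-mesh", "chaos-mesh/chaos-mesh",
            "--namespace", namespace, "--create-namespace",
        ],
        # Verification step: list the Chaos Mesh pods once the release is installed.
        ["kubectl", "get", "pods", "--namespace", namespace],
    ]


commands = helm_install_commands()
for argv in commands:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Keeping each command as an argv list means the same data could later be passed to `subprocess.run` without going through a shell.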

06:00 - Simulating a Network Fault using Chaos Mesh

Let us now simulate a network fault using Chaos Mesh. I have a simple web application; let me show you the YAML file for that application. I took this example directly from the Chaos Mesh documentation site. Here you can see the image I’m using, and in the command, the target IP I used is the Kube DNS service IP. Let me first create this application, and in the meantime, let me grab the IP of my node. This is the application. It is built so that it shows its latency in milliseconds in this graph.

Now, using Chaos Mesh, we are going to simulate a network fault in this application. Let me also show you the NetworkChaos file I will use in this demo. This is the network YAML file. The kind is NetworkChaos, and for the action, I’m going to use delay. This means that we are going to add 10 milliseconds of latency on the pods which have the label app set to web-show. We can create this object just as we create any other object in a Kubernetes cluster, by running this simple command. As soon as you create this object, you will observe that our application’s latency increases by about 10 milliseconds. So here we go. This is the whole power of Chaos Mesh. Now our application is seeing high latency, and from this we can plan further steps to mitigate these types of issues if they happen in production environments.

If we dig deeper into Chaos Mesh, it also lets us pause these kinds of experiments by running a simple command. This is the command. It annotates the NetworkChaos object that we created here. After executing this command, you will observe that our application’s latency returns to its original state. So here we go, you can see that our application’s latency is now back at its original level. That was from the Kubernetes side. Chaos Mesh also provides us with the Chaos Dashboard.
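The delay experiment and the pause command described above can be sketched as follows. This Python snippet builds a NetworkChaos manifest with a 10 ms delay and the `kubectl annotate` commands that pause and resume it; it prints everything rather than touching a cluster. The object name `web-show-network-delay`, the `default` namespace, the `app: web-show` label, and the 30-second duration are assumptions modeled on the Chaos Mesh documentation example, since the talk does not show the file contents in the transcript.

```python
import json
import shlex

# Annotation key Chaos Mesh uses to pause an experiment.
PAUSE_ANNOTATION = "experiment.chaos-mesh.org/pause"


def network_delay(name, namespace, labels, latency, duration):
    """Build a NetworkChaos manifest that delays traffic on matching pods."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",  # inject into one matching pod
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": labels,
            },
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


def pause_command(name):
    """kubectl command that pauses the experiment by setting the annotation."""
    return ["kubectl", "annotate", "networkchaos", name, PAUSE_ANNOTATION + "=true"]


def resume_command(name):
    """kubectl command that resumes it; a trailing '-' removes the annotation."""
    return ["kubectl", "annotate", "networkchaos", name, PAUSE_ANNOTATION + "-"]


# Assumed values, modeled on the documentation example.
manifest = network_delay(
    name="web-show-network-delay",
    namespace="default",
    labels={"app": "web-show"},
    latency="10ms",
    duration="30s",
)
print(json.dumps(manifest, indent=2))
print(shlex.join(pause_command("web-show-network-delay")))
print(shlex.join(resume_command("web-show-network-delay")))
```

Pausing through an annotation leaves the object in the cluster, so the experiment can be resumed later without re-creating it.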

09:00 - Chaos Mesh Dashboard

Let me show you that dashboard. First, let me get the node port for the dashboard. Okay, this is the node port. Let me paste it here. Okay, this is the dashboard provided by Chaos Mesh. One thing to note here is that if we install Chaos Mesh using Helm, we get an additional layer of security, and we have to grant access to this dashboard by creating Kubernetes RBAC rules.

We can add users to the roles. There are two roles currently supported in Chaos Mesh. The first one is the viewer role, which gives read-only access to this dashboard. The second is the manager role, with which users can run experiments and check all the events in this dashboard.
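To make the difference between the two roles concrete, here is a hedged Python sketch that builds a Kubernetes Role for each. It assumes that the read-only role needs only `get`, `list`, and `watch` on the `chaos-mesh.org` API group, and that the manager role adds the write verbs. The role names and the exact rule set are illustrative assumptions, not something shown in the talk; the dashboard can generate the real rules for you.

```python
import json

READ_VERBS = ["get", "list", "watch"]
WRITE_VERBS = ["create", "delete", "patch", "update"]


def dashboard_role(name, namespace, manager):
    """Build a namespaced Role for dashboard access (illustrative sketch)."""
    # Managers get read and write verbs; the read-only role gets read verbs only.
    chaos_verbs = (READ_VERBS + WRITE_VERBS) if manager else list(READ_VERBS)
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            # Read access to pods so the dashboard can preview selector matches.
            {"apiGroups": [""], "resources": ["pods"], "verbs": list(READ_VERBS)},
            # Access to all Chaos Mesh custom resources in this namespace.
            {"apiGroups": ["chaos-mesh.org"], "resources": ["*"], "verbs": chaos_verbs},
        ],
    }


# Hypothetical role names for the example.
read_only_role = dashboard_role("chaos-read-only", "default", manager=False)
manager_role = dashboard_role("chaos-manager", "default", manager=True)
print(json.dumps(read_only_role, indent=2))
print(json.dumps(manager_role, indent=2))
```

A Role alone grants nothing until it is bound: in practice you would also create a ServiceAccount and a RoleBinding, then sign in to the dashboard with that account's token.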

On the very first page, we can see all the events and details about the experiments that we ran recently in our EKS cluster. The next tab is Workflows. If we want to combine different kinds of experiments, we can use this workflow option. We can run the experiments in sequence, or we can run different types of Chaos Mesh experiments in parallel. The next tab is Schedules. Schedules are similar to cron jobs in Kubernetes: we can run experiments at a specific interval of time. The next tab is Experiments. Here, we can see all the experiments that currently exist in our EKS cluster. You can see here that the experiment is currently in the paused state. We can get all the details about that experiment, and we can also check its events. The next tab is Events. Here, we can check all the events related to the experiments. And the next tab is Archives. If we want to delete an experiment, we first have to archive it from the Experiments tab. I can click this ‘archive’ button. Then, from the Archives section, we can delete that experiment. The last tab is Settings. Here we can change all the basic settings of this dashboard.

11:23 - Running an Experiment using the Schedule

Let us now run an experiment using the schedule. I’m going to create a new schedule here. For the experiment type, we get two options. The first one is Kubernetes, and the second one is the host. As I told you in my previous slide, if we want to run the experiments on our physical hosts or virtual machines, we have to install the ChaosD service separately. I am not going to use this host option. I’m going to use Kubernetes. Here you get all the Chaos experiments that we can run in our EKS cluster. I’m going to choose the pod fault. For this experiment, we get three options: we can simulate a pod failure, we can kill the pods, or we can kill a container in a specific pod. I’m going to choose to kill the pods. The grace period should be zero.

Here we have to fill in the schedule info. Chaos Mesh is intelligent enough to fetch all the details about the namespaces, labels, and selectors in our EKS cluster. You can see that all the namespaces are listed here. I’m going to choose the default namespace. We also get all the labels that are being used in our EKS cluster. This time, I’m going to kill the pods running on the Windows node. These are the pods whose label type is Windows pod, and they are running on the Windows node. For the mode, we can select ‘All’ if we want to inject this fault on all the pods, or we can select a fixed number. I’m going to select a fixed number of two here. You also get a preview of all the pods whose label type is Windows pod. Here we can define the name of our experiment. I’m going to call this one pod kill. This is the history limit, which controls how many past runs of this experiment are kept. For the starting deadline seconds, I’m going to choose zero. For the schedule, we have to use the same syntax as a cron job. I’m going to schedule it for every minute, so every minute this experiment will terminate two pods whose label type is Windows pod.

So we are good here; let me submit this. Now our schedule is created. Although we created this schedule from the dashboard, we can also get the details about it in the terminal. We can list the schedules that we created recently. You can see here that the schedule is there. Under the schedule, we selected PodChaos, so we can also list the PodChaos objects. Let me check it, okay. This is the PodChaos. This means that it will kill the pods in a few seconds. Let me show you the status of the pods. This experiment will kill the Windows pods shortly. We can also check the events for our experiment and get its details here. The events show that it was successful. Yes, here we go. You can see that the Windows pods are getting terminated two at a time. This is the whole power of Chaos Mesh. In conclusion, we can use Chaos Mesh to learn how our systems behave during outages or network faults in our real environments. In this way, Chaos Mesh is very useful for making our applications highly resilient and highly available. So that is all from my side.
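The schedule created through the dashboard in this demo corresponds roughly to a Schedule object that wraps a PodChaos spec. The Python sketch below builds such a manifest under stated assumptions: the name `pod-kill`, the label value `windows-pod`, the history limit of 1, and the concurrency policy are hypothetical, because the talk does not show the exact values. The cron expression `* * * * *` means every minute.

```python
import json


def pod_kill_schedule(name, namespace, labels, cron, count):
    """Build a Schedule manifest that runs a pod-kill PodChaos on a cron schedule."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedule": cron,             # cron syntax, as in a Kubernetes CronJob
            "historyLimit": 1,            # assumed: keep one past run
            "concurrencyPolicy": "Forbid",  # assumed: do not overlap runs
            "type": "PodChaos",
            "podChaos": {
                "action": "pod-kill",
                "mode": "fixed",          # kill a fixed number of matching pods
                "value": str(count),      # the count is given as a string
                "gracePeriod": 0,
                "selector": {
                    "namespaces": [namespace],
                    "labelSelectors": labels,
                },
            },
        },
    }


# Hypothetical name and label value; the demo targets pods on the Windows node.
schedule = pod_kill_schedule(
    name="pod-kill",
    namespace="default",
    labels={"type": "windows-pod"},
    cron="* * * * *",
    count=2,
)
print(json.dumps(schedule, indent=2))
```

Once such an object exists, `kubectl get schedule` and `kubectl get podchaos` list the schedule and the PodChaos objects it spawns, which matches what the terminal shows in the demo.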

Jasmeet Singh

Senior Support Engineer

nClouds

Jasmeet joined nClouds in 2020 as a Support Engineer. Since then, he has been promoted to Senior Support Engineer.