nSights Talks

Amazon EKS Optimizations (Part I)

Tutorial Highlights & Transcript

00:00 - EKS Optimizations Intro

Hello, everyone. My name is Parth Vyas, and today I'm going to give a demo on EKS optimization; this is part one of the series. So let's start. The name raises an obvious question: why do we need to optimize EKS at all? Because there are a few common problems that we have to tackle on our own. In this demo, I'm going to discuss two of them.

00:49 - Addressing the IP Exhaustion Problem

I will start with the first one: the IP exhaustion problem. Let me explain what the problem is. By default, EKS comes with the Amazon VPC CNI plugin, and the VPC CNI assigns each and every pod an IP address from the VPC. Most of the time, people use smaller subnets for the EKS worker nodes. Because every pod consumes an address, once lots of pods are running, those subnets run out of IP addresses. My suggestion is to always provide larger subnets for the EKS worker nodes, so that we don't have to keep managing subnets and networking later. This matters whenever a client deploys an EKS cluster with the VPC CNI: as the organization grows, you will otherwise have to add more subnets, and AWS reserves five IP addresses in each and every subnet. The more subnets you have, the more IP addresses are wasted. The best practice is to create larger subnets.
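To make the cost of the five reserved addresses concrete, here is a minimal Python sketch using the standard `ipaddress` module. It compares one /18 subnet with the same address space cut into /24 subnets; the /24 split is a hypothetical comparison, not part of the demo.

```python
import ipaddress

# AWS reserves the first four and the last address of every VPC subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Addresses left for ENIs (and therefore pods) after AWS reservations."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - AWS_RESERVED_PER_SUBNET


def total_usable(cidrs: list[str]) -> int:
    """Sum of usable addresses across a set of subnets."""
    return sum(usable_ips(c) for c in cidrs)


# The same 16,384 addresses carved two ways.
one_large = ["10.0.64.0/18"]
# Hypothetical alternative: the same /18 split into many small /24 subnets.
many_small = [str(s) for s in ipaddress.ip_network("10.0.64.0/18").subnets(new_prefix=24)]

print(len(many_small))            # 64
print(total_usable(one_large))    # 16379
print(total_usable(many_small))   # 16064 (64 subnets x 5 = 320 addresses reserved)
```

One large subnet loses 5 addresses; sixty-four small ones lose 320, and the loss grows with every subnet you add.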

For this demo, I have used three Availability Zones in us-east-1 (US East, N. Virginia). We want to create three private subnets for EKS, and we also need CIDR ranges for the public subnets and the database subnets. For that, I split the /16 CIDR range into four larger /18 subnets. The first range, 10.0.0.0/18, I kept for the public subnets, the database subnets, and the other data-layer, content-layer, and front-end tier subnets. The remaining private ranges, 10.0.64.0/18, 10.0.128.0/18, and 10.0.192.0/18, are for the EKS worker nodes. That way, the public and database subnets have enough IP addresses, and EKS also has enough room.

Solutions Architects and DevOps engineers need to make this decision with future growth in mind. By default, an AWS VPC supports up to five /16 CIDR ranges, which works out to roughly 327k IP addresses (5 × 65,536 = 327,680). If a client needs more IP addresses than that, we have to split the workload across multiple VPCs and multiple EKS clusters. If the client wants to stick with a single VPC and a single EKS cluster, the VPC CNI cannot support that scale, so we have to go with an overlay-based CNI instead. Some of the best CNIs available are Calico and Cilium; in my experience, Calico is the most efficient and fastest overlay CNI. That concludes the first problem, IP address exhaustion.
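The subnet plan above can be sketched with Python's standard `ipaddress` module. This assumes the VPC CIDR is 10.0.0.0/16 (implied by the first /18 being 10.0.0.0/18); the AZ names and the four-way tier split of the shared block are hypothetical illustrations, not the demo's exact layout.

```python
import ipaddress

# Assumed VPC CIDR, implied by the demo's first /18 block.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
# Illustrative AZ names for us-east-1 (name-to-zone mapping differs per account).
AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]
AWS_RESERVED_PER_SUBNET = 5

# Split the /16 into four /18 blocks of 16,384 addresses each.
blocks = list(VPC_CIDR.subnets(new_prefix=18))

# First /18: shared by the public, database, and other tier subnets.
shared_block = blocks[0]
# Hypothetical tier split of the shared block into four /20s.
tiers = dict(zip(["public", "database", "app", "spare"],
                 shared_block.subnets(new_prefix=20)))

# Remaining three /18s: one private EKS worker-node subnet per AZ.
eks_private = dict(zip(AZS, blocks[1:]))

# Default quota: five IPv4 CIDR blocks per VPC, each at most a /16.
MAX_CIDR_BLOCKS = 5
vpc_ceiling = MAX_CIDR_BLOCKS * VPC_CIDR.num_addresses

for name, net in tiers.items():
    print(f"{name:10} {net}")
for az, net in eks_private.items():
    print(f"{az:10} {net}  usable={net.num_addresses - AWS_RESERVED_PER_SUBNET}")
print("VPC ceiling:", vpc_ceiling)  # 327680
```

Each EKS worker subnet ends up with 16,379 usable addresses, and the last line shows the single-VPC ceiling that motivates moving to multiple VPCs or an overlay CNI.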

05:05 - Addressing the Worker Node Storage Full Problem

Now I will move on to the second problem: worker node storage filling up. Even if your organization or your application is small, development keeps going over time. New commits produce new container images, those images are pulled onto the nodes, and the application may be deployed several times a day or several times a week. All of that consumes node storage, and eventually it fills up. To handle this, Kubernetes comes with a built-in container image cleanup (image garbage collection) that is enabled by default. With the default kubelet settings, cleanup starts once disk usage reaches 85% and removes unused images until usage falls to 80%. But if development is fast and lots of container images are being pulled and deployed on the cluster, the default cleanup starts too late to keep up. For that kind of heavy use, I came across a setting in the kubelet, which is the component that handles container image cleanup. In the kubelet configuration, we should set the high threshold lower, to 60%. Once 60% of the storage is used, the cleanup process starts, so even with fast development there is still 40% of the storage left as headroom. The cleanup then removes images until usage drops to the low threshold, which I have set to 40%. The reason for not going lower is rollbacks: we want some of the older container images to remain on the node, so that a rollback is faster. I have used this configuration on around five or six EKS clusters, and once it is set up I don't have to check the storage anymore, because the kubelet takes care of it. It's a plug-and-play, set-it-and-forget-it configuration.
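The kubelet settings behind this are the `imageGCHighThresholdPercent` and `imageGCLowThresholdPercent` fields of the KubeletConfiguration (defaults 85 and 80). How you set them on EKS nodes depends on your AMI and bootstrap process, so that part is not shown. To illustrate how the two thresholds interact, below is a simplified Python model assuming the 60/40 values from the talk: nothing happens until usage reaches the high threshold, then unused images are removed least-recently-used first until usage reaches the low threshold. This is not the kubelet's actual code; the `Image` class, `image_gc` function, image names, and sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Image:
    name: str
    size_gb: float
    last_used: int          # larger value = used more recently
    in_use: bool = False    # images backing running containers are never removed


def image_gc(images, disk_gb, used_gb, high_pct=60, low_pct=40):
    """Simplified model of kubelet image garbage collection (hypothetical).

    If disk usage is at or above high_pct, delete unused images, least
    recently used first, until usage is at or below low_pct or there is
    nothing left to delete. Returns (names_removed, used_gb_after).
    """
    if not 0 <= low_pct < high_pct <= 100:
        raise ValueError("need 0 <= low_pct < high_pct <= 100")
    removed = []
    if used_gb / disk_gb * 100 < high_pct:
        return removed, used_gb          # below the high threshold: do nothing
    target_gb = disk_gb * low_pct / 100
    candidates = sorted((i for i in images if not i.in_use),
                        key=lambda i: i.last_used)
    for img in candidates:
        if used_gb <= target_gb:
            break                        # reached the low threshold: stop
        used_gb -= img.size_gb
        removed.append(img.name)
    return removed, used_gb


# Hypothetical node: 100 GB disk, 62 GB used (images plus other data).
images = [
    Image("app:v1", 12, last_used=1),
    Image("app:v2", 12, last_used=2),
    Image("app:v3", 8, last_used=3),               # kept: useful for a rollback
    Image("app:v4", 20, last_used=4, in_use=True),  # currently running
]
print(image_gc(images, disk_gb=100, used_gb=62))  # (['app:v1', 'app:v2'], 38)
```

The gap between the two thresholds is the trade-off described above: a lower high threshold leaves more headroom for bursts of new images, while a higher low threshold keeps more recent images on the node for fast rollbacks.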

Parth Vyas

DevOps Engineer

nClouds

Parth is a DevOps Engineer at nClouds with multiple AWS certifications including AWS Certified Solutions Architect - Professional, AWS Certified DevOps Engineer - Professional, AWS Certified Developer - Associate, and AWS Certified SysOps Administrator - Associate.