nSights Talks

Reduce AWS CloudFormation Frustration

Tutorial Highlights & Transcript

00:00 - nSights AWS CloudFormation (introduction)

I know infrastructure as code is something that we work on a lot. It’s a cornerstone of how we operate, for reusability and for removing errors. And there’s a frustration that I’ve felt, and I’m sure a lot of other engineers working on CloudFormation have felt it as well. So I’m going to demo a new feature that AWS released.

00:38 - Infrastructure as Code (IaC): The bricks of the ecosystem

Perfect. So as I mentioned, infrastructure as code is a cornerstone and the founding brick of how we architect our infrastructure. Once we scope out what needs to be done, the first step in any project is usually to create infrastructure as code templates, so that we can create a standardized, valid system that follows best practices. There’s also a lot of focus internally with nCode on setting up a standard library of infrastructure as code that has collaboration from the best engineers on hand, so that any system we create is reusable and up to the best standards that we and AWS want to put forward. It also removes human error and brings the reusability of architecture, like I mentioned. AWS’s preferred option for infrastructure as code is CloudFormation.

01:27 - AWS CloudFormation is AWS's preferred IaC option

It’s their own service, and a lot of cool new features have gone into it. Basically, you write a text file in YAML or JSON in which you scope out what the resources are going to be, what their characteristics are, and whether any constraints or conditions need to be applied. Once you run that through the console, AWS provisions those resources identically each time it is run. Overall, it’s a good system.
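To make the shape of such a template concrete, here is a minimal sketch, not the template used in the talk. It builds a JSON template in Python with one parameter (a constraint), one condition, and one resource. The names `EnvName`, `IsProd`, and `DemoBucket` are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical names (EnvName, IsProd, DemoBucket) are for illustration only.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal illustrative template",
    # Parameters carry constraints on the values a user may supply.
    "Parameters": {
        "EnvName": {
            "Type": "String",
            "AllowedValues": ["dev", "prod"],
            "Default": "dev",
        }
    },
    # Conditions are evaluated once per deployment from the parameters.
    "Conditions": {
        "IsProd": {"Fn::Equals": [{"Ref": "EnvName"}, "prod"]}
    },
    # Resources describe what gets provisioned and with which properties.
    "Resources": {
        "DemoBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {
                    "Status": {"Fn::If": ["IsProd", "Enabled", "Suspended"]}
                }
            },
        }
    },
}

# The JSON text is what would be uploaded as the template body.
template_body = json.dumps(template, indent=2)
print(template_body)
```

The same structure could equally be written by hand as a YAML or JSON file; building it in Python here is only a convenience for the sketch.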

01:58 - Frustration: Waiting for AWS CloudFormation to roll back the stack to the previous stable condition

But one of the things that I’ve personally experienced, and a lot of engineers have as well, is what happens when you start from scratch with a new stack. Since there is no stable version for the CloudFormation stack to roll back to, if there’s an issue with the stack itself, like an error or a typo, or a resource that hasn’t been configured properly, any resource that has been created up to that point gets deleted. It happens when you’re creating complex infrastructure and a resource isn’t properly mapped and causes an error. If you’re running a complex stack with a lot of resources, that means a lot of resources are going to get deleted, and some resources take a lot more time than others. For example, if there’s an RDS instance or a Lambda function that is leveraging a network adapter or network resources, in my experience those take a lot of time, up to 10-15 minutes for a decent-sized stack to get deleted.

So in a research environment, when you’re trying out something new, you push a stack for deployment, and there’s an issue, the standard thing we’ve seen is that there’s a lot of waiting for those resources. Then, for example, you make some changes to the stack, you push it again, and again there’s an error with a different resource. So now all those resources are going to get deleted again. This frustration causes a lot of idle time where you’re just waiting for that stack to be usable again.

And it might not even be the fault of the CloudFormation stack itself. It could be something outside of the stack. For example, you don’t have the correct IAM privileges, something that takes about 30 seconds to resolve. But because the CloudFormation stack is deleting itself, you have to wait those 15 minutes. It might also be that a service quota has been reached. That’s not something the stack itself can fix, but it’s an issue that will cause the stack to fail.

04:07 - Accelerate error remediation with AWS CloudFormation's feature, retry stack operations from the point of failure

So recently, on 30th August, AWS announced a new CloudFormation feature. It’s called retry stack operations. Basically, it gives you another option to select when you’re provisioning a stack. Any resource that gets created up to the stack failure point, AWS will keep; it won’t roll those resources back, even if there’s no stable stack version for the CloudFormation template. And it gives you a couple of different ways to reattempt the stack operation. You have the option to update the stack, so you can go back, actually fix the CloudFormation template, and rerun it, and it won’t make changes to the resources that were created previously. It will just start from the point of failure. It also allows you to just do a retry, so that if there’s an issue that resulted from access problems or something else outside the stack, you can quickly resolve it and hit retry, and it’ll start from the exact point of failure. So I tested this out on the nClouds AWS account. Let me show you real quick how this works.
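The talk uses the console, but the same options can be sketched against the API. The following is a minimal sketch, assuming a boto3 CloudFormation client is passed in as `client`; the helper function names are hypothetical. `DisableRollback=True` is the setting that keeps successfully provisioned resources when an operation fails, and `rollback_stack` is the explicit "give up and roll back" action.

```python
# Hypothetical helper names; `client` is assumed to be a boto3
# CloudFormation client, e.g. boto3.client("cloudformation").

def create_stack_preserving(client, stack_name, template_body):
    """Create a stack and keep provisioned resources if creation fails."""
    return client.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        DisableRollback=True,  # do not delete resources on failure
    )


def update_with_fixed_template(client, stack_name, fixed_template_body):
    """Resubmit a corrected template against the failed stack."""
    return client.update_stack(
        StackName=stack_name,
        TemplateBody=fixed_template_body,
        DisableRollback=True,
    )


def retry_unchanged(client, stack_name):
    """Retry with the same template after fixing an external problem.

    Assumption: resubmitting the previous template against a stack in a
    failed state behaves like the console's Retry action.
    """
    return client.update_stack(
        StackName=stack_name,
        UsePreviousTemplate=True,
        DisableRollback=True,
    )


def roll_back(client, stack_name):
    """Abandon the attempt and return to the last stable state."""
    return client.rollback_stack(StackName=stack_name)
```

With boto3 installed and credentials configured, `create_stack_preserving(boto3.client("cloudformation"), "demo-stack", template_body)` would be the entry point; the stack name here is made up.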

05:26 - Demo of using retry stack operations from the point of failure feature

So I’m going to go to the CloudFormation service. In the CloudFormation stack that I created, I have put in a Kinesis stream as well as an S3 bucket and a queue. And since I wanted to be a bit naughty for the first part of the demo, I did a couple of things wrong. I didn’t set a min and max value for the number of shards that can be set for the Kinesis stream, so that the CloudFormation template can fail. A Kinesis stream cannot be created with zero shards; it needs at least one. But since I haven’t put that best practice in the template itself, it will let me go forward with zero. I’ll put tags in so my resources don’t get destroyed. Perfect.

So here’s the actual feature that I talked about: the stack failure options. Previously, this was what you had here. This is the old one, roll back all stack resources, but now you can actually select this: preserve successfully provisioned resources. It’s a simple change, but it brings a lot of benefit if you’re creating a stack for the first time, compared with the first option, the older method without this new feature.

These three things are being created: a queue, a stream, and a bucket. You can see that the stream failed, because the shard count failed the minimum constraint for the number of shards, but it created my other resources, the bucket and the queue. And it fails at this point here, with a create failure. So if this were a significantly larger, more complex CloudFormation template that had databases and network resources, the stack wouldn’t have to delete everything; it would just stop at the point of failure for that particular resource. And this is the prompt that it gives you. It tells you that all successfully provisioned resources are still alive. You can choose to retry your stack if there’s a permission issue or a service issue, or you can go back and update the stack. You can even roll back: if you see something that might be critical and you just want to start from scratch, it gives you that option as well.

So I’m going to go to update and replace the current template. I’ll choose a new file; this one has the minimum and maximum values put in. If I go with zero again, it will not allow me, because there is a parameter that I updated in the stack. So I’ll go with a valid choice, let’s say two.
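The demo template itself is not shown in the talk, so the following is only a sketch of what the two versions might look like. The resource and parameter names (`DemoQueue`, `DemoBucket`, `DemoStream`, `ShardCount`) and the upper limit of 10 are assumptions. The `shard_count_allowed` helper is a local stand-in for the range check CloudFormation applies to a `Number` parameter with `MinValue` and `MaxValue`; it is not an AWS API.

```python
import json


def build_demo_template(with_shard_limits):
    """Build a queue + bucket + stream template, optionally range-checked."""
    shard_param = {"Type": "Number", "Default": 1}
    if with_shard_limits:
        # Assumed limits: at least 1 shard; the maximum of 10 is made up.
        shard_param.update({"MinValue": 1, "MaxValue": 10})
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {"ShardCount": shard_param},
        "Resources": {
            "DemoQueue": {"Type": "AWS::SQS::Queue"},
            "DemoBucket": {"Type": "AWS::S3::Bucket"},
            "DemoStream": {
                "Type": "AWS::Kinesis::Stream",
                "Properties": {"ShardCount": {"Ref": "ShardCount"}},
            },
        },
    }


def shard_count_allowed(template, value):
    """Local mimic of the parameter range check (not an AWS call)."""
    param = template["Parameters"]["ShardCount"]
    low = param.get("MinValue", float("-inf"))
    high = param.get("MaxValue", float("inf"))
    return low <= value <= high


broken = build_demo_template(with_shard_limits=False)
fixed = build_demo_template(with_shard_limits=True)

print(shard_count_allowed(broken, 0))  # True: zero slips through, stream fails later
print(shard_count_allowed(fixed, 0))   # False: rejected before any resource is touched
print(shard_count_allowed(fixed, 2))   # True: the valid choice from the demo
print(json.dumps(fixed["Parameters"], indent=2))
```

The point of the range constraint is that a bad shard count is rejected at parameter validation time, instead of surfacing as a failed Kinesis stream partway through provisioning.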

So now it updates the stack, but it starts from the point of failure. It won’t recreate the resources that are already up and synced to the stack; it will just try to recreate the resource that failed and any dependencies it has. I found it pretty helpful. I’ve already used this not only for this demo, but also for a stack that I was creating for a personal project. The project consists of DynamoDB, RDS, and S3, quite a few different interlinked resources, and since it’s a trial-and-error personal project, I have routinely messed it up. Every time I messed up, it would take five to eight minutes to watch all the resources get deleted, resources that I know are okay and don’t need to be deleted. So I plugged this feature in, and I’m immediately seeing a productivity increase, where every new feature update that I do doesn’t take me 15 minutes of waiting. It just updates from the point of failure.

So this is updating. I’ll show you the progress from the point of failure in a minute, once it creates that stream, but yeah, that was my demo. Additional information on how I used it in my own project can be found on my blog. I did a small write-up adding my thoughts and input on what this feature is. I also wanted to point out that I know there are some teams here that are using AWS CDK to do infrastructure as code, and this feature is going to be rolled out to them in a couple of weeks. So that was my demo.

Saad Lodhi

DevOps Engineer

nClouds

Saad joined nClouds in 2018 as a Senior Solutions Architect. He holds several AWS Certifications, including Big Data - Specialty, Solutions Architect - Associate, Developer - Associate, and Cloud Practitioner.