Best practices for effective incident management on AWS

May 6, 2022 | Announcements, Migration, MSP

Incident management is the technique used by IT and DevOps teams when responding to any unplanned incident or interruption. An incident is any event that requires an immediate response from the operations team. Incident management intervenes to restore services to their operational state. Incident management is a necessity in today’s always-on competitive world.

Why use incident management best practices?

Today’s world is fast, dynamic, competitive, and results-driven. Some might say that society is like a child who only knows instant gratification. For companies, whether online or offline, the supply chain is their livelihood, and speed, reliability, and consistency are the order of the day. This is especially true online. Every company is looking for a system, technique, or method to enhance its quality and reliability, and wants its services to run all the time without delay or obstruction. That makes incident management vital.

The question is not whether a compute incident will happen but when, and how prepared your organization is to respond. An incident can happen at any time, without warning. That is why incident management response teams that follow best practices are of paramount importance. Properly trained incident management response teams decrease an organization’s downtime costs.

The goal of every incident management response team is to mitigate problems quickly and restore operational services expeditiously, in other words, to reduce MTTR. MTTR (Mean Time To Repair) is the average time it takes to repair a system (including both the repair time and any testing time) until the system is fully functional again.
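As a minimal sketch of the arithmetic behind MTTR, the average can be computed from a list of incidents with a start and a "fully functional again" timestamp. The function name `mean_time_to_repair` and the incident records below are hypothetical and only illustrate the definition; they are not taken from any real system.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average of (restored - started) over all incidents, as a timedelta.

    Each incident is a (started, restored) pair, where `restored` is the
    moment the system was fully functional again (repair plus testing).
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum((restored - started for started, restored in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident records, for illustration only.
incidents = [
    (datetime(2022, 5, 1, 10, 0), datetime(2022, 5, 1, 10, 30)),  # 30 minutes
    (datetime(2022, 5, 2, 14, 0), datetime(2022, 5, 2, 15, 30)),  # 90 minutes
    (datetime(2022, 5, 3, 9, 0), datetime(2022, 5, 3, 10, 0)),    # 60 minutes
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 1:00:00
```

Tracking this number per month or per service is one way to see whether changes such as runbook automation are actually shortening repairs.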

Incident management processes vary depending on the type of organization. No single process fits all organizations, but all organizations should structure their incident management training and responses based on AWS best practices. Guidelines that follow AWS best practices for incident management include creating robust workflows that identify most major incidents, automating where possible, reviewing past critical incidents, providing multi-channel support, and communicating effectively.

On-call procedures that the nClouds 24/7 support team follows.

Incident prioritization. One of the first steps for our incident response team at nClouds is to prioritize incidents. This takes experience. Incident prioritization is essential because alert fatigue is real and a major cause of burnout.

Postmortems. Along with prioritization, response teams must implement blameless postmortems. In a blameless postmortem, it’s assumed that every team and employee acted with the best intentions based on the information they had at the time. Instead of identifying — and punishing — whoever screwed up, blameless postmortems focus on improving performance moving forward.

A postmortem is a process of learning from past incidents, including outages and downtimes. It involves a detailed analysis of each step, from where and when the incident started until it was resolved. On that basis, we can implement a set of actions to prevent a similar incident from recurring.

Training builds confidence. One of the most important factors involved in creating a great response team is to give your team of engineers quality training and the security of knowing that their backs are covered. Your engineers may be hesitant to respond if they are afraid of failure or are insecure due to the lack of knowledge of the services. Avoid this by developing a proper training program for the new engineers on the team. Provide them with all the documents and up-to-date runbooks so they can quickly and easily find the exact remediation for each incident, and take the appropriate action without any doubt or fear.

Fair work schedules. It is crucial to rotate engineers through on-call shifts. Take a qualitative approach to distributing work: if we look only at the time spent on call, we can’t get an accurate picture of who is most likely to develop alert fatigue or who has a lighter workload during their shifts. So, it is essential to adjust schedules as needed to distribute work fairly.
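One way to look beyond time spent on call is to weight each page by how disruptive it is. The sketch below is only an illustration of that idea: the weights, the 9:00 to 18:00 business-hours window, and the engineer names are all assumptions, not an nClouds formula.

```python
from collections import defaultdict

# Hypothetical weights: an off-hours page is assumed to cost twice as much
# as one received during business hours.
BUSINESS_HOURS_WEIGHT = 1.0
OFF_HOURS_WEIGHT = 2.0


def on_call_load(pages):
    """Return {engineer: weighted load} from (engineer, hour_of_day) pairs."""
    load = defaultdict(float)
    for engineer, hour in pages:
        # Assumed business hours: 09:00 up to (not including) 18:00.
        in_business_hours = 9 <= hour < 18
        load[engineer] += BUSINESS_HOURS_WEIGHT if in_business_hours else OFF_HOURS_WEIGHT
    return dict(load)


# Hypothetical pages: "ana" is paged less often, but always at night.
pages = [("ana", 3), ("ana", 23), ("ben", 11), ("ben", 14), ("ben", 16)]

load = on_call_load(pages)
most_loaded = max(load, key=load.get)
print(load, most_loaded)
```

Here the engineer with fewer pages ends up with the higher weighted load, which is the kind of imbalance a count of hours or pages alone would hide.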

Runbook currency. Also, keep your runbooks up-to-date. Runbooks are predefined procedures for operational use. They help check for codes to determine which people or teams an issue must be escalated to, based on the type of issue. Runbooks also reference incident postmortem reports and retrospective processes, associated with the specific kind and severity of the incident. AWS lists various runbook requirements, such as procedural steps, expected outcomes, and escalation procedures.

Remember, once an incident is identified and reported, employees must know where to call or report the issue so that immediate action can be taken. Therefore, the organization’s members need to know how critical the issues are and the steps required to handle those incidents. There are many ways to determine the severity of an incident, but the best way is to first analyze the impact on your client.
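A severity rule based on client impact can be as simple as a few thresholds. The following is a sketch under stated assumptions: the `SEV1` to `SEV4` labels, the percentage thresholds, and the function name `classify_severity` are hypothetical and would need to be replaced with whatever your organization has agreed on.

```python
def classify_severity(customers_affected_pct, core_service_down):
    """Map client impact to a severity label using hypothetical thresholds."""
    if not 0 <= customers_affected_pct <= 100:
        raise ValueError("customers_affected_pct must be between 0 and 100")
    # Assumed rule: a core service outage is always the highest severity.
    if core_service_down or customers_affected_pct >= 50:
        return "SEV1"
    if customers_affected_pct >= 10:
        return "SEV2"
    if customers_affected_pct > 0:
        return "SEV3"
    # No client impact: lowest severity, can wait for business hours.
    return "SEV4"


print(classify_severity(5, core_service_down=False))   # SEV3
print(classify_severity(5, core_service_down=True))    # SEV1
print(classify_severity(20, core_service_down=False))  # SEV2
```

Writing the rule down, in code or in a runbook, means whoever receives the report does not have to decide under pressure how critical the issue is.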

For more information about automating runbooks and reducing MTTR, check our blog posts:

Tutorial: How to automate a runbook to reduce MTTR

How 24/7 support with integrated runbook automation reduces MTTR and minimizes website downtime on AWS

How to get on board with 24/7 support that delivers reduced MTTR faster

Need help with 24/7 Support or Site Reliability Engineering? nClouds helps keep your systems fast and reliable, with maximum uptime as they scale — so your engineers can focus on innovation.
