While managing infrastructure for multiple customers, our 24/7 Support Services engineers are often bombarded with alerts. It’s critical for them to separate the “signal from the noise” so they don’t succumb to alert fatigue: becoming so desensitized to an overwhelming number of alerts that they miss or ignore some, or respond late. Left unchecked, alert fatigue can result in engineer burnout, increased website downtime and costs, and, unfortunately, even an SLA breach.
Delayed responses to incidents and alerts can lead to lost revenue, opportunity costs, and damage to brand reputation. When an alert or incident occurs, the mission becomes “resolve the incident before a timeout occurs.” This is where our 24/7 Support Services team uses its secret sauce: nCall, our alert and incident management platform. Our support engineers use nCall’s Integrated Runbooks to automate incident management, reduce MTTR, and minimize website downtime on AWS.
Integrated Runbook automation improves efficiency and reduces the risk of human error. The time it saves enables our 24/7 Support Services engineers to focus on developing and improving procedures for tackling recurring issues. Updates to Integrated Runbook documentation are executed seamlessly so that engineers are accessing the most current remediation procedures. Relevant resolution guidance is automatically attached to each incident.
Compare this to the manual process of incident management, which involves digging out a specific runbook, searching for the cause(s), determining the relevant solution from the runbook, applying the remediation, and finally getting the incident resolved. This manual process consumes precious time, which can result in higher MTTR. The repetition it involves is daunting at best and often leaves engineers unimaginably frustrated.
nCall gives our 24/7 Support Services engineers a “get-out-of-jail-free” card by avoiding the manual process. No more endless searches or out-of-date runbooks. nCall identifies, categorizes, investigates, notifies, and provides the necessary remediation steps, all based on text processing algorithms and tags.
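To make the idea of tag-based matching concrete, here is a minimal Python sketch. It is not nCall's actual algorithm, which this post does not describe in detail; the `RunbookEntry` fields, the `tokenize` and `best_match` helpers, and the sample entries are all hypothetical. The sketch simply assumes that an incoming alert's text is split into tokens and compared against the tags stored with each runbook entry.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookEntry:
    """Hypothetical runbook entry; the field names are illustrative only."""
    alert: str
    tags: frozenset[str]
    remediation: str


def tokenize(text: str) -> set[str]:
    """Lowercase the alert text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(alert_text: str, entries: list[RunbookEntry]) -> RunbookEntry | None:
    """Return the entry whose tags overlap most with the alert text, or None."""
    tokens = tokenize(alert_text)
    scored = [(len(tokens & entry.tags), entry) for entry in entries]
    score, entry = max(scored, key=lambda pair: pair[0], default=(0, None))
    return entry if score > 0 else None


# Sample data, invented for this illustration.
entries = [
    RunbookEntry(
        alert="High CPU on web tier",
        tags=frozenset({"cpu", "high", "ec2"}),
        remediation="Scale out the Auto Scaling group.",
    ),
    RunbookEntry(
        alert="Disk almost full",
        tags=frozenset({"disk", "full", "storage"}),
        remediation="Rotate logs and expand the volume.",
    ),
]

match = best_match("ALERT: disk usage 95% full on host web-1", entries)
if match is not None:
    print(match.remediation)
```

A production matcher would likely weight tags, handle synonyms, and use alert metadata from the monitoring tool, but the core step of mapping alert text to a stored remediation looks much like this.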
nCall’s runbooks are integrated with alerting, ticketing, and communication tools like Datadog, Amazon CloudWatch, PagerDuty, New Relic, and OpsGenie to seamlessly sync alert data and streamline workflow.
With nCall’s Integrated Runbooks, engineers can:
- Create a client-specific runbook.
- Configure alerts inside a runbook collated with related causes, remediations and other relevant details.
- Perform a contextual search based on alerts, causes, remediation types, tags, etc.
- Retrieve previously created and validated remediation steps for specific or similar runbook alerts.
- List all created runbooks, then select a specific runbook for viewing, editing, publishing, etc.
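As a rough illustration of the capabilities listed above, the following Python sketch models a client-specific runbook whose alerts carry causes, remediations, and tags, plus a simple contextual search across those fields. The `AlertEntry` and `Runbook` classes, their field names, and the sample client are assumptions made for this example, not nCall's real data model or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AlertEntry:
    """Hypothetical alert configured inside a runbook."""
    name: str
    causes: list[str]
    remediations: list[str]
    tags: set[str] = field(default_factory=set)


@dataclass
class Runbook:
    """Hypothetical client-specific runbook."""
    client: str
    entries: list[AlertEntry] = field(default_factory=list)
    published: bool = False

    def search(self, term: str) -> list[AlertEntry]:
        """Return entries whose name, causes, remediations, or tags mention the term."""
        needle = term.lower()
        return [
            entry
            for entry in self.entries
            if needle in entry.name.lower()
            or any(needle in cause.lower() for cause in entry.causes)
            or any(needle in step.lower() for step in entry.remediations)
            or needle in {tag.lower() for tag in entry.tags}
        ]


# Sample data, invented for this illustration.
runbook = Runbook(
    client="acme-corp",
    entries=[
        AlertEntry(
            name="RDS connections exhausted",
            causes=["Connection leak in application pool"],
            remediations=["Restart the application workers", "Raise max_connections"],
            tags={"rds", "database"},
        ),
        AlertEntry(
            name="ELB 5xx spike",
            causes=["Unhealthy targets after deploy"],
            remediations=["Roll back the latest deploy"],
            tags={"elb", "deploy"},
        ),
    ],
)

for entry in runbook.search("deploy"):
    print(entry.name, "->", entry.remediations)
```

Keeping causes, remediations, and tags on the same record is what lets a single search term surface the right entry regardless of which field an engineer happens to remember.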
For each incident, a View Remediation option appears under the Action column, where the engineer receives the necessary remediation steps retrieved from the respective runbook. The engineer applies the steps, resolves the incident, sips her latte, and then has the option to review the entire runbook for a deeper search, if needed.
The nClouds 24/7 Support Services team uses nCall’s Integrated Runbooks to reduce MTTR and minimize website downtime on AWS for our customers. Integrated Runbook automation improves efficiency and reduces the risk of human error. No more endless searches or out-of-date runbooks. nCall identifies, categorizes, investigates, notifies, and provides the necessary remediation steps. The time it saves enables engineers to focus on developing and improving procedures for tackling recurring incidents.
Need help meeting your AWS infrastructure support SLAs? Would you rather refocus your engineers on innovation than on providing infrastructure support? The nClouds 24/7 Support Services team is here to help you maximize website uptime, performance, and stability and achieve your AWS infrastructure support SLAs at a competitive rate. Contact us.
View this on-demand webinar featuring SRE experts from nClouds & Datadog: How DevOps Teams Use SRE to Innovate Faster with Reliability.