Optimizing the on-call process is as important as fixing repetitive incidents. The on-call process must eliminate the need to interact with multiple systems and enable engineers to provide effortless feedback on incidents. In this blog post, we’ll talk about two improvements we made to our on-call process to reduce alert fatigue and simplify post-mortems.
nClouds is an AWS managed service provider. Our team is on call 24×7 to handle critical issues as they arise. We like PagerDuty because we can notify whoever is on call, instead of everyone. However, some of our clients send Nagios alerts to their email addresses in addition to sending them to PagerDuty. I know it defeats the whole purpose of PagerDuty if you are still sending alerts to email, but every company has existing workflows built around email. Rather than waiting for a migration away from email notifications, the faster option is to send notifications to PagerDuty and keep the existing email notifications.
But this introduces another problem: every time we acknowledge an alert in PagerDuty, we also have to acknowledge it in Nagios. We normally also comment on the issue, once in PagerDuty and once in Nagios. As you can see, this system creates double work for the on-call engineer.
PagerDuty has a great article on how to achieve two-way integration with Nagios. So, once we acknowledge alerts in PagerDuty, PagerDuty acknowledges the alerts in Nagios and everyone stops getting notifications. The engineer can then focus on fixing the issue.
By the way, if you use the script from PagerDuty, acknowledging an alert in PagerDuty doesn’t send an acknowledgement email back to the contacts. If that’s important to you, you can download our updated version, which sends the acknowledgement notification to the email contacts.
In the official article, instead of step 9 use the following:
The above CGI script enables notifications by overriding the $notify flag and setting it to 1.
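To make the role of the notify flag concrete, here is a minimal Python sketch of the Nagios external command that an acknowledgement ends up writing. It is not the PagerDuty CGI script itself (which is written in Perl); the function name, default author, and default comment are assumptions for illustration. The fourth-from-last field of the command is the notify flag: with 1, Nagios sends an acknowledgement notification to the contacts; with 0, it stays silent.

```python
import time


def build_ack_command(host, service=None, notify=1, sticky=2, persistent=0,
                      author="PagerDuty", comment="Acknowledged in PagerDuty",
                      now=None):
    """Build one Nagios external command line that acknowledges a problem.

    notify=1 asks Nagios to send an acknowledgement notification to the
    contacts; notify=0 suppresses it. Names and defaults are hypothetical.
    """
    timestamp = int(time.time() if now is None else now)
    if service:
        # Service problem: host and service description are both required.
        body = (f"ACKNOWLEDGE_SVC_PROBLEM;{host};{service};"
                f"{sticky};{notify};{persistent};{author};{comment}")
    else:
        # Host problem: there is no service description field.
        body = (f"ACKNOWLEDGE_HOST_PROBLEM;{host};"
                f"{sticky};{notify};{persistent};{author};{comment}")
    return f"[{timestamp}] {body}\n"


if __name__ == "__main__":
    # In a real setup this line would be appended to the Nagios command
    # file, whose path depends on the installation.
    print(build_ack_command("web01", "HTTP"), end="")
```

The host name and service description must match the Nagios object definitions exactly, otherwise Nagios ignores the command.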
How to do an effective post-mortem?
Receiving the same alerts a second time is one time too many. Allowing the on-call engineer to take notes at the time of the incident works well because, usually, after the incident, we forget how frustrated we were at that time. So, at the end of the week, we can review the alerts with the notes to take action. PagerDuty has the option to add notes to incidents. But the notes we add to incidents are NOT included in the reports when we pull the incident reports from the UI in PagerDuty Analytics.
We wrote a script that downloads the incidents along with their comments. Each week we review the alerts and the comments and figure out how to stop these issues from occurring in the first place.
Steps to use the script:
Create a Read-only PagerDuty API Key
- Log in to PagerDuty and go to Configuration → API Access
- Click on Create New API Key and create a new Read-only API Key
Pulling Reports with Notes
- Clone the repo below. It has a Python script that pulls the notes into the reports.
- Install the requirements. Using a virtual env is optional. Follow the instructions in the README to get the script working.
- Now download the report into the script's folder as incidents.csv and run the script.
Analytics → Incidents → Download as CSV for the required dates.
- reports.csv will be created with notes as the last column.
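For readers who want to see the idea behind these steps, below is a minimal sketch of the approach, not the script from the repo. It reads incidents.csv, asks the PagerDuty REST API for the notes of each incident using the read-only key, and writes reports.csv with the notes as the last column. The column name `id`, the ` | ` separator, and the function names are assumptions; check the header of your own export before using it.

```python
import csv
import json
import urllib.request

NOTES_URL = "https://api.pagerduty.com/incidents/{incident_id}/notes"


def fetch_notes(incident_id, api_key):
    """Return the note texts for one incident from the PagerDuty REST API."""
    request = urllib.request.Request(
        NOTES_URL.format(incident_id=incident_id),
        headers={
            "Authorization": f"Token token={api_key}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [note["content"] for note in payload.get("notes", [])]


def add_notes(rows, get_notes, id_column="id"):
    """Copy each row and append a 'notes' field as the last column.

    id_column is an assumption about the CSV export's header.
    """
    merged = []
    for row in rows:
        new_row = dict(row)
        new_row["notes"] = " | ".join(get_notes(row[id_column]))
        merged.append(new_row)
    return merged


def write_report(api_key, in_path="incidents.csv", out_path="reports.csv"):
    """Read the exported incidents, attach notes, and write the report."""
    with open(in_path, newline="") as source:
        rows = list(csv.DictReader(source))
    merged = add_notes(rows, lambda incident_id: fetch_notes(incident_id, api_key))
    if not merged:
        return
    with open(out_path, "w", newline="") as target:
        writer = csv.DictWriter(target, fieldnames=list(merged[0].keys()))
        writer.writeheader()
        writer.writerows(merged)
```

Keeping the API call separate from the merge step means the merge can be tried with canned notes before a real API key is involved.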
We would love to know what tips and tricks you use to build a better culture around on-call. If you need 24×7 support for your infrastructure, feel free to contact us and we’ll get back to you shortly.