Archives

Five fundamental best practices for your SRE team

23 Sep 2021

In 2003, Benjamin Treynor Sloss, generally credited with coining the term “site reliability engineering (SRE),” was put in charge of running Google’s production team, consisting of seven engineers. Before the DevOps movement, this first team of software engineers was tasked with making Google’s already large-scale sites more reliable, efficient, and scalable. Today, SRE is a […]

A real-world example: How to set Service-Level Objectives (SLOs)

15 Jul 2021

Here at nClouds, when we onboard a new Site Reliability Engineering (SRE) customer, our SRE team conducts an onboarding workshop that explains SRE concepts. We collaborate with the customer during the workshop to discuss how to define, measure, and track availability and user happiness, including: Defining SLIs (Service-Level Indicators), the metrics that measure compliance with […]

Getting started with site reliability engineering (SRE)

14 Jul 2021

Benjamin Treynor Sloss (VP Engineering, Google) is generally credited with coining the term “site reliability engineering (SRE)” in 2003, and its buzzworthiness has grown in the past few years. Beyond the buzzword, how do you know if your company is ready to implement SRE? Here are some signs that you need to consider getting started: […]

Fast-track incident management on AWS with interactive dashboards

04 Jun 2021

nClouds’ 24/7 Support Services engineers use Interactive Dashboards (using our tool, nCall) to accelerate incident management. They get rapid, real-time insights on MTTR trends, incident frequency, recurring incidents, and more — vital when difficulties arise and important remediation decisions must be made quickly. These dashboards have embedded analytics that provide valuable insights into the health […]

How to get on board with 24/7 support that delivers reduced MTTR faster

01 Jun 2021

Today’s customers demand 24/7 availability from online businesses. It’s about keeping customers happy with your website’s uptime and performance. To keep customers satisfied, your engineers need to meet service level objectives (SLOs) — the targets for your system’s reliability. However, we often hear that companies’ in-house support teams are overwhelmed, suffering from burnout, and developing […]

How 24/7 support with integrated runbook automation reduces MTTR and minimizes website downtime on AWS

28 May 2021

While managing the infrastructure of multiple customers, our 24/7 Support Services engineers are often bombarded with alerts. It’s critical for them to separate the “signal from the noise” so they don’t succumb to alert fatigue, that is, becoming desensitized to an overwhelming number of alerts, which can result in missed or ignored […]

Accelerate your microservice architecture incident response process using service maps

10 Mar 2021

Recent studies indicate that the cost of IT downtime is between $9,000 and $12,000 per minute, depending on industry vertical, organization size, and business model. That cost includes business disruption, revenue loss, and end-user productivity. To protect SLAs and mitigate downtime, the first approach is to accelerate the incident resolution process and find the root […]

Tips to reduce alert fatigue and avoid recurring incidents

19 Oct 2020

At nClouds, many of our 24/7 Support Services customers have some pretty aggressive Service Level Agreement (SLA) deadlines. So, we continuously search for strategies to help them separate the “signal from noise.” In this blog post, I’ll provide tips on the strategies we use to help our customers reduce alert fatigue and avoid recurring incidents. […]

How to aggregate monitoring alerts to reduce alert fatigue

09 Sep 2020

Here at nClouds, we manage the infrastructure needs of many of our customers so that they can focus on building awesome products and delivering value to their customers. Since we are managing the infrastructure of multiple customers, the number of alerts can skyrocket pretty quickly if not managed properly. So we always look for ways to reduce unintended noise to avoid alert fatigue. Alert fatigue […]
