In 2003, Benjamin Treynor Sloss, generally credited with coining the term “site reliability engineering (SRE),” was put in charge of running Google’s production team, consisting of seven engineers. Before the DevOps movement, this first team of software engineers was tasked with making Google’s already large-scale sites more reliable, efficient, and scalable. Today, SRE is a […]
Here at nClouds, when we onboard a new Site Reliability Engineering (SRE) customer, our SRE team conducts an onboarding workshop that explains SRE concepts. We collaborate with the customer during the workshop to discuss how to define, measure, and track availability and user happiness, including: Defining SLIs (Service-Level Indicators), the metrics that measure compliance with […]
Benjamin Treynor Sloss (VP Engineering, Google) is generally credited with coining the term “site reliability engineering (SRE)” in 2003, and its buzzworthiness has grown in the past few years. Beyond the buzzword, how do you know if your company is ready to implement SRE? Here are some signs that you need to consider getting started: […]
nClouds’ 24/7 Support Services engineers use Interactive Dashboards (using our tool, nCall) to accelerate incident management. They get rapid, real-time insights on MTTR trends, incident frequency, recurring incidents, and more, which is vital when difficulties arise and important remediation decisions must be made quickly. These dashboards have embedded analytics that provide valuable insights into the health […]
Today’s customers demand 24/7 availability from online businesses. It’s about keeping customers happy with your website’s uptime and performance. To keep customers satisfied, your engineers need to meet service level objectives (SLOs), the targets for your system’s reliability. However, we often hear that companies’ in-house support teams are overwhelmed, suffering from burnout, and developing […]
While managing the infrastructure of multiple customers, our 24/7 Support Services engineers often get bombarded with alerts. It’s critical for them to separate the “signal from the noise” so they don’t succumb to alert fatigue, that is, becoming desensitized to an overwhelming number of alerts that can result in missed or ignored […]
Recent studies indicate that the cost of IT downtime is between $9,000 and $12,000 per minute, depending on industry vertical, organization size, and business model. That cost includes business disruption, revenue loss, and end-user productivity. To protect SLAs and mitigate downtime, the first approach is to accelerate the incident resolution process and find the root […]
At nClouds, many of our 24/7 Support Services customers have some pretty aggressive Service Level Agreement (SLA) deadlines. So, we continuously search for strategies to help them separate the “signal from noise.” In this blog post, I’ll provide tips on the strategies we use to help our customers reduce alert fatigue and avoid recurring incidents. […]
Here at nClouds, we manage the infrastructure needs of many of our customers so that they can focus on building awesome products and delivering value to their customers. Since we are managing the infrastructure of multiple customers, the number of alerts can skyrocket pretty quickly if not managed properly. So we always look for ways to reduce unintended noise to avoid alert fatigue. Alert fatigue […]
Top takeaways: AWS Managed Microsoft AD and Microsoft Active Directory
2022-12-05 15:25:16