Site reliability engineering (SRE) teams often use chaos engineering to proactively prove and improve resilience during fault conditions. At nClouds, we use chaos engineering to experiment with our infrastructure. We check for weak points in our systems and applications by intentionally injecting failure events within controlled environments. In this way, we evaluate how our systems […]
Incident management is the technique used by IT and DevOps teams when responding to any unplanned incident or interruption. An incident is any event that requires an immediate response from the operations team. Incident management intervenes to restore services to their operational state. Incident management is a necessity in today’s always-on competitive world. Why use […]
Why container monitoring is critical for modern cloud environments
Modern cloud application environments are complex, running across hundreds or even thousands of compute instances. Because of this complexity, modern applications require container monitoring to continuously collect metrics, track potential failures, and gather granular insights into container behavior. So, it’s not a question of whether or […]
In this blog, I’ll provide a step-by-step tutorial on automating a runbook to reduce MTTR by using Amazon EventBridge (EventBridge) and Datadog. Datadog is used as a monitoring tool, and EventBridge is used to remediate issues and automatically resolve any alerts. EventBridge is a serverless event bus. It makes building an event-driven workflow for applications […]
In 2003, Benjamin Treynor Sloss, generally credited with coining the term “site reliability engineering (SRE),” was put in charge of running Google’s production team, consisting of seven engineers. Before the DevOps movement, this first team of software engineers was tasked with making Google’s already large-scale sites more reliable, efficient, and scalable. Today, SRE is a […]
AWS Premier Consulting Partner Ranks as 10th Fastest-Growing IT Solution Provider
SAN FRANCISCO, August 31, 2021 — nClouds (www.nclouds.com), a provider of Amazon Web Services (AWS) and DevOps consulting and implementation services and an AWS Premier Consulting Partner, announced today the expansion of its executive team with the addition of Mark Solomon as Vice President, DevOps […]
SAN FRANCISCO, July 28, 2021 — nClouds (www.nclouds.com), a provider of Amazon Web Services (AWS) and DevOps consulting and implementation services and an AWS Premier Consulting Partner, announced today the expansion of its 24/7 on-call support services to include site reliability engineering (SRE) services. A top managed service provider (MSP), the company also announced it […]
Here at nClouds, when we onboard a new Site Reliability Engineering (SRE) customer, our SRE team conducts an onboarding workshop that explains SRE concepts. We collaborate with the customer during the workshop to discuss how to define, measure, and track availability and user happiness, including: Defining SLIs (Service-Level Indicators), the metrics that measure compliance with […]
Benjamin Treynor Sloss (VP Engineering, Google) is generally credited with coining the term “site reliability engineering (SRE)” in 2003, and its buzzworthiness has grown in the past few years. Beyond the buzzword, how do you know if your company is ready to implement SRE? Here are some signs that you need to consider getting started: […]
Top takeaways: AWS Managed Microsoft AD and Microsoft Active Directory
2022-12-05 15:25:16