Chaos engineering in Kubernetes using Chaos Mesh

11 Aug 2022

Site reliability engineering (SRE) teams often use chaos engineering to proactively prove and improve resilience during fault conditions. At nClouds, we use chaos engineering to experiment with our infrastructure. We check for weak points in our systems and applications by intentionally injecting failure events within controlled environments. In this way, we evaluate how our systems […]

Best practices for effective incident management on AWS

06 May 2022

Incident management is the technique used by IT and DevOps teams when responding to any unplanned incident or interruption. An incident is any event that requires an immediate response from the operations team. Incident management intervenes to restore services to their operational state. Incident management is a necessity in today’s always-on competitive world. Why use […]

How to use code-free Datadog Synthetic Monitoring for simulated API and browser testing

21 Jan 2022

Why container monitoring is critical for modern cloud environments: Modern cloud application environments are complex, running across hundreds or even thousands of compute instances. Because of this complexity, modern applications require container monitoring to continuously collect metrics, track potential failures, and gather granular insights into container behavior. So, it’s not a question of whether or […]

Tutorial: How to automate a runbook to reduce MTTR

05 Oct 2021

In this blog, I’ll provide a step-by-step tutorial on automating a runbook to reduce MTTR by using Amazon EventBridge (EventBridge) and Datadog. Datadog is used as a monitoring tool, and EventBridge is used to remediate issues and automatically resolve any alerts. EventBridge is a serverless event bus. It makes building an event-driven workflow for applications […]

Five fundamental best practices for your SRE team

23 Sep 2021

In 2003, Benjamin Treynor Sloss, generally credited with coining the term “site reliability engineering (SRE),” was put in charge of running Google’s production team, consisting of seven engineers. Before the DevOps movement, this first team of software engineers was tasked with making Google’s already large-scale sites more reliable, efficient, and scalable. Today, SRE is a […]

nClouds expands executive team with DevOps, SRE, FinOps industry experts

31 Aug 2021

AWS Premier Consulting Partner Ranks as 10th Fastest-Growing IT Solution Provider SAN FRANCISCO, August 31, 2021 — nClouds (www.nclouds.com), a provider of Amazon Web Services (AWS) and DevOps consulting and implementation services and an AWS Premier Consulting Partner, announced today the expansion of its executive team with the addition of Mark Solomon as Vice President, DevOps […]

nClouds expands 24/7 support with site reliability engineering services | AWS Premier Consulting Partner achieves Datadog Gold Tier MSP Partner status

28 Jul 2021

SAN FRANCISCO, July 28, 2021 — nClouds (www.nclouds.com), a provider of Amazon Web Services (AWS) and DevOps consulting and implementation services and an AWS Premier Consulting Partner, announced today the expansion of its 24/7 on-call support services to include site reliability engineering services (SRE). A top managed service provider (MSP), the company also announced it […]

A real-world example: How to set Service-Level Objectives (SLOs)

15 Jul 2021

Here at nClouds, when we onboard a new Site Reliability Engineering (SRE) customer, our SRE team conducts an onboarding workshop that explains SRE concepts. We collaborate with the customer during the workshop to discuss how to define, measure, and track availability and user happiness, including: Defining SLIs (Service-Level Indicators), the metrics that measure compliance with […]

Getting started with site reliability engineering (SRE)

14 Jul 2021

Benjamin Treynor Sloss (VP Engineering, Google) is generally credited with coining the term “site reliability engineering (SRE)” in 2003, and its buzzworthiness has grown in the past few years. Beyond the buzzword, how do you know if your company is ready to implement SRE? Here are some signs that you need to consider getting started: […]
