In 2003, Benjamin Treynor Sloss, generally credited with coining the term “site reliability engineering” (SRE), was put in charge of Google’s production team, which consisted of seven engineers. Before the DevOps movement, this first team of software engineers was tasked with making Google’s already large-scale sites more reliable, efficient, and scalable. Today, SRE is a practice that applies software engineering skills and mindset to IT operations. The goal of SRE is to create scalable and highly reliable software systems through automation and continuous integration and delivery.
The work of SRE consists of following an evolving set of reliability “best practices.” As you know, best practices are general in nature — the specific activities that work for Google won’t necessarily work for your organization. Therefore, your SRE team needs to understand the “spirit” of the best practices well enough so that they can tailor them to craft your own brand of SRE that meets the unique needs of your business. The unique fingerprint of processes, structure, and people of your individual organization will challenge your SRE team to apply creativity and ingenuity to meet the business goal — i.e., sustained superior customer experience via system reliability.
For organizations defining the role of their SRE team, here are five fundamental best practices to consider including in the team’s repertoire.
Set realistic targets: 100% reliability is not a realistic target.
Modern microservices environments consist of many distinct components and external dependencies. A microservices architecture is a cloud-native approach in which a single application is composed of many loosely coupled, independently deployable services, the model popularized by companies like Netflix, Uber, and Etsy. In systems of this complexity, occasional failure is inevitable, which means reliability will always fall short of 100%. Promising more than can be achieved sets the team up for failure.
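To see why each extra “nine” matters, it helps to translate an availability target into an allowed-downtime budget. The following is a minimal sketch (the function name and 30-day window are my own assumptions, not a standard API):

```python
# Sketch: convert an availability target into allowed downtime per
# 30-day window. Each extra "nine" shrinks the budget tenfold, which
# is why 100% is not a realistic target.

def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of allowed downtime for a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/month")
```

At 99.9%, the budget is roughly 43 minutes per month; at 99.99%, barely four. A literal 100% target would leave a budget of zero, with no room for maintenance, deploys, or dependency failures.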
Don’t have too many SLIs and SLOs.
It is essential to monitor all major components of the platform. However, creating SLOs (Service-Level Objectives) for each individual component is not advisable. It is more effective to group services according to their capabilities and set achievement targets for each group. For example, group the system into categories such as storage, database, and pipeline, then define and implement SLOs and SLIs (Service-Level Indicators) per group.
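The grouping above can be sketched as a simple mapping from capability groups to targets. The group names, component names, and availability numbers below are illustrative assumptions, not prescribed values:

```python
# Sketch: one SLO per capability group rather than per component.
# All names and targets here are invented for illustration.

SLO_GROUPS = {
    "storage":  {"components": ["blob-store", "cdn-cache"], "availability": 99.9},
    "database": {"components": ["users-db", "orders-db"], "availability": 99.95},
    "pipeline": {"components": ["ingest", "transform", "load"], "availability": 99.5},
}

def slo_for(component: str) -> float:
    """Look up the group-level availability target that covers a component."""
    for group in SLO_GROUPS.values():
        if component in group["components"]:
            return group["availability"]
    raise KeyError(f"{component} is not assigned to any SLO group")

print(slo_for("orders-db"))  # the database group's target applies
```

Three group-level targets are far easier to track, alert on, and explain to stakeholders than dozens of per-component ones.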
Set SLIs and SLOs for components exposed to customers.
In modern complex systems, platforms consist of hundreds of components, and defining SLIs for every one of them is not feasible. The best way to define and implement SLOs and SLIs is at system boundaries, that is, at the points where one or more components are exposed to external customers.
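A common boundary-level SLI is the proportion of customer-facing requests served successfully, measured where traffic enters the system (for example, at the load balancer). A minimal sketch, with invented request counts:

```python
# Sketch: an availability SLI measured at the system boundary, i.e.,
# over requests customers actually send, not per internal component.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI = percentage of boundary requests served successfully."""
    if total_requests == 0:
        return 100.0  # no traffic observed, no observed failures
    return 100.0 * good_requests / total_requests

# e.g., 998,760 successful responses out of 1,000,000 boundary requests
sli = availability_sli(998_760, 1_000_000)
print(f"availability SLI: {sli:.3f}%")
```

Because the measurement sits at the boundary, it captures what customers experience regardless of which internal component caused a failure.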
Documentation and stakeholder agreement.
A document detailing the Service-Level Indicators and Objectives should be created and kept in a centralized location where all stakeholders can view the details on demand. The SLO should be implemented only after all stakeholders have approved it, and the document should record details such as the stakeholders’ names, the date of approval, the error budget, and the SLOs and SLIs themselves. It must be written in plain language, name the metrics to be used, and delineate the error budget in full.
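As a sketch of what such a record might contain, here is one possible shape expressed as a Python dict. Every field value (service name, dates, roles, numbers) is an invented placeholder, not a required schema:

```python
# Sketch: fields an SLO document might record. All values are
# illustrative placeholders.

slo_document = {
    "service": "checkout-api",
    "slo": {"metric": "availability", "target_pct": 99.9, "window_days": 30},
    "slis": ["ratio of successful responses to all responses at the load balancer"],
    "error_budget": "0.1% of requests per 30-day window",
    "approved_by": ["Product Lead", "SRE Lead", "Support Manager"],
    "approval_date": "2023-01-15",
}

# Implementation should begin only once every stakeholder has signed off.
assert slo_document["approved_by"], "SLO must not be implemented before approval"
```

Keeping the record in a machine-readable form also makes it easy to generate dashboards and alerts directly from the agreed targets.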
Continuous feedback and continuous improvements.
SRE is inseparable from system refinement and SLO fine-tuning. SRE establishes a process to review all outages, deployments, and targets within a specific time frame to keep systems reliable and maintain program efficiency. Since nothing in our industry is guaranteed to be 100% reliable, it is beneficial to conduct blameless postmortems to prevent repeat incidents and improve future responses. In other words, no witch hunts to assign guilt: assume that every team member acted with the best intentions based on the information they had at the time. Instead of singling out and blasting whoever made a mistake, focus on improving performance and moving forward.