Beyond Monitoring and IT Ops: Understanding How Observability Helps the SRE
In the last article in this series on Site Reliability Engineering (SRE) and observability, Jason Bloomberg deconstructed the fundamentals of the SRE approach and explained why they are essential to developing modern software. In it, he hinted that observability is indispensable for quantifying the reliability vs. innovation trade-off.
This trade-off takes the form of a tension: the need to rapidly deliver a continuous stream of software releases and updates on one side, and the impact that process can have on the reliability, availability, and performance of those applications on the other.
As Bloomberg explained, managing this balance demands that organizations establish service level objectives (SLOs) that define acceptable service performance criteria and that they create an error budget that represents the difference between these SLOs and 100% performance.
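The arithmetic behind an error budget can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the 30-day window, and the request counts are hypothetical and do not come from the article or from any particular SRE tool.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Return the error budget as a fraction: the gap between the SLO and 100%."""
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Convert the budget into minutes of tolerable downtime over a window (assumed 30 days)."""
    return error_budget_fraction(slo_percent) * window_days * 24 * 60


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Return the share of the error budget still unspent (negative means overspent)."""
    budget_in_requests = error_budget_fraction(slo_percent) * total_requests
    return 1.0 - failed_requests / budget_in_requests


if __name__ == "__main__":
    # Hypothetical service with a 99.9% availability SLO.
    print(round(allowed_downtime_minutes(99.9), 1))          # minutes of downtime allowed per 30 days
    print(round(budget_remaining(99.9, 1_000_000, 250), 2))  # fraction of the budget left
```

With a 99.9% SLO, the budget is 0.1%, which works out to roughly 43 minutes of downtime over 30 days. Every failed request or minute of unavailability draws that budget down, which is why it has to be measured continuously.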
As Bloomberg suggested, the challenge is that for this process to work, SREs must be able to accurately and continuously quantify both their performance and how much of their error budget they have spent. But as many SREs are finding, the devil is in the details, and those details can often leave them exposed.
Bloomberg explained that observability might be the answer to this challenge. But why, and what do SREs need to understand to put it to work for them?
Observability vs. Monitoring — and Why It Matters to SREs
Before we can dive into the ins and outs of observability, we need to address the monitoring elephant in the room. The first questions that tend to surface when it comes to observability in the SRE context are, "Isn't monitoring sufficient?" or "Isn't observability just a fancy new term for monitoring?"
The answer to both of these questions is a resounding "No!" But it's clear that many SREs aren't getting the message. In fact, according to some reports, as few as 50% of SREs currently use observability in their practice.
Here’s the difference. Monitoring is based on the idea that you can predetermine the potential areas of concern within a technology stack and then instrument those areas so that you can watch them. It sounds great. And in the traditional, mostly static environments of the past, it worked fine.
The problem is that in today’s cloud-native, microservices-fueled, and constantly changing environments, it’s almost impossible to predict what data you’re going to need or what areas may be your source of issues in the future. The complexity and ephemerality of today’s environments make monitoring an imperfect approach to collecting data — and one that requires far too much overhead.
This very realization led to the development of the concept of observability in the first place. At its core, observability flips the monitoring approach upside down.
Rather than trying to figure out in advance what you’ll need to know, observability is all about collecting the external outputs of a system (its events, logs, metrics, and traces), using that data to infer the system’s internal state, and then acting on that understanding to manage the environment.
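To make the contrast with monitoring concrete, here is a minimal sketch of the idea, not a real observability pipeline. It assumes a service that emits wide, structured events carrying whatever context is available; the helper names (make_event, error_rate_by) and the field names (region, build) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def make_event(service, route, status, duration_ms, **context):
    """Build one structured event; any extra context fields are kept alongside the basics."""
    event = {
        "service": service,
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
    }
    event.update(context)
    return event


def error_rate_by(events, field):
    """Group events by a field chosen after the fact and return the error rate per value."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event.get(field, "unknown")
        totals[key] += 1
        if event["status"] >= 500:  # treat 5xx responses as errors
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Hypothetical events from a checkout service.
events = [
    make_event("checkout", "/pay", 200, 120, region="eu", build="v41"),
    make_event("checkout", "/pay", 500, 900, region="eu", build="v42"),
    make_event("checkout", "/pay", 200, 110, region="us", build="v41"),
    make_event("checkout", "/pay", 503, 950, region="us", build="v42"),
]

print(error_rate_by(events, "region"))  # {'eu': 0.5, 'us': 0.5}
print(error_rate_by(events, "build"))   # {'v41': 0.0, 'v42': 1.0}
```

A dashboard built in advance around region would show nothing unusual here. Because each event kept its build field, the same data can be sliced a different way later, and the failing build stands out without anyone having predicted that question.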
Fundamentally, it’s about creating a continuous and sustainable data feed from your systems that will allow you to deal with the unknown unknowns. It's an inherently different approach that helps SREs close their two most significant gaps.
Read the full article on the Moogsoft blog.
This analysis was commissioned by Moogsoft and produced under an arrangement with Intellyx.