
The state of site reliability engineering: SRE challenges and best practices in 2023

Site reliability engineering (SRE) has become increasingly important to organizations looking to keep up with the rapid pace of digital transformation. Now more than ever, customers expect high-quality, reliable digital services that offer seamless user experiences. SRE ensures dependability and consistency throughout digital environments, providing the framework through which organizations can continuously deliver these experiences to customers.

Dynatrace product marketing director of DevOps Saif Gunja hosted the 2023 State of SRE webinar. Joining Gunja for the webinar were SREs Danne Aguiar from Kyndryl, Hilliary Lipsig from Red Hat, and Stephen Townshend from SquaredUp. They discussed best practices, emerging trends, effective mindsets for establishing service-level objectives (SLOs), and more. Together, the host and panelists offered their insights into how organizations can enhance their SRE efforts.

Effective site reliability engineering requires enterprise-wide transformation

Without a unified understanding of SRE practices, organizational silos can quickly form between departments. Lack of collaboration leads to siloed observability data and leaves teams with little information to work from when attempting to deliver value. Without mature SRE adoption practices, productivity suffers.

Breaking these silos requires an organization-wide cultural shift toward SRE adoption, the panelists emphasized. They also underscored the importance of top-down approval for cultural transformation. “[Without executive approval], you hit a ceiling,” said Townshend. “You just can’t make any traction because of competing priorities.”

Gunja agreed. “If it’s not a cultural change, if it’s not coming from the top-down, most likely it will fail,” he said. “And even if it does come from the top down, there are still a lot of hurdles to cross.”

Lipsig saw this phenomenon from the other side. With top-down SRE adoption at her organization, the siloed culture visibly improved. “[I’ve] seen a lot of really great relationships that were either non-existent — or maybe a little bit rocky — trending in a better direction over the last 12 months,” she said. Evidently, executive approval streamlines understanding of SRE among various teams, leading to increased collaboration and education across an organization.

However, despite the necessity of this transformation to achieve business objectives, many higher-level executives are still hesitant to adopt SRE practices. This can often result from a lack of understanding of the discipline’s role in delivering on key performance objectives: SLOs.

To overcome this hurdle, the panelists recommended that engineers communicate the value of SRE to higher-level executives through business data. With these metrics in hand, engineers can demonstrate how strong, enterprise-wide SRE practices help reduce toil, employee burnout, operational expenses, and the number of missed SLOs.

SLOs should be focused and driven by high-level business goals

When creating SLOs to measure SRE success, it is important to keep in mind how the objectives will benefit the organization. At times, engineering teams can become preoccupied with the minutiae of technological endeavors and lose sight of overall business goals. Teams should ensure that even the smallest SLOs relate to business growth.

However, understanding how technical SLOs impact business outcomes isn’t always straightforward. For example, how much does a reduction in MTTR impact revenue? Answering questions like this requires cross-functional collaboration. Communication between teams with different skill sets can help demystify the connections between SLOs and business outcomes.

It is also important to note that creating business-centric SLOs doesn’t mean solely focusing on high-level goals. In fact, the panelists emphasized the importance of creating smaller SLOs that can be used to better measure progress. By recognizing small wins, teams can avoid becoming overwhelmed by the thought of reaching larger objectives. These small wins, such as implementing a blameless root cause analysis process, can take many forms and don’t necessarily involve numerical metrics.

For organizations building business-centric SLOs, Aguiar had some recommendations. “If your company has a service-level agreement (SLA), start with that,” he said. “You can practice with this specific SLO that is set by your SLA, and then you can define others later.”
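
To make Aguiar's suggestion concrete, the short Python sketch below turns an SLA-style availability target into the amount of downtime it tolerates over a window. The 99.9% target and the 30-day window are hypothetical figures chosen for illustration, not numbers from the webinar, and the function name is an assumption.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability target tolerates over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    # The tolerated downtime is the fraction of the window the target leaves over.
    return window * (1.0 - slo_target)


# Hypothetical SLA: 99.9% availability measured over 30 days.
budget = allowed_downtime(0.999, timedelta(days=30))
print(f"Tolerated downtime: {budget.total_seconds() / 60:.1f} minutes")
```

Under these assumed figures the target leaves roughly 43 minutes of downtime per 30 days, which gives a team a first SLO to practice with before defining others.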

Lipsig also had words of wisdom. “Pick one thing that measures whether or not your customer has been successful in their engagement with your product, and then work on how to measure that,” she said. Business-centric SLOs are driven by client success: when the customer wins, so does the business. Thus, careful consideration of client needs is a crucial aspect of creating effective SLOs.
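
One way to read Lipsig's advice is as a service-level indicator (SLI) defined as the share of customer engagements that succeeded. The sketch below assumes a hypothetical `Engagement` record and a hypothetical 95% target; what counts as "succeeded" depends entirely on the product and is not specified in the webinar.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Engagement:
    """Hypothetical record of one customer interaction with the product."""
    customer_id: str
    succeeded: bool


def success_ratio(engagements: Sequence[Engagement]) -> Optional[float]:
    """Return the fraction of successful engagements, or None if there is no data."""
    if not engagements:
        return None
    good = sum(1 for e in engagements if e.succeeded)
    return good / len(engagements)


def meets_slo(sli: float, target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= target


# Illustrative data only: three of four engagements succeeded.
sample = [
    Engagement("a", True),
    Engagement("b", True),
    Engagement("c", False),
    Engagement("d", True),
]
sli = success_ratio(sample)
print(sli, meets_slo(sli, target=0.95))
```

Starting from a single ratio like this keeps the measurement tied to the customer's outcome rather than to an internal technical metric.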

Customer empathy is key to a fully optimized site reliability engineering practice

Software engineering can often be an impersonal discipline. SRE is not typically a customer-facing role, so it can be easy to misunderstand the context of a client’s pain points. This lack of clarity can lead to slowed remediation times and ineffective solutions. Moreover, customers may become frustrated by the inefficiencies that characterize weak organizational relationships, resulting in poor retention rates.

Building customer relationships is another case in which interdepartmental collaboration is crucial for SRE. Panelists encouraged engineers to consult with customer-facing teams to better understand the context of clients’ situations and address key needs. “I created really good partnerships with our customer success engineers,” Lipsig shared. But she emphasized the importance of strong internal collaboration: “[Building trust with the customers] was not something I could have done by myself.”

Working to understand client needs fosters trust between the organization and the client. As a result, clients are more likely to implement suggestions from SRE teams, giving engineers increased agency.

Panelists also expressed the importance of “soft skills” when handling client interactions. The ability to communicate with customers respectfully and patiently is key to building the essential trust needed for SRE. They also highlighted that this practice should apply not only to customers but also to colleagues within an organization.

Generative AI and the future of site reliability engineering

“AI is not new in the APM world,” Aguiar pointed out. Still, recent breakthroughs in generative AI may offer advantages to SRE teams across organizations of all kinds. For example, generative AI has the potential to provide more intuitive methods of querying data: its natural language processing capabilities let users gather insights from data without a formal query language. Without that barrier, data is more easily accessible and less likely to become siloed.

Generative AI can also help improve root cause analysis by allowing users to ask specific questions regarding architecture and digital environments. Access to quick, reliable answers fosters rapid learning among teams. This accessibility results in reduced MTTR and improved productivity.

The panelists speculated that AI will likely improve quality of life for SRE teams through its ability to efficiently execute tasks. Aguiar forecasted that a key function of generative AI for SRE will involve creating runbooks based on past experiences. These would have the potential to largely eliminate manual intervention and lengthy processes to address regularly occurring incidents. However, Lipsig reminded the panel that SRE looks different across organizations. “We’ll see lots of different types of impact versus one definitive impact [from generative AI],” she said.

Generative AI is a promising new asset that SRE teams can uniquely apply to their practices to achieve greater efficiency, but it is not a complete replacement for certain preexisting reliability measures.

Successful site reliability engineering favors proactive over reactive measures

Unexpected system outages, server overloads, and other unforeseen events can have potentially disastrous effects on not only the productivity of SREs but also organizational profitability. These challenges may lead to massive amounts of unplanned work that situates SRE in a reactive state where efficiency and true progress are continually thwarted. Performing root cause analysis within these reactive models can be a lengthy and costly process, leaving SREs severely under-resourced. To mitigate this, SRE teams must initiate planned work to begin operating proactively.

A key component of a proactive SRE model is end-to-end monitoring, including monitoring of systems that the SRE team’s organization does not directly own. By maintaining strong observability of clients’ and vendors’ systems, teams can identify potential software issues before they proliferate. Strong black box monitoring, load balance analysis, and routine system checks are all examples of proactive work that can reap great benefits in productivity and incident prevention.

With organizations spending ample resources on gathering and storing data, SRE teams have additional incentive to shift away from a reactive work model. In such a model, valuable data is underused: it is leveraged only for incident response rather than prevention. It is in organizations’ best interest to extract the full potential of data-driven insights by creating workflows that emphasize fire prevention, not firefighting.

“We’re starting to respond to the pager on the [service-level indicator] breach so that we’re always saving our SLOs,” said Lipsig, on how SREs at Red Hat approach incidents. “We’re never burning too much of our error budget.” Once teams begin using data proactively, “[They can] start doing meaningful work with that data instead of just leveraging it for response.”
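
Paging on an SLI breach before the SLO is lost is often expressed as an error-budget burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below illustrates that general idea under stated assumptions. It is not a description of Red Hat's alerting, and the function names, the 99.9% target, and the threshold of 1.0 are all hypothetical.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value above 1.0 means the error budget is being spent faster than
    the SLO permits over the measured period.
    """
    if total_events == 0:
        return 0.0  # no traffic, so no budget was consumed
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0.0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return (bad_events / total_events) / allowed_error_rate


def should_page(bad_events: int, total_events: int, slo_target: float,
                threshold: float = 1.0) -> bool:
    """Page when the burn rate exceeds the (assumed) threshold."""
    return burn_rate(bad_events, total_events, slo_target) > threshold


# Illustrative numbers: 30 failed requests out of 10,000 against a 99.9% SLO.
print(should_page(30, 10_000, 0.999))
```

In this example the service is failing about three times faster than the target allows, so the check would page while most of the error budget is still intact.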

Increasing collaboration is critical to meeting SLOs

In today’s technological landscape, there has been significant debate surrounding the best approaches to software engineering within cloud-native architectures. Whether that approach is SRE, DevOps, or platform engineering, the panelists asserted that departmental categorization is not nearly as important as the actual work happening in those departments. Instead of becoming preoccupied with job titles, teams should focus on reaching SLOs effectively and efficiently. Breaking out of the mindset that DevOps, SRE, and platform engineering are opposing forces is a vital step toward mitigating silos and ensuring that SLOs are met.

“SRE is about designing, building, and operating reliable services at scale,” said Townshend. “And as long as I’m doing that, I think I’m successful.”

To learn more about the current state of SRE across a wide range of industries, key challenges, and how the discipline is evolving, download the free State of SRE report.