Automated Change Impact Analysis with Site Reliability Guardian

The Dynatrace® Site Reliability Guardian simplifies the adoption of DevOps and SRE best practices to ensure reliable, secure, and high-quality releases.

Powered by Grail and the Dynatrace AutomationEngine, Site Reliability Guardian helps DevOps platform teams make better-informed release decisions by utilizing all the contextual observability and application security insights of the Dynatrace platform. The app also enables SREs to set and automate service-level objectives (SLOs) for critical services. Site Reliability Guardian further simplifies automation requirements by using entity and ownership information, alerting the right teams when regressions are detected. Dynatrace users can now unlock answer-driven automation to safely and securely release new features to customers while ensuring production SLOs are intact.

Streamline development and delivery processes

Almost every organization, across all industries, is now executing a digital transformation strategy. In application development, digital transformation aims to streamline development and delivery processes so that new software reaches the market quickly while meeting high quality and security standards.

To achieve this, many organizations are adopting DevOps practices to provide developers with a delivery platform to release their applications and services autonomously and independently. While this empowers teams to frequently deliver new features, the overall business, security, and quality objectives must be maintained. This is where Site Reliability Engineering (SRE) practices are applied. SREs use service-level indicators (SLIs) to see the complete picture of service availability, latency, performance, and capacity across various systems, especially revenue-critical systems. Service-level objectives (SLOs) are then defined on top of these indicators so that countermeasures can be enacted before the business is impacted.
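As a rough illustration of how an SLI relates to an SLO, the sketch below computes an availability SLI from request counts and the share of the error budget that is still unspent. The `Slo` class and both function names are hypothetical and chosen only for this example; they are not part of any Dynatrace API.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """Hypothetical SLO definition: target is the required success ratio."""
    name: str
    target: float  # e.g. 0.99 means 99% of requests must succeed


def availability_sli(good: int, total: int) -> float:
    """Fraction of good events; treated as 1.0 when there was no traffic."""
    return 1.0 if total == 0 else good / total


def error_budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Share of the error budget still unspent (negative means overspent)."""
    allowed_bad = (1.0 - slo.target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target (or no traffic) leaves no room for failures.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad
```

With a 99% target and 50 failures out of 10,000 requests, about half of the budget remains, which is the kind of early signal that lets a team act before the objective is actually violated.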

Given the momentum of DevOps and SRE, digital transformation goals can be achieved when automation enables organizations to apply best practices rapidly and to keep pace with the scale of the organization and applications.

Automation is key for effective collaboration

While DevOps engineers rely on observability and automation to determine whether they can promote a new release into production, SREs need to know whether a new release has a positive or negative impact on application performance, affects user experience, or threatens other SLOs. SREs face ever more challenging situations as environment complexity increases, applications scale up, and organizations grow:

  • Growing dependency graphs result in blind spots and the inability to correlate performance metrics with user experience.
  • Siloed teams and multiple tools make it difficult to align on a single version of the truth for overall system health.

Other observability solutions don’t provide the automation capabilities required to:

  • Automate and speed up the SLO validation process and quickly react to regressions detected in application topology.
  • Inform the right people with the answers they need to implement targeted countermeasures.

Automate the validation of key objectives

Dynatrace evolves Cloud Automation release validation (powered by Keptn) into Site Reliability Guardian and natively enriches the Dynatrace platform by automating change impact analysis. For each service, a guardian can automatically validate service-level objectives, “golden signals,” or security vulnerabilities before and after any deployment or configuration change. It detects regressions and deviations from previously observed behavior in signals such as latency, traffic, error rates, saturation, security coverage, vulnerability risk levels, and memory consumption. Thus, Site Reliability Guardian helps DevOps engineers and SREs speed up release delivery and improve release quality.

  • DevOps engineers achieve safer and more secure releases by applying a gating mechanism that identifies release issues quickly and prevents poor-quality code from being promoted to production.
  • SREs leverage on-demand reliability validation by comparing observability data like service-level objectives against a dedicated release version from the past or as part of progressive delivery.
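
The idea of comparing a new release against a previous one can be sketched in a few lines. This is a simplified illustration, not how Site Reliability Guardian is implemented: the function name, the metric names, and the 10% tolerance are assumptions, and it only handles metrics where a higher value is worse (latency, error rate, saturation).

```python
from typing import Dict


def detect_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.10,  # assumed: allow up to 10% relative increase
) -> Dict[str, float]:
    """Return metrics whose value grew by more than `tolerance` vs. baseline.

    Only meaningful for "higher is worse" metrics. The result maps each
    regressed metric name to its relative change (0.3 means +30%).
    """
    regressions: Dict[str, float] = {}
    for metric, base in baseline.items():
        if metric not in current or base <= 0:
            continue  # no relative change can be computed for this metric
        change = (current[metric] - base) / base
        if change > tolerance:
            regressions[metric] = change
    return regressions
```

A gating step could then block promotion whenever the returned dictionary is non-empty, while an on-demand check could simply report the relative changes.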

Get started

Get started with Site Reliability Guardian by selecting the components that you want to guard automatically. Relevant objectives, known as the four golden signals, are suggested by default:

  • Latency: the time it takes to service a request
  • Traffic: a measure of how much demand is placed on your system
  • Errors: the rate of failed requests
  • Saturation: a measure of how full your system is, emphasizing the resources that are most constrained

The four golden signals are recommended for end-user-facing services because they have proven to be a good starting point for diving into SRE practices. Additionally, you can easily use any previously defined metrics and SLOs from your environments.
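
To make the four signals concrete, here is a minimal sketch that derives them from a window of request records. Everything in it is an assumption made for illustration: the `Request` shape, the choice of the 95th percentile for latency, and the use of a single CPU utilization value as the saturation signal. In practice these values would come from your observability data rather than from hand-built records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical record of one served request."""
    duration_ms: float
    failed: bool


def golden_signals(
    requests: List[Request],
    window_seconds: float,
    cpu_utilization: float,  # assumed saturation signal, between 0.0 and 1.0
) -> Dict[str, float]:
    """Compute latency, traffic, errors, and saturation for one time window."""
    if not requests:
        return {
            "latency_p95_ms": 0.0,
            "traffic_rps": 0.0,
            "error_rate": 0.0,
            "saturation": cpu_utilization,
        }
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: ceil(0.95 * n) using integer arithmetic.
    rank = (95 * len(durations) + 99) // 100
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": sum(r.failed for r in requests) / len(requests),
        "saturation": cpu_utilization,
    }
```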

Site Reliability Guardian showing a failed validation due to an increase of error logs for a cart service.

Automation workflow runs validations and notifies responsible team members

With a solid set of objectives to validate, automatically generated workflows take care of fully automated validation for you. Whether triggered by a test result or a new release deployment, detected events work as a trigger to check the defined objectives and derive an overall status automatically. Based on the results, further actions are taken, like sending targeted notifications to the service and application owners. This is all available out-of-the-box with the default workflow template provided by Site Reliability Guardian.
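
The flow described above (check each objective, derive an overall status, notify the owners) can be sketched as follows. This is an illustrative model only, not the workflow template shipped with Site Reliability Guardian: the `Objective` class, the warning and failure thresholds, the three status names, and the `notify` callback are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Ordered from best to worst so the overall status is the worst one seen.
SEVERITY = {"pass": 0, "warning": 1, "fail": 2}


@dataclass
class Objective:
    """Hypothetical objective for a "higher is worse" metric."""
    name: str
    warning: float  # at or above this value, the objective warns
    failure: float  # at or above this value, the objective fails


def evaluate(objective: Objective, value: float) -> str:
    """Classify one measured value against the objective's thresholds."""
    if value >= objective.failure:
        return "fail"
    if value >= objective.warning:
        return "warning"
    return "pass"


def run_validation(
    objectives: List[Objective],
    measurements: Dict[str, float],
    owner: str,
    notify: Callable[[str, str], None],
) -> str:
    """Evaluate all objectives, derive the overall status, notify on issues."""
    results = {o.name: evaluate(o, measurements[o.name]) for o in objectives}
    overall = max(results.values(), key=lambda s: SEVERITY[s], default="pass")
    if overall != "pass":
        affected = sorted(n for n, s in results.items() if s != "pass")
        notify(owner, f"Validation {overall}: {', '.join(affected)}")
    return overall
```

Passing the notification step in as a callback keeps the sketch independent of any particular channel; the same function could be wired to chat, email, or a ticketing system.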

Workflow leveraging Site Reliability Guardian to make release decisions.

In addition, the workflow can be easily extended to meet custom demands, for example, by integrating tools that support your software product lifecycle. This includes executing tests, running Dynatrace Synthetic checks, or creating tickets. In this way, Dynatrace supports an ecosystem of tool integrations that unlock the full potential of DevOps and SRE with cloud-native delivery orchestration and remediation automation use cases.

What’s next?

Site Reliability Guardian is built for ensuring and maintaining production reliability and resilience. Further enhancements will focus on seamlessly integrating the solution into your existing pipelines to help you shift left and block poor-quality releases before they reach production. Davis will also assist Site Reliability Guardian in recommending relevant objectives and baselines for comparison.

  • For an early preview of this new functionality, please request Dynatrace platform preview access, which allows you to install the Site Reliability Guardian via Dynatrace Hub.
  • The Dynatrace Site Reliability Guardian will be available within the next 90 days. In the meantime, watch out for upcoming “Observability Clinics” and “Ask me anything” sessions covering the main topics of this blog post. You can either see a list of the next webinars on our website or follow us on LinkedIn to stay up-to-date with upcoming announcements and activities.