Site reliability done right: 5 SRE best practices that deliver on business objectives

Dynatrace

With so many transactions now occurring online, customers are becoming more demanding and expect websites and applications to perform flawlessly at all times. However, cloud complexity has made software delivery challenging. Next, identify KPIs: create a list of the key performance indicators (KPIs) that are important to the business.
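
To make the KPI step concrete, here is a minimal Python sketch of how such KPIs might be recorded as SLO targets with a simple error-budget check; the metric names, targets, and request counts are illustrative assumptions, not taken from the article.

```python
# Minimal sketch: capturing business KPIs as SLO targets and checking how
# much error budget remains. All names and thresholds are illustrative.
SLOS = {
    "availability": {"target": 0.999, "window_days": 30},
    "p95_latency_ms": {"target": 300, "window_days": 30},
}

def error_budget_remaining(target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent over the window."""
    allowed_bad = (1 - target) * total_events
    actual_bad = total_events - good_events
    return 1.0 if allowed_bad == 0 else max(0.0, 1 - actual_bad / allowed_bad)

# e.g. 4,000 failed requests out of 10,000,000 against a 99.9% availability target
print(error_budget_remaining(SLOS["availability"]["target"], 9_996_000, 10_000_000))
```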

Automated observability, security, and reliability at scale

Dynatrace

While infrastructure has historically been treated as a bottleneck where proper scaling and compute power are applied to improve performance, these aspects are now typically addressed by hyperscalers that offer cloud-based infrastructure and infrastructure as a service.

Trending Sources

Architected for resiliency: How Dynatrace withstands data center outages

Dynatrace

And the last sentence of the email was what made me want to share this story publicly, as it’s a testimonial to how modern software engineering and operations should make you feel. This number was so low because the automatic traffic redirect was fast enough to keep the impact minimal during data center (Availability Zone) outages.

AWS observability: AWS monitoring best practices for resiliency

Dynatrace

EC2 is ideally suited for large workloads with constant traffic. Amazon’s monitoring and observability service, CloudWatch, monitors applications, resource usage, and system-wide performance for AWS-hosted environments. Topics include AWS Lambda, Amazon CloudWatch, and AWS monitoring best practices.
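
As a small, hedged illustration of the CloudWatch piece, the boto3 sketch below creates a basic CPU-utilization alarm on an EC2 instance; the region, instance ID, threshold, and SNS topic ARN are placeholders, not values from the article.

```python
# Sketch: a CloudWatch alarm on EC2 CPU utilization using boto3.
# The instance ID and SNS topic ARN below are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-example",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                # 5-minute aggregation window
    EvaluationPeriods=2,       # two consecutive breaches before alarming
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```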

Evolving Regional Evacuation

The Netflix TechBlog

In the event of an isolated failure, we first pre-scale microservices in the healthy regions, after which we can shift traffic away from the failing one. This traffic can be abstracted by our key performance indicator, stream starts per second (SPS). For example, the original post includes a graph showing how CE traffic in us-east-1 changes when we evacuate us-west-2.
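
The evacuation order described here (pre-scale the healthy regions, then shift traffic) can be sketched as below. scale_region and set_traffic_weights are hypothetical helpers standing in for whatever scaling and traffic-steering systems are actually used; the region list and headroom factor are assumptions for illustration only.

```python
# Illustrative sketch of the evacuation order described above: scale up the
# healthy regions first, then move traffic away from the failing one.
# scale_region() and set_traffic_weights() are hypothetical helpers.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

def evacuate(failing_region: str, scale_region, set_traffic_weights, headroom: float = 1.5) -> None:
    healthy = [r for r in REGIONS if r != failing_region]
    # Pre-scale before any traffic moves so the healthy regions can absorb it.
    for region in healthy:
        scale_region(region, factor=headroom)
    # Then steer traffic: zero weight for the failing region, split the rest evenly.
    weights = {r: (1.0 / len(healthy) if r in healthy else 0.0) for r in REGIONS}
    set_traffic_weights(weights)
```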

Consistent caching mechanism in Titus Gateway

The Netflix TechBlog

With traffic growth, a single leader node handling the entire request volume became overloaded. In that scenario, the system would need to deal with the data propagation latency directly, for example by using timeouts or client-originated update tracking mechanisms. Titus Gateways can then be scaled out to match client request volumes.
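
One way to picture the timeout / client-originated update-tracking idea is the sketch below. It assumes a hypothetical gateway client whose writes return a monotonically increasing version and whose cached reads report the version they reflect; none of this is the actual Titus Gateway API.

```python
# Minimal sketch: a client waits for a cached read path to catch up with its
# own write, bounded by a timeout. The gateway object and its methods are
# hypothetical placeholders, not Titus Gateway APIs.
import time

def read_after_write(gateway, key, value, timeout_s=2.0, poll_interval_s=0.05):
    written_version = gateway.write(key, value)      # assumed to return the new version
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        snapshot = gateway.cached_read(key)          # assumed to expose .value and .version
        if snapshot.version >= written_version:
            return snapshot.value                    # cache has caught up with the write
        time.sleep(poll_interval_s)
    raise TimeoutError("cache did not reflect the write within the timeout")
```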

Orchestrating Data/ML Workflows at Scale With Netflix Maestro

The Netflix TechBlog

We started seeing signs of scale issues, such as slowness during peak traffic moments (for example, 12 AM UTC), which led to an increased operational burden. At Netflix, the peak traffic load can be a few orders of magnitude higher than the average load, so the system has to withstand bursts in traffic while still meeting its SLO requirements.
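
As a generic illustration of absorbing bursts while protecting an SLO (not Maestro’s actual mechanism), a simple token bucket can smooth spikes by shedding or queueing whatever exceeds the allowed burst; the rates and burst size below are placeholders.

```python
# Sketch of a token bucket that tolerates short bursts above the steady rate.
# Requests beyond the burst capacity are rejected (or could be queued) so that
# downstream latency SLOs are not blown during spikes.
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s          # steady-state refill rate
        self.capacity = burst           # maximum burst absorbed at once
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                    # shed or queue the request instead

limiter = TokenBucket(rate_per_s=1000, burst=10_000)
```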
