Remove 2019 Remove Latency Remove Metrics Remove Monitoring
article thumbnail

Site reliability done right: 5 SRE best practices that deliver on business objectives

Dynatrace

As a result, site reliability has emerged as a critical success metric for many organizations. The practice uses continuous monitoring and high levels of automation in close collaboration with agile development teams to ensure applications are highly available and perform without friction. Service-level objectives (SLOs). availability.

article thumbnail

Netflix at AWS re:Invent 2019

The Netflix TechBlog

In this session, we discuss the technologies used to run a global streaming company, growing at scale, billions of metrics, benefits of chaos in production, and how culture affects your velocity and uptime. Netflix runs dozens of stateful services on AWS under strict sub-millisecond tail-latency requirements, which brings unique challenges.

AWS 100
Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Trending Sources

article thumbnail

Netflix at AWS re:Invent 2019

The Netflix TechBlog

In this session, we discuss the technologies used to run a global streaming company, growing at scale, billions of metrics, benefits of chaos in production, and how culture affects your velocity and uptime. Netflix runs dozens of stateful services on AWS under strict sub-millisecond tail-latency requirements, which brings unique challenges.

AWS 100
article thumbnail

Building Netflix’s Distributed Tracing Infrastructure

The Netflix TechBlog

If we had an ID for each streaming session then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls. Additionally, it became easy to provide deep links to different monitoring and deployment systems in Edgar due to consistent tagging.

article thumbnail

Netflix at AWS re:Invent 2019

The Netflix TechBlog

In this session, we discuss the technologies used to run a global streaming company, growing at scale, billions of metrics, benefits of chaos in production, and how culture affects your velocity and uptime. In 2019, Netflix moved thousands of container hosts to bare metal. Thursday?—?December

AWS 37
article thumbnail

Open Observability – Part 1: Distributed tracing and observability

Dynatrace

How is monitoring different from observability? Already in the 2000s, service-oriented architectures (SOA) became popular, and operations teams discovered the need to understand how transactions traverse through all tiers and how these tiers contributed to the execution time and latency. Observability vs. monitoring.

article thumbnail

Keeping Netflix Reliable Using Prioritized Load Shedding

The Netflix TechBlog

Service throttling Zuul can sense when a back-end service is in trouble by monitoring the error rates and concurrent requests to that service. Those two metrics are approximate indicators of failures and latency. The key metrics used to trigger global throttling are CPU utilization, concurrent requests, and connection count.

Traffic 252