Migrating Critical Traffic At Scale with No Downtime - Part 1

The Netflix TechBlog

Replay Traffic Testing: Replay traffic refers to production traffic that is cloned and forked over to a different path in the service call graph, allowing us to exercise new/updated systems in a manner that simulates actual production conditions. This approach has a handful of benefits.

Traffic 339

Seamlessly Swapping the API backend of the Netflix Android app

The Netflix TechBlog

For each route we migrated, we wanted to make sure we were not introducing any regressions: either in the form of missing (or worse, wrong) data, or by increasing the latency of each endpoint. Being able to canary a new route let us verify latency and error rates were within acceptable limits.

Latency 233

Trending Sources

COVID-19 Hazard Analysis using STPA

Adrian Cockcroft

There are many possible failure modes, and each exercises a different aspect of resilience. Staff should be familiar with recovery processes and the behavior of the system when it’s working hard to mitigate failures. Is the model of the controlled process looking at the right metrics and behaving safely?

Service level objectives: 5 SLOs to get started

Dynatrace

Response time: Response time refers to the total time it takes for a system to process a request or complete an operation. This ensures that customers can quickly navigate through product listings, add items to their cart, and complete the checkout process without experiencing noticeable delays. or above for the checkout process.

Latency 181

Service level objective examples: 5 SLO examples for faster, more reliable apps

Dynatrace

Response time: Response time refers to the total time it takes for a system to process a request or complete an operation. This ensures that customers can quickly navigate through product listings, add items to their cart, and complete the checkout process without experiencing noticeable delays. or above for the checkout process.

Traffic 173

Real user monitoring vs. synthetic monitoring: Understanding best practices

Dynatrace

Real user monitoring (RUM) is a performance monitoring process that collects detailed data about users’ interactions with an application. Customized tests based on specific business processes and transactions — for example, a user that is leveraging services when accessing an application. What is real user monitoring?

Automating chaos experiments in production

The Morning Paper

They use a combination of timeouts, retries, and fallbacks to try to mitigate the effects of these failures, but these don’t get exercised as often as the happy path, so how can we be confident they’ll work as intended when called upon? On error rates.

Latency 77