Seamlessly Swapping the API backend of the Netflix Android app

The Netflix TechBlog

For each route we migrated, we wanted to make sure we were not introducing any regressions: either in the form of missing (or worse, wrong) data, or by increasing the latency of each endpoint. Being able to canary a new route let us verify latency and error rates were within acceptable limits. This meant that data that was static (e.g.

Latency 233
Automating chaos experiments in production

The Morning Paper

This is a fascinating paper from members of Netflix’s Resilience Engineering team describing their chaos engineering initiatives: automated controlled experiments designed to verify hypotheses about how the system should behave under gray failure conditions, and to probe for and flush out any weaknesses. Safeguards.

Latency 77

Trending Sources

COVID-19 Hazard Analysis using STPA

Adrian Cockcroft

There are many possible failure modes, and each exercises a different aspect of resilience. Another problem is that a design control, intended to mitigate a failure mode, may not work as intended. Staff should be familiar with recovery processes and the behavior of the system when it’s working hard to mitigate failures.

Scaling Amazon ElastiCache for Redis with Online Cluster Resizing

All Things Distributed

Amazon ElastiCache embodies much of what makes fast data a reality for customers looking to process high volume data at incredible rates, faster than traditional databases can manage. Redis's microsecond latency has made it a de facto choice for caching. TB of in-memory capacity in a single cluster.

Games 112
A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications

All Things Distributed

With these requirements in mind, and a willingness to question the status quo, a small group of distributed systems experts came together and designed a horizontally scalable distributed database that would scale out for both reads and writes to meet the long-term needs of our business. This was the genesis of the Amazon Dynamo database.

Internet 128
Failure Modes and Continuous Resilience

Adrian Cockcroft

There are many possible failure modes, and each exercises a different aspect of resilience. Another problem is that a design control, intended to mitigate a failure mode, may not work as intended. Staff should be familiar with recovery processes and the behavior of the system when it’s working hard to mitigate failures.

Latency 52