Design, Exercise, Latency and Processing - Technology Performance Pulse

Seamlessly Swapping the API backend of the Netflix Android app

The Netflix TechBlog

SEPTEMBER 8, 2020

For each route we migrated, we wanted to make sure we were not introducing any regressions: either in the form of missing (or worse, wrong) data, or by increasing the latency of each endpoint. Being able to canary a new route let us verify latency and error rates were within acceptable limits. This meant that data that was static (e.g.

Latency

Latency Cache Java Traffic

Automating chaos experiments in production

The Morning Paper

JULY 4, 2019

This is a fascinating paper from members of Netflix’s Resilience Engineering team describing their chaos engineering initiatives: automated controlled experiments designed to verify hypotheses about how the system should behave under gray failure conditions, and to probe for and flush out any weaknesses.

Latency

Latency Engineering Metrics Traffic

COVID-19 Hazard Analysis using STPA

Adrian Cockcroft

MARCH 17, 2020

There are many possible failure modes, and each exercises a different aspect of resilience. Another problem is that a design control, intended to mitigate a failure mode, may not work as intended. Staff should be familiar with recovery processes and the behavior of the system when it’s working hard to mitigate failures.

Healthcare

Healthcare Government Airlines Systems

Scaling Amazon ElastiCache for Redis with Online Cluster Resizing

All Things Distributed

NOVEMBER 21, 2017

Amazon ElastiCache embodies much of what makes fast data a reality for customers looking to process high volume data at incredible rates, faster than traditional databases can manage. Redis's microsecond latency has made it a de facto choice for caching. TB of in-memory capacity in a single cluster.

Games

Games Retail Latency Education

A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications

All Things Distributed

OCTOBER 2, 2017

With these requirements in mind, and a willingness to question the status quo, a small group of distributed systems experts came together and designed a horizontally scalable distributed database that would scale out for both reads and writes to meet the long-term needs of our business. This was the genesis of the Amazon Dynamo database.

Internet

Internet AWS Performance

Failure Modes and Continuous Resilience

Adrian Cockcroft

NOVEMBER 11, 2019

There are many possible failure modes, and each exercises a different aspect of resilience. Another problem is that a design control, intended to mitigate a failure mode, may not work as intended. Staff should be familiar with recovery processes and the behavior of the system when it’s working hard to mitigate failures.

Latency

Latency Engineering Systems Hardware

Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part II)

The Morning Paper

JANUARY 23, 2020

At 1:18pm a key observation was made that an API call to populate the homepage sidebar saw a huge jump in latency. A technique called process tracing was employed to try and recover "a record of participant data acquisition, situation assessment, knowledge activation, expectations, intentions, and actions as the case unfolds over time."

Internet

Internet Cache Engineering

Amazon EC2 Cluster GPU Instances

All Things Distributed

NOVEMBER 15, 2010

From financial processing and traditional oil & gas exploration HPC applications to integrating complex 3D graphics into online and mobile applications, the applications of GPU processing appear to be limitless. For example, the most fundamental abstraction trade-off has always been latency versus throughput.

AWS

AWS Latency Programming Architecture

A persistent problem: managing pointers in NVM

The Morning Paper

DECEMBER 8, 2019

Byte-addressable non-volatile memory (NVM) will fundamentally change the way hardware interacts, the way operating systems are designed, and the way applications operate on data. Therefore any programming abstraction must be low latency and the kernel needs to be kept off the path of persistent data access as much as possible.

Hardware

Hardware Programming Media Storage

Fixing a slow site iteratively

CSS-Tricks

APRIL 1, 2021

With all of this in mind, I thought improving the speed of my own version of a slow site would be a fun exercise. Redirects are often pretty light in terms of the latency that they add to a website, but they are an easy first thing to check, and they can generally be removed with little effort. Improvement #2: The Critical Render Path.

Cache

Cache Social Media Media Website

Transforming enterprise integration with reactive streams

O'Reilly Software

MARCH 7, 2018

Working with never-ending streams of data necessitates continuous processing of it, ensuring the system keeps up with the load patterns it is exposed to, and always provides real-time, up-to-date information. Sadly, it is very common to write logic that deeply couples the notion of IO with the processing of that IO.

Transportation

Transportation Java Programming Architecture

Good Management Can Work Miracles

The Agile Manager

AUGUST 23, 2007

clever at design, at problem solving, and so forth) people who don’t respond to that style of management. Management of a professional capability is thus an exercise in making lots of adjustments, not pushing blindly ahead toward a solution. Top-down, high-control management techniques have proven to be orthogonal to the IT industry.

Innovation

Innovation Technology Latency

MezzFS — Mounting object storage in Netflix’s media processing platform

The Netflix TechBlog

MARCH 6, 2019

By Barak Alon (on behalf of Netflix’s Media Cloud Engineering team). MezzFS (short for “Mezzanine File System”) is a tool we’ve developed at Netflix that mounts cloud objects as local files via FUSE. Encoding is not a one-time process — large We have one file — the

Media

Media Storage Processing Cache
