Design, Exercise and Latency - Technology Performance Pulse

Interpreting A/B test results: false negatives and power

The Netflix TechBlog

OCTOBER 26, 2021

We then used simple thought exercises based on flipping coins to build intuition around false positives and related concepts such as statistical significance, p-values, and confidence intervals. As a result, if the test treatment results in a small reduction in the latency metric, it’s hard to successfully identify…

Testing

Testing Latency Metrics Innovation

Seamlessly Swapping the API backend of the Netflix Android app

The Netflix TechBlog

SEPTEMBER 8, 2020

For each route we migrated, we wanted to make sure we were not introducing any regressions: either in the form of missing (or worse, wrong) data, or by increasing the latency of each endpoint. Being able to canary a new route let us verify latency and error rates were within acceptable limits. This meant that data that was static (e.g.

Latency

Latency Cache Java Traffic

Automating chaos experiments in production

The Morning Paper

JULY 4, 2019

This is a fascinating paper from members of Netflix’s Resilience Engineering team describing their chaos engineering initiatives: automated controlled experiments designed to verify hypotheses about how the system should behave under gray failure conditions, and to probe for and flush out any weaknesses. Safeguards.

Latency

Latency Engineering Metrics Traffic

Scaling Amazon ElastiCache for Redis with Online Cluster Resizing

All Things Distributed

NOVEMBER 21, 2017

Redis's microsecond latency has made it a de facto choice for caching. Four years ago, as part of our AWS fast data journey, we introduced Amazon ElastiCache for Redis, a fully managed, in-memory data store that operates at microsecond latency. TB of in-memory capacity in a single cluster.

Games

Games Retail Latency Education

A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications

All Things Distributed

OCTOBER 2, 2017

With these requirements in mind, and a willingness to question the status quo, a small group of distributed systems experts came together and designed a horizontally scalable distributed database that would scale out for both reads and writes to meet the long-term needs of our business. This was the genesis of the Amazon Dynamo database.

Internet

Internet Internet AWS Performance

COVID-19 Hazard Analysis using STPA

Adrian Cockcroft

MARCH 17, 2020

There are many possible failure modes, and each exercises a different aspect of resilience. Another problem is that a design control, intended to mitigate a failure mode, may not work as intended. STPA is based on a functional control diagram of the system, and the safety constraints and requirements for each component in the design.

Healthcare

Healthcare Government Airlines Systems

Failure Modes and Continuous Resilience

Adrian Cockcroft

NOVEMBER 11, 2019

There are many possible failure modes, and each exercises a different aspect of resilience. Another problem is that a design control, intended to mitigate a failure mode, may not work as intended. STPA is based on a functional control diagram of the system, and the safety constraints and requirements for each component in the design.

Latency

Latency Engineering Systems Hardware

A persistent problem: managing pointers in NVM

The Morning Paper

DECEMBER 8, 2019

Byte-addressable non-volatile memory (NVM) will fundamentally change the way hardware interacts, the way operating systems are designed, and the way applications operate on data. Therefore any programming abstraction must be low latency and the kernel needs to be kept off the path of persistent data access as much as possible.

Hardware

Hardware Programming Media Storage

Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part II)

The Morning Paper

JANUARY 23, 2020

1:18pm a key observation was made that an API call to populate the homepage sidebar saw a huge jump in latency. The process tracing exercise included: Examining IRC transcripts from multiple channels. Members of the team begin diagnosing the issue using the #sysops and #warroom internal IRC channels.

Internet

Internet Internet Cache Engineering

Why I hate MPI (from a performance analysis perspective)

John McCalpin

AUGUST 1, 2018

This is an intellectually challenging and labor-intensive exercise, requiring detailed review of the published details of each of the components of the system, and usually requiring significant “detective work” (using customized microbenchmarks, hardware performance counter analysis, and creative thinking) to fill in the gaps.

Hardware

Hardware Transportation Performance Latency

Amazon EC2 Cluster GPU Instances - All Things Distributed

All Things Distributed

NOVEMBER 15, 2010

For example, the most fundamental abstraction trade-off has always been latency versus throughput. These trade-offs have even impacted the way the lowest level building blocks in our computer architectures have been designed. The throughput of this pipeline is more important than the latency of the individual operations.

AWS

AWS Latency Programming Architecture

Fixing a slow site iteratively

CSS - Tricks

APRIL 1, 2021

With all of this in mind, I thought improving the speed of my own version of a slow site would be a fun exercise. Redirects are often pretty light in terms of the latency that they add to a website, but they are an easy first thing to check, and they can generally be removed with little effort. Improvement #2: The Critical Render Path.

Cache

Cache Social Media Media Website

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

John McCalpin

JANUARY 22, 2018

There was no deep goal — just a desire to see the maximum GFLOPS in action. The exercise seemed simple enough — just fix one item in the Colfax code and we should be finished. Using the minimum number of accumulator registers needed to tolerate the pipeline latency (12), the assembly code for the inner loop is: B1.8:

Latency

Latency Hardware Code Testing

Transforming enterprise integration with reactive streams

O'Reilly Software

MARCH 7, 2018

Service-oriented architecture (SOA) was hyped in the mid-2000s as a modern take on distributed systems architecture, which through modular design would provide productivity through loose coupling between collaborative services—so-called "WebServices"—communicating through externally published APIs. of ( Invoice. alsoTo ( Sink. of ( Order.

Transportation

Transportation Java Programming Architecture

Good Management Can Work Miracles

The Agile Manager

AUGUST 23, 2007

clever at design, at problem solving, and so forth) people who don’t respond to that style of management. Management of a professional capability is thus an exercise in making lots of adjustments, not pushing blindly ahead toward a solution. Top-down, high-control management techniques have proven to be orthogonal to the IT industry.

Innovation

Innovation Technology Technology Latency

MezzFS — Mounting object storage in Netflix’s media processing platform

The Netflix TechBlog

MARCH 6, 2019

Video encoding is what MezzFS was originally designed for and remains one of its canonical use cases, so we’ll focus on video encoding to describe the problem that MezzFS solves. We parallelize rerun jobs with Titus, Netflix’s container management platform, which allows us to exercise many hundreds of replay files in minutes.

Media

Media Storage Processing Cache
