Blog, Latency, Presentation and Systems - Technology Performance Pulse

Rebuilding Netflix Video Processing Pipeline with Microservices

The Netflix TechBlog

JANUARY 10, 2024

This introductory blog focuses on an overview of our journey. Future blogs will provide deeper dives into each service, sharing insights and lessons learned from this process. This architecture shift greatly reduced the processing latency and increased system resiliency.

Processing

Processing Media Latency Innovation

Consistent caching mechanism in Titus Gateway

The Netflix TechBlog

NOVEMBER 3, 2022

In the time since it was first presented as an advanced Mesos framework, Titus has transparently evolved from being built on top of Mesos to Kubernetes, handling an ever-increasing volume of containers. As the number of Titus users increased over the years, the load and pressure on the system increased substantially.

Cache

Cache Latency Traffic Systems

Migrating Critical Traffic At Scale with No Downtime - Part 2

The Netflix TechBlog

JUNE 13, 2023

This is where large-scale system migrations come into play. Our previous blog post presented replay traffic testing — a crucial instrument in our toolkit that allows us to implement these transformations with precision and reliability. Canaries and sticky canaries are valuable tools in the system migration process.

Traffic

Traffic Metrics Systems Strategy

Monitoring Distributed Systems

Dotcom-Monitor

NOVEMBER 24, 2021

Web developers or administrators did not have to worry or even consider the complexity of distributed systems of today. Great, your system was ready to be deployed. Once the system was deployed, to ensure everything was running smoothly, it only took a couple of simple checks to verify. What is a Distributed System?

Systems

Systems Monitoring Hardware Network

Crucial Redis Monitoring Metrics You Must Watch

Scalegrid

JANUARY 25, 2024

These can help you ensure your system’s health and quickly perform root cause analysis of any performance-related issue you might be encountering. This blog post lists the important database metrics to monitor. These essential data points heavily influence both stability and efficiency within the system.

Metrics

Metrics Monitoring Latency Cache

Data ingestion pipeline with Operation Management

The Netflix TechBlog

MARCH 7, 2023

These media-focused machine learning algorithms, as well as other teams, generate a lot of data from the media files, which, as we described in our previous blog, are stored as annotations in Marken. But we cannot search or present low-latency retrievals from files, etc. We refer the reader to our previous blog article for details.

Media

Media Latency Architecture Database

Edgar: Solving Mysteries Faster with Observability

The Netflix TechBlog

SEPTEMBER 2, 2020

Edgar helps Netflix teams troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata. The more complex a system, the more places to look for clues. In an earlier blog post, we discussed Telltale, our health monitoring system.

Latency

Latency Transportation Engineering Traffic

Improved Alerting with Atlas Streaming Eval

The Netflix TechBlog

APRIL 27, 2023

Engineers want their alerting system to be realtime, reliable, and actionable. A few years ago, we were paged by our SRE team due to our Metrics Alerting System falling behind — critical application health alerts reached engineers 45 minutes late! The internals here are outside the scope of this blog post.

Storage

Storage Cache Metrics Database

Three Other Models of Computer System Performance: Part 1

ACM Sigarch

MARCH 18, 2019

With two blog posts, we argue for more use of simple models beyond Amdahl’s Law. This Part 1 discusses Bottleneck Analysis and Little’s Law, while Part 2 presents the M/M/1 Queue. Computer systems, from the Internet-of-Things devices to datacenters, are complex and optimizing them can enhance capability and save money.

Systems

Systems Latency Performance Analytics

Dynatrace accelerates business transformation with new AI observability solution

Dynatrace

JANUARY 31, 2024

While off-the-shelf models assist many organizations in initiating their journeys with generative AI (GenAI), scaling AI for enterprise use presents formidable challenges. This blog post explores how AI observability enables organizations to predict and control costs, performance, and data reliability.

Cache

Cache Azure Infrastructure Monitoring

How To Scale a Single-Host PostgreSQL Database With Citus

Percona

NOVEMBER 3, 2023

Rather than listing the concepts, function calls, etc, available in Citus, which frankly is a bit boring, I’m going to explore scaling out a database system starting with a single host. I won’t cover all the features but show just enough that you’ll want to see more of what you can learn to accomplish for yourself.

Database

Database Benchmarking Latency C++

Site reliability engineering: 5 things you need to know

Dynatrace

FEBRUARY 4, 2021

Site reliability engineering (SRE) is the practice of applying software engineering principles to operations and infrastructure processes to help organizations create highly reliable and scalable software systems. SRE applies DevOps principles to developing systems and software that help increase site reliability and performance.

Engineering

Engineering DevOps Government Latency

Build and operate multicloud FaaS with enhanced, intelligent end-to-end observability

Dynatrace

APRIL 25, 2023

Observability is essential to ensure the reliability, security and quality of any software system. In this blog post, we will discuss some of these challenges and how to overcome them. Higher latency and cold start issues due to the initialization time of the functions.

Serverless

Serverless Lambda Azure AWS

Three Other Models of Computer System Performance: Part 2

ACM Sigarch

MARCH 25, 2019

With two blog posts, we argue for more use of simple models beyond Amdahl’s Law. Part 1 previously discussed Bottleneck Analysis and Little’s Law, while this post (Part 2) presents the M/M/1 Queue. How many buffers are needed to track pending requests as a function of needed bandwidth and expected latency? and 1/S as μ.

Systems

Systems Latency Performance C++

USENIX LISA2021 Computing Performance: On the Horizon

Brendan Gregg

JULY 4, 2021

This was a chance to talk about other things I've been working on, such as the present and future of hardware performance. I also wrote about these topics in detail for my recent [Systems Performance 2nd Edition] book. Note that my predictions in this talk may be wrong, but they should be thought-provoking. Ford, et al., “TCP

Performance

Performance Latency Hardware Storage

Current status, needs, and challenges in Heterogeneous and Composable Memory from the HCM workshop (HPCA’23)

ACM Sigarch

MAY 31, 2023

Memory systems are evolving into heterogeneous and composable architectures. Heterogeneous and Composable Memory (HCM) offers a feasible solution for terabyte- or petabyte-scale systems, addressing the performance and efficiency demands of emerging big-data applications. The recently announced CXL3.0

Latency

Latency Hardware Cache Architecture

Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data…

The Netflix TechBlog

MARCH 4, 2024

In this blog post, we present our project on Auto Remediation, which integrates the currently used rule-based classifier with an ML service and aims to automatically remediate failed jobs without human intervention. Upon further profiling, we found that most of the latency came from the candidate generated step (i.e.,

Tuning

Tuning Efficiency Big Data Engineering

Towards a Reliable Device Management Platform

The Netflix TechBlog

AUGUST 30, 2021

In this blog post, we will focus on the latter feature set. The challenge, then, is to be able to ingest and process these events in a scalable manner, i.e., scaling with the number of devices, which will be the focus of this blog post. In particular, the Kafka integration is the most relevant for this blog post.

Latency

Latency Traffic Transportation Cloud

Extending Vector with eBPF to inspect host and container performance

The Netflix TechBlog

FEBRUARY 20, 2019

Remotely view real-time process scheduler latency and TCP throughput with Vector and eBPF. Today we are excited to announce latency heatmaps and improved container support for our on-host monitoring solution, Vector, to the broader community. Vector is open source and in use by multiple companies.

Performance

Performance Latency Open Source Metrics

Percentiles don’t work: Analyzing the distribution of response times for web services

Adrian Cockcroft

JANUARY 29, 2023

There is no way to model how much more traffic you can send to that system before it exceeds its SLA. This is unfortunate, because we’d really like to be able to build systems that have an SLA that we can share with the consumers of our interfaces, and be able to measure how well we are doing.

Lambda

Lambda Latency Cache C++

What is full stack observability?

Dynatrace

APRIL 6, 2022

Observability can identify the baseline user experience and allow teams to improve it by optimizing page load times or reducing latency. Cloud environments present IT complexity challenges that don’t exist in on-premises data centers. Why full-stack observability matters.

DevOps

DevOps Innovation Infrastructure Analytics

Observability platform vs. observability tools

Dynatrace

DECEMBER 22, 2021

Complex information systems fail in unexpected ways. Observability gives developers and system operators real-time awareness of a highly distributed system’s current state based on the data it generates. With observability, teams can understand what part of a system is performing poorly and how to correct the problem.

Artificial Intelligence

Artificial Intelligence Metrics Architecture DevOps

Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

John McCalpin

DECEMBER 6, 2016

The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode. This reduces the number of “hops” required for request/response/coherence messages on the mesh, and should reduce both latency and contention.

Latency

Latency Cache Testing Systems

Supercomputing Predictions: Custom CPUs, CXL3.0, and Petalith Architectures

Adrian Cockcroft

JANUARY 20, 2023

on Myths and Legends of High Performance Computing; it’s a somewhat light-hearted look at some of the same issues by the leader of the team that built the Fugaku system I mention below. HPCG is led by Japan’s RIKEN Fugaku system at 16 petaflops, which is 3% of its peak capacity. Next generation architectures will use CXL3.0

Architecture

Architecture Latency Benchmarking AWS

How To Measure the Network Impact on PostgreSQL Performance

Percona

JULY 20, 2023

Meanwhile, Hans-Jürgen Schönig’s presentation, which brought up the old discussion of Unix socket vs. TCP/IP connection, triggered me to write about other aspects of network impact on performance. The following wait events are captured from a really fast/low latency network. It is designed to be very lightweight as well.

Network

Network Performance Latency Servers

The Speed of Time

Brendan Gregg

SEPTEMBER 25, 2021

A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. The Cassandra systems were EC2 virtual machine (Xen) instances. Microbenchmark os::javaTimeMillis() on both systems. Running this on the two systems saw similar results. Try changing the kernel clocksource.

Speed

Speed Java AWS Virtualization

How To Add eBPF Observability To Your Product

Brendan Gregg

JULY 2, 2021

This is also applicable for people adding it to their own in-house monitoring systems. You likely already have agents running on all your customer systems. There are so many options it's really your own preference based on your existing system and customer environments. biolatency Disk I/O latency histogram heat map.

Latency

Latency Cache Energy Systems

Get up to 300 new metrics out of the box with AWS supporting services (GA)

Dynatrace

DECEMBER 22, 2019

Amazon Elastic File System (EFS). Metrics for each service instance are presented in detailed charts—see the example for ECS below. The example below visualizes average latency by API name and stage for a specific AWS API Gateway. The additional services you can now monitor out of the box with Dynatrace are listed below.

AWS

AWS Metrics IoT Storage

Growth Engineering at Netflix - Automated Imagery Generation

The Netflix TechBlog

FEBRUARY 9, 2021

We need to be able to easily determine what imagery is present for a given platform, region, and language. Server-generated assets, since client-side generation would require the retrieval of many individual images, which would increase latency and time-to-render. Let’s put it all together and review the system interaction diagram.

Engineering

Engineering Storage Latency Entertainment

Growth Engineering at Netflix - Creating a Scalable Offers Platform

The Netflix TechBlog

FEBRUARY 9, 2021

In particular, it’s our job to design and build the systems and protocols that enable customers from all over the world to sign up for Netflix with the plan features and incentives that best suit their needs. For more background on Growth Engineering and the signup funnel, please have a look at our previous blog post that covers the basics.

Engineering

Engineering Scalability Architecture Innovation

Optimizing data warehouse storage

The Netflix TechBlog

DECEMBER 21, 2020

This article will list some of the use cases of AutoOptimize, discuss the design principles that help enhance efficiency, and present the high-level architecture. Use cases: We found several use cases where a system like AutoOptimize can bring tons of value. We will publish a follow-up blog post about AutoAnalyze in the future.

Storage

Storage Latency Efficiency Data Engineering

Netflix Drive

The Netflix TechBlog

MAY 5, 2021

To support such use cases, access control at the user workspace and project workspace granularity is extremely important for presenting a globally consistent view of pertinent data to these artists. The major pieces, as shown in Fig. 2, are the file system interface, the API interface, and the metadata and data stores.

Media

Media Storage Architecture Cloud

ChatGPT vs. MySQL DBA Challenge

Percona

MAY 2, 2023

If you want to know more about how to calculate a good redo log size, check out this blog post: MySQL 8.0 The answer does not consider the queue or latency of the sample, which could indicate a disk with issues. Also, the redo log capacity is not the same as the redo log buffer, so it mixes different technologies and concepts.

Social Media

Social Media Database Servers Cache

How to use Server Timing to get backend transparency from your CDN

SpeedCurve

FEBRUARY 5, 2024

Latency – How much time does it take to deliver a packet from A to B. Also measured by round trip time (RTT). Today, it's possible to add these headers from your CDN with ease, if they aren't already set up out of the box. Here are a couple of great blog posts from last year's Web Performance Calendar. Definitely worth a read!

Servers

Servers Cache Retail Benchmarking

Why Traditional Monitoring Isn’t Enough for Modern Web Applications

Dotcom-Monitor

MAY 12, 2020

Websites are now more than just the storage and retrieval of information to present content to users. They now allow users to interact more with the company in the form of online forms, shopping carts, Content Management Systems (CMS), online courses, etc. Network latency. The list goes on and on.

Monitoring

Monitoring Entertainment Hardware Latency

Analyzing a High Rate of Paging

Brendan Gregg

AUGUST 29, 2021

[Truncated system statistics output from a 16-CPU x86_64 AWS host, dated 12/18/2018: repeated avg-cpu samples with columns %user, %nice, %system, %iowait, %steal, %idle.]

Cache

Cache C++ AWS Latency

Page Simulator

The Netflix TechBlog

NOVEMBER 12, 2019

To make this happen, we personalize many aspects of our service, including which movies and TV shows we present on each member’s homepage. Over the years, we have built a recommendation system that uses many different machine learning algorithms to create these personalized recommendations. Why Is This Hard?

Metrics

Metrics Government Systems Testing

Jamstack CMS: The Past, The Present and The Future

Smashing Magazine

AUGUST 20, 2021

During the 90s, we saw two content management systems for static sites: Microsoft FrontPage in 1996 and Macromedia Dreamweaver in 1997. Editing a blog post in MovableType 2.0 Mike Neumegen.

Ecommerce

Ecommerce Website Government Internet

SRE Incident Management: Overview, Techniques, and Tools

Dotcom-Monitor

DECEMBER 8, 2021

Systems, web applications, servers, devices, etc., SREs and DevOps teams can use these incidents to build back better and improve their systems and services. Now that we have talked about what an incident is, incident management is the process by which teams resolve these events and bring systems and services back to normal operation.

Social Media

Social Media Monitoring Latency DevOps
