Availability, Latency, Traffic and Tuning - Technology Performance Pulse

Migrating Critical Traffic At Scale with No Downtime?—?Part 1

The Netflix TechBlog

MAY 4, 2023

Migrating Critical Traffic At Scale with No Downtime — Part 1 Shyam Gala , Javier Fernandez-Ivern , Anup Rokkam Pratap , Devang Shah Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. This approach has a handful of benefits.

Traffic

Traffic Latency Tuning Systems

Bending pause times to your will with Generational ZGC

The Netflix TechBlog

MARCH 5, 2024

Reduced tail latencies In both our GRPC and DGS Framework services, GC pauses are a significant source of tail latencies. Each of these errors is a canceled request resulting in a retry so this reduction further reduces overall service traffic by this rate: Errors rates per second.

Latency

Latency Java Tuning Efficiency

Crucial Redis Monitoring Metrics You Must Watch

Scalegrid

JANUARY 25, 2024

Key Takeaways Critical performance indicators such as latency, CPU usage, memory utilization, hit rate, and number of connected clients/slaves/evictions must be monitored to maintain Redis’s high throughput and low latency capabilities. Similarly, an increased throughput signifies an intensive workload on a server and a larger latency.

Metrics

Metrics Monitoring Latency Cache

MySQL Performance Tuning 101: Key Tips to Improve MySQL Database Performance

Percona

SEPTEMBER 1, 2023

While there is no magic bullet for MySQL performance tuning, there are a few areas that can be focused on upfront that can dramatically improve the performance of your MySQL installation. What are the Benefits of MySQL Performance Tuning? A finely tuned database processes queries more efficiently, leading to swifter results.

Tuning

Tuning Database Performance Hardware

Automated observability, security, and reliability at scale

Dynatrace

JULY 18, 2023

Whether tracking internal, workload-centric indicators such as errors, duration, or saturation or focusing on the golden signals and other user-centric views such as availability, latency, traffic, or engagement, SLOs-as-code enables coherent and consistent monitoring throughout the environment at scale.

Best Practices

Best Practices Code Infrastructure Latency

The road to observability demo part 3: Collect, instrument, and analyze telemetry data automatically with Dynatrace

Dynatrace

MAY 17, 2023

For that, we focused on OpenTelemetry as the underlying technology and showed how you can use the available SDKs and libraries to instrument applications across different languages and platforms. Here, we can find statistics on the overall availability of the database, connections, queries, and errors. What is OneAgent?

Metrics

Metrics Monitoring Database Network

Towards a Reliable Device Management Platform

The Netflix TechBlog

AUGUST 30, 2021

For example, when running tests, the state of the device will change from “available for testing” to “in test.” As such, we can see that the traffic load on the Device Management Platform’s control plane is very dynamic over time. Over the lifecycle of a device connected to the RAE, the device can change attributes at any time.

Latency

Latency Traffic Transportation Hardware

Turbocharge Your Content Delivery With CDN Multiple Origins Load Balancer!

IO River

NOVEMBER 2, 2023

â€Just as a well-coordinated airport directs flights to multiple runways based on traffic and weather conditions, a CDN with Multiple Origins Load Balancing ensures that web traffic is distributed across various data centers, optimizing performance and reliability. â€But how does it decide where to send this traffic?

Traffic

Traffic Cache Servers Latency

How digital experience monitoring helps deliver business observability

Dynatrace

APRIL 26, 2022

DEM provides an outside-in approach to user monitoring that measures user experience (UX) in real time to ensure applications and services are available, functional, and well-performing across all channels of the digital experience, including web, mobile, and IoT.

Monitoring

Monitoring Social Media IoT Metrics

Migrating Critical Traffic At Scale with No Downtime?—?Part 2

The Netflix TechBlog

JUNE 13, 2023

Migrating Critical Traffic At Scale with No Downtime — Part 2 Shyam Gala , Javier Fernandez-Ivern , Anup Rokkam Pratap , Devang Shah Picture yourself enthralled by the latest episode of your beloved Netflix series, delighting in an uninterrupted, high-definition streaming experience. This is where large-scale system migrations come into play.

Traffic

Traffic Metrics Systems Strategy

Turbocharge Your Content Delivery With CDN Multiple Origins Load Balancer!

IO River

NOVEMBER 2, 2023

Just as a well-coordinated airport directs flights to multiple runways based on traffic and weather conditions, a CDN with Multiple Origins Load Balancing ensures that web traffic is distributed across various data centers, optimizing performance and reliability. But how does it decide where to send this traffic?

Traffic

Traffic Cache Network Servers

CDN Web Application Firewall (WAF): Your Shield Against Online Threats

IO River

NOVEMBER 15, 2023

In technical terms, network-level firewalls regulate access by blocking or permitting traffic based on predefined rules. â€At its core, WAF operates by adhering to a rulebookâ€”a comprehensive list of conditions that dictate how to handle incoming web traffic. You've put new rules in place. Let's dive deep into these challenges.â€1.

Traffic

Traffic Network Logistics Architecture

Meet Hydrogen: A React Framework For Dynamic, Contextual And Personalized E-Commerce

Smashing Magazine

NOVEMBER 8, 2021

As developers, we rightfully obsess about the customer experience, relentlessly working to squeeze every millisecond out of the critical rendering path, optimize input latency, and eliminate jank. Surveying the existing landscape of available developer tools and runtimes, we felt that there is a gap. Stay tuned for more in 2022!

Cache

Cache Best Practices Strategy Servers

CDN Web Application Firewall (WAF): Your Shield Against Online Threats

IO River

NOVEMBER 2, 2023

In technical terms, network-level firewalls regulate access by blocking or permitting traffic based on predefined rules. At its core, WAF operates by adhering to a rulebook—a comprehensive list of conditions that dictate how to handle incoming web traffic. You've put new rules in place. Let's dive deep into these challenges.‍1.

Traffic

Traffic Network Logistics Architecture

MongoDB Best Practices: Security, Data Modeling, & Schema Design

Percona

APRIL 17, 2023

The main objective of this post is to share my experience over the past years tuning MongoDB and centralize the diverse sources that I crossed in this journey in a unique place. There is an issue with this, which causes the OS to swap even with memory available. Spoiler alert: This post focuses on MongoDB 3.6.X 25.84 - Total 21.04

Best Practices

Best Practices Design Tuning Database

The Speed of Time

Brendan Gregg

SEPTEMBER 25, 2021

A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. Since instances of both CentOS and Ubuntu were running in parallel, I could collect flame graphs at the same time (same time-of-day traffic mix) and compare them side by side.

Speed

Speed Java AWS Virtualization

Migrating Netflix to GraphQL Safely

The Netflix TechBlog

JUNE 14, 2023

The control group’s traffic utilized the legacy Falcor stack, while the experiment population leveraged the new GraphQL client and was directed to the GraphQL Shim. To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render. How does it work?

Traffic

Traffic Latency Cache Metrics

Performance Testing - Tools, Steps, and Best Practices

KeyCDN

AUGUST 15, 2019

Before you begin tuning your website or application, you must first figure out which metrics matter most to your users and establish some achievable benchmarks. Just because everything works perfectly during production testing doesn’t mean that will be the case when your website is flooded with traffic.

Testing Tools

Testing Tools Best Practices Performance Testing Testing

The Speed of Time

Brendan Gregg

SEPTEMBER 25, 2021

A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. Since instances of both CentOS and Ubuntu were running in parallel, I could collect flame graphs at the same time (same time-of-day traffic mix) and compare them side by side.

Speed

Speed Java AWS Virtualization

Keeping Netflix Reliable Using Prioritized Load Shedding

The Netflix TechBlog

NOVEMBER 2, 2020

How viewers are able to watch their favorite show on Netflix while the infrastructure self-recovers from a system failure By Manuel Correa , Arthur Gonigberg , and Daniel West Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. Logs and background requests are examples of this type of traffic.

Traffic

Traffic Metrics Infrastructure Architecture

Telltale: Netflix Application Monitoring Simplified

The Netflix TechBlog

AUGUST 13, 2020

You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning? Telltale learns what constitutes typical health for an application, no alert tuning required. Regional traffic evacuations. A regional traffic shift means one region ends up with zero traffic while another region has double.

Monitoring

Monitoring Tuning Traffic Metrics

Zero Configuration Service Mesh with On-Demand Cluster Discovery

The Netflix TechBlog

AUGUST 29, 2023

Since there were no existing solutions available, we needed to build them ourselves. To improve availability, we designed systems where components could fail separately and avoid single points of failure. Our internal IPC traffic is now a mix of plain REST, GraphQL , and gRPC.

Traffic

Traffic Latency Cloud C++

Rapid Event Notification System at Netflix

The Netflix TechBlog

FEBRUARY 18, 2022

We thus assigned a priority to each use case and sharded event traffic by routing to priority-specific queues and the corresponding event processing clusters. This separation allows us to tune system configuration and scaling policies independently for different event priorities and traffic patterns.

Systems

Systems Traffic Architecture Mobile

Data Movement in Netflix Studio via Data Mesh

The Netflix TechBlog

JULY 26, 2021

From the moment a Netflix film or series is pitched and long before it becomes available on Netflix, it goes through many phases. Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving detailed day-to-day activities¹ and processes of a business domain.

Big Data

Big Data Government Analytics Processing

Solaris to Linux Migration 2017

Brendan Gregg

SEPTEMBER 5, 2017

What follows are topics that may be of interest to anyone looking to migrate their systems and skillset: scan these to find topics that interest you. ## ZFS ZFS is available for Linux via the [zfsonlinux] and [OpenZFS] projects, and more recently was included in Canonical's Ubuntu Linux distribution: Ubuntu Xenial 16.04 LTS (April 2016).

Virtualization

Virtualization AWS Engineering Hardware

Netflix Video Quality at Scale with Cosmos Microservices

The Netflix TechBlog

NOVEMBER 2, 2021

This enables us to use our scale to increase throughput and reduce latencies. Here, based on the video length, the throughput and latency requirements, available scale etc., The quality results are now available to the caller via the getQuality endpoint. Stay tuned for more details on these algorithmic innovations.

Media

Media Innovation Metrics Latency

DBLog: A Generic Change-Data-Capture Framework

The Netflix TechBlog

DECEMBER 17, 2019

Nonetheless, we found a number of limitations that could not satisfy our requirements e.g. stalling the processing of log events until a dump is complete, missing ability to trigger dumps on demand, or implementations that block write traffic by using table locks. Designed with High Availability in mind. Writing events to any output.

Database

Database Traffic Transportation Open Source

DBLog: A Generic Change-Data-Capture Framework

The Netflix TechBlog

DECEMBER 17, 2019

Nonetheless, we found a number of limitations that could not satisfy our requirements e.g. stalling the processing of log events until a dump is complete, missing ability to trigger dumps on demand, or implementations that block write traffic by using table locks. Designed with High Availability in mind. Writing events to any output.

Database

Database Traffic Transportation Open Source

A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications

All Things Distributed

OCTOBER 2, 2017

We were pushing the limits of what was a leading commercial database at the time and were unable to sustain the availability, scalability and performance needs that our growing Amazon business demanded. Durable and Highly-Available – DynamoDB maintains data durability and 99.99

Internet

Internet Internet AWS Performance

HTTP/3: Performance Improvements (Part 2)

Smashing Magazine

AUGUST 22, 2021

Because we are dealing with network protocols here, we will mainly look at network aspects, of which two are most important: latency and bandwidth. Latency can be roughly defined as the time it takes to send a packet from point A (say, the client) to point B (the server). Two-way latency is often called round-trip time (RTT).

Performance

Performance Network Latency Servers

HTTP/3: Practical Deployment Options (Part 3)

Smashing Magazine

SEPTEMBER 6, 2021

Finally, not inlining resources has an added latency cost because the file needs to be requested. As such, a micro-optimization is, again, how you probably need to fine-tune things on a low level to really benefit from it. Note that there is an Apache Traffic Server implementation, though.). What Does It All Mean?

Network

Network Servers Cache Traffic

HTTP/3 From A To Z: Core Concepts (Part 1)

Smashing Magazine

AUGUST 9, 2021

You’ve probably heard things like: “HTTP/3 is much faster than HTTP/2 when there is packet loss”, or “HTTP/3 connections have less latency and take less time to set up”, and probably “HTTP/3 can send data more quickly and can send more resources in parallel”. TLS, TCP, and QUIC handshake durations ( Large preview ).

Transportation

Transportation Internet Internet Network

Technology Performance Pulse

Migrating Critical Traffic At Scale with No Downtime?—?Part 1

Bending pause times to your will with Generational ZGC

Trending Sources

Crucial Redis Monitoring Metrics You Must Watch

MySQL Performance Tuning 101: Key Tips to Improve MySQL Database Performance

Automated observability, security, and reliability at scale

The road to observability demo part 3: Collect, instrument, and analyze telemetry data automatically with Dynatrace

Towards a Reliable Device Management Platform

Turbocharge Your Content Delivery With CDN Multiple Origins Load Balancer!

How digital experience monitoring helps deliver business observability

Migrating Critical Traffic At Scale with No Downtime?—?Part 2

Turbocharge Your Content Delivery With CDN Multiple Origins Load Balancer!

CDN Web Application Firewall (WAF): Your Shield Against Online Threats

Meet Hydrogen: A React Framework For Dynamic, Contextual And Personalized E-Commerce

CDN Web Application Firewall (WAF): Your Shield Against Online Threats

MongoDB Best Practices: Security, Data Modeling, & Schema Design

The Speed of Time

Migrating Netflix to GraphQL Safely

Performance Testing - Tools, Steps, and Best Practices

The Speed of Time

Keeping Netflix Reliable Using Prioritized Load Shedding

Telltale: Netflix Application Monitoring Simplified

Zero Configuration Service Mesh with On-Demand Cluster Discovery

Rapid Event Notification System at Netflix

Data Movement in Netflix Studio via Data Mesh

Solaris to Linux Migration 2017

Netflix Video Quality at Scale with Cosmos Microservices

DBLog: A Generic Change-Data-Capture Framework

DBLog: A Generic Change-Data-Capture Framework

A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications

HTTP/3: Performance Improvements (Part 2)

HTTP/3: Practical Deployment Options (Part 3)

HTTP/3 From A To Z: Core Concepts (Part 1)

Stay Connected