Percentiles don’t work: Analyzing the distribution of response times for web services

Adrian Cockcroft

The mean and percentile measurements hide this structure, but the rest of this post will show how the structure can be measured and analyzed so that you can figure out a useful model of your system, understand what is driving the long tail of latencies and come up with better SLAs and measures of capacity.

Building Netflix’s Distributed Tracing Infrastructure

The Netflix TechBlog

If we had an ID for each streaming session then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls. Using simple lookup indices in Cassandra gives us the ability to maintain acceptable read latencies while doing heavy writes.

Trending Sources

Tuning SQL Server Reporting Services

SQL Performance

Many database administrators find themselves having to support instances of SQL Server Reporting Services (SSRS), or at least the backend databases that are required for SSRS. In each of the deployment models, the role of the database administrator is to make sure that SSRS is stable, dependable, and recoverable.

The Speed of Time

Brendan Gregg

A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. top(1) showed that only the Cassandra database was consuming CPU. I've shared many posts about superpower observability tools, but often humble hacking is just as effective. These can be invisible to top(8).

Virtual consensus in Delos

The Morning Paper

Back in 2017 the engineering team at Facebook had a problem. The initial version of Delos went into production after eight months using a ZooKeeper-backed Loglet implementation, and then four months later it was swapped out for a new custom-built NativeLoglet that gave a 10x improvement in end-to-end latency. Every little helps!

The Performance Inequality Gap, 2021

Alex Russell

TL;DR: A lot has changed since 2017, when we last estimated a global baseline per-page resource budget of 130-170KiB. To update our global baseline from 2017, we want to update our priors on a few dimensions: The evolved device landscape. The Moto G4, for example. Here begins our 2021 adventure. Hard Reset.