Percentiles don’t work: Analyzing the distribution of response times for web services (Updated with code)

adrian cockcroft
12 min readJan 29, 2023
Plot showing the final result of fitting multiple normal distributions to a response time curve

Most people have figured out that the average response time for a web service is a very poor estimate of its behavior: responses are usually much faster than the average, but there's a long tail of much slower responses. The common way to deal with this is to measure percentiles, and track the 90th and 99th percentile response times, for example. The difficulty with percentiles is that you can't combine them, so if you measure percentiles at one minute intervals, you can't average them together to get hourly or daily percentiles. If you do this, you end up with a number on a dashboard that may be useful as an indicator of trends, but it can't tell you what the actual percentile was for that hour or day. Another common method is to measure the proportion of responses that exceeded a service level agreement (SLA) expressed as a percentile limit, like 90% of responses within one second. However it's very difficult to decide what the right SLA is, or to tell how close you are to exceeding it. For example, let's say you have a 99% within 2 seconds SLA, and your current 99th percentile measured over one minute is 1 second. There is no way to model how much more traffic you can send to that system before it exceeds its SLA.
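
To make the aggregation problem concrete, here is a quick sketch in R using made-up lognormal samples to stand in for two one-minute intervals; the distributions are arbitrary, but averaging the two per-minute 99th percentiles gives a noticeably different answer than the true 99th percentile computed over both minutes together.

# Sketch: averaging per-minute percentiles does not recover the true percentile
set.seed(1)
minute1 <- rlnorm(1000, meanlog = 0, sdlog = 0.3)  # a quiet, consistent minute
minute2 <- rlnorm(1000, meanlog = 1, sdlog = 0.8)  # a busier, more variable minute
mean(c(quantile(minute1, 0.99), quantile(minute2, 0.99)))  # averaged per-minute 99th percentiles
quantile(c(minute1, minute2), 0.99)                        # true 99th percentile over both minutes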

This is unfortunate, because we’d really like to be able to build systems that have an SLA that we can share with the consumers of our interfaces, and be able to measure how well we are doing.

The second problem with percentiles is that response time distributions for anything other than a trivial example aren't simple shapes with one peak. The structure and behavior of your system will be visible in the responses. The mean and percentile measurements hide this structure, but the rest of this post will show how the structure can be measured and analyzed so that you can figure out a useful model of your system, understand what is driving the long tail of latencies, and come up with better SLAs and measures of capacity.

I've been thinking about this for a long time. I presented a talk on this analysis of response time distributions in 2016 at Microxchg in Berlin (video). A later version of the slides is included in my Microservices Workshop deck from later that year, slides 168–200 (pdf and keynote are available at github.com/adrianco/slides).

Bill Kaiser of New Relic published this blog post in 2017, which goes some way towards what I'm talking about, but since then I have figured out a new way to interpret the data.

If we can obtain histograms of the response time distribution of a web site or API over a time period, I think there is interesting underlying structure that can be extracted and summarized.

Response times form a one-sided distribution. There's a minimum possible response time for when everything goes right and there are no delays. Every opportunity to do more work than the best case, or to wait longer than the best case, increases the latency, and those delays all add up to create a long tail. This applies to other areas like software project time estimates as well: there's a tendency to focus on the best case, but it always takes longer.

When response time data is collected using the high dynamic range hdrhistogram package, the bins are assigned on a logarithmic basis, so components that are approximated by long-tailed lognormal distributions appear as symmetric, roughly normal components when the bins are plotted.
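
As a rough illustration with simulated data (not from a real service), a lognormal sample that is heavily skewed on a linear axis looks approximately symmetric once the values are binned on a log scale, which is what hdrhistogram-style bins give you.

# Sketch on simulated data: a long-tailed lognormal looks roughly normal
# when plotted on logarithmic bins
set.seed(2)
latency <- rlnorm(100000, meanlog = log(0.2), sdlog = 0.5)  # made-up response times in seconds
hist(latency, breaks = 100)         # long right tail on linear bins
hist(log10(latency), breaks = 100)  # approximately symmetric on log bins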

I spent a long time trying to figure out how to do this analysis at all, and was hopeful that it was an existing technique if I could only find an example. In the end, as a side effect of a discussion on Mastodon (I'm @adrianco@mastodon.social) with Donnie Berkholz (@dberkholz@hostux.social) where I was trying to explain what I was looking for, Ed Borasky (@AlgoCompSynth@ravenation.club) joined in the discussion and knew exactly what I needed. He gave me a few options, and first I looked at the R mixtools library. The documentation is written for people who understand hardcore statistics, with lots of Greek symbols. I have enough math and stats background that I can muddle along and figure out how to use it, so I'm documenting that here without the Greek. What we needed was an Expectation Maximization (EM) algorithm for obtaining a finite mixture model of the data. Each peak in the data is modeled by a normal distribution, and the EM algorithm iterates until it finds the best fit it can for some number of components. We can let the algorithm start with no guidance and take longer to find the peaks, or we can specify starting guesses and the number of components we want, which will be useful when tracking changes in response distributions over time. This package assumes we have all the underlying data that was fed into the histogram, for example from a log file that records the response time of each individual request.

The code is written using the R statistics language and I downloaded a free copy of RStudio to run the examples below.

> install.packages("mixtools")
> library(mixtools)
> data(faithful) # sample data of Old Faithful geyser eruption wait times
> attach(faithful)
> hist(waiting) # draw a histogram, which shows two peaks that look symmetric
Wait time between geyser eruptions as a histogram

Next we run the model and see the results.

> wait <- normalmixEM(waiting)
number of iterations= 39
> summary(wait)
summary of normalmixEM object:
          comp 1    comp 2
lambda  0.360887  0.639113
mu     54.614896 80.091095
sigma   5.871247  5.867714
loglik at estimate: -1034.002
> plot(wait, density=T)
Hit <Return> to see next plot:
Hit <Return> to see next plot:

Lambda is the proportional weight of each underlying component normal distribution; the weights sum to 1.0. Mu is the mean of each component, the latency. Sigma is the width of each component, its standard deviation. Loglik is the quality of the fit, the log-likelihood from the Expectation Maximization algorithm. It can range from minus to plus infinity, and the more positive it is the better the fit. The normalmixEM function assumes that it's looking for a mix of normal distributions. There are other functions in the package for more complex situations.

EM iterations converging after 39 iterations
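
To see how the three numbers per component describe the whole distribution, here is a small sketch (my own check, not part of the mixtools examples) that rebuilds the fitted mixture density from lambda, mu and sigma and overlays it on the geyser data.

# Sketch: reconstruct the mixture density from the fitted parameters
# and overlay it on the data as a visual check
hist(waiting, freq = FALSE, breaks = 20)
x <- seq(min(waiting), max(waiting), length.out = 200)
density_fit <- wait$lambda[1] * dnorm(x, wait$mu[1], wait$sigma[1]) +
               wait$lambda[2] * dnorm(x, wait$mu[2], wait$sigma[2])
lines(x, density_fit)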

The same result is obtained more quickly, with fewer iterations, when good initial estimates are provided. The initial loglik value is higher but trends to the same result. Without any guidance, random starting values are used, and the number of iterations depends on the actual random numbers used for that run. For this demo on an old MacBook (2.7 GHz Dual-Core Intel Core i5), it takes between 2 and 6 ms, around 0.2 ms per iteration. By relaxing the epsilon loglik improvement requirement, even fewer iterations are needed to get a good result. This seems reasonable overhead for a real time algorithm that could be applied to histogram data as part of a metric collection pipeline.

> system.time(wait1 <- normalmixEM(waiting, mu=c(50,80), lambda=.5, sigma=6))
number of iterations= 10
user system elapsed
0.002 0.000 0.002
> summary(wait1)
summary of normalmixEM object:
          comp 1    comp 2
lambda  0.360849  0.639151
mu     54.613614 80.090295
sigma   5.869094  5.869094
loglik at estimate: -1034.002

> system.time(wait2 <- normalmixEM(waiting, mu=c(50,80), lambda=.5, sigma=6, epsilon=0.1))
number of iterations= 4
user system elapsed
0.001 0.001 0.001
> summary(wait2)
summary of normalmixEM object:
          comp 1    comp 2
lambda  0.360256  0.639744
mu     54.596070 80.076551
sigma   5.874538  5.874538
loglik at estimate: -1034.003
Estimate converging in 10 iterations

The geyser seems to have a normal distribution of inter-arrival times, but for computer systems the data is going to be more skewed, and we can try to superimpose several log-normal distributions to get a fit.
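
As a sketch of that idea (using simulated data with component positions I made up, not a real system), taking logs of the raw response times turns each lognormal component into an approximately normal one, so the same normalmixEM call can be used on skewed data.

# Sketch on simulated data: log-transform skewed response times so each
# lognormal component becomes a normal component, then fit as before
set.seed(3)
times <- c(rlnorm(2000, meanlog = log(0.05), sdlog = 0.3),  # fast responses, e.g. cache hits
           rlnorm(1000, meanlog = log(0.8), sdlog = 0.5))   # slower backend calls
logfit <- normalmixEM(log(times), k = 2)
exp(logfit$mu)  # component medians back on the original seconds scale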

For high traffic systems, processing the individual response times for each request may be too much work. If instead they are fed into an hdrhistogram that is processed at regular intervals, then we need to process the histogram itself, and that takes me down several rat-holes. There's an R package called mixComp that looked useful, and another one called mixdist, but they wanted an initial estimate of the number and position of the peaks to be fitted, and they gave a lot of errors even when I tried them out with good-looking data. So in the end I decided to write my own peak estimation code that starts with a log-binned histogram and finds all the peaks. The starting point is a peak finding algorithm, and Ed helped me again by pointing me at the findpeaks routine in the pracma package.
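
For reference, here is roughly what findpeaks returns when pointed straight at the counts of a log-binned histogram. This is just a sketch; hist1 is assumed to be a data frame with a Count column per bucket, like the sample histograms used later in this post.

# Sketch: locate candidate peaks in a log-binned histogram's counts
library(pracma)
p <- findpeaks(hist1$Count, minpeakheight = 0.005 * max(hist1$Count), sortstr = TRUE)
p  # one row per peak: height, index of the peak bin, start bin and end bin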

I then used two sample histograms from the same system to develop an algorithm in R that I call as.peaks. It takes a histogram and returns a data frame containing the peaks. This took a lot of tweaking and iterating until I came up with what looks like a reliable algorithm. The basic structure is that findpeaks is used to locate which bins are peaks; then, starting with the biggest, each of those peaks gets a better estimate made using a spline fit to the top 5 points, and that estimate is improved by attempting to fit a normal distribution. The normal fitted peaks are then subtracted out of the data. Once all the peaks have been removed, new peaks are revealed that were hiding between the others, and a second pass is made using findpeaks and processed the same way. The normal fit fails sometimes, particularly on the smaller peaks, leaving the spline as a fallback.
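
To make the subtract-a-peak step concrete, here is a minimal sketch of that idea. It is my simplification, not the as.peaks code itself, and it assumes the same hypothetical hist1 data frame with a Count column per log bin.

# Sketch: fit one Gaussian to the histogram counts with nls, then subtract it
# so that smaller peaks hidden underneath become visible
bins <- seq_along(hist1$Count)
biggest <- which.max(hist1$Count)
fit <- nls(Count ~ a * dnorm(bins, m, s), data = hist1,
           start = list(a = 5 * max(hist1$Count), m = biggest, s = 2))
residual <- pmax(hist1$Count - predict(fit), 0)  # counts left after removing the biggest peak
plot(bins, residual, type = "h")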

Here's the first of the histograms, and you can see that it has a lot of structure. I've cropped the y-axis in these plots as it's the shape that's important; however, this represents a very large number of data points collected over a short timeframe, something like millions of points for one minute.

My first attempt to analyze this was to manually guess where to put each normal distribution and subtract them out one at a time. The first fit and the residual after it’s subtracted is shown below.

The second fit goes after the next highest peak.

After the third peak is removed, it’s clear there are at least three more interesting peaks, but after a while some peaks are artifacts of how the previous peaks were removed. That’s why an algorithm is needed to take out the guesswork.

I think that the structure of the system, in terms of functionality distributed across a network of lambda functions, microservices, caches, databases and external services, is going to determine the number and position of each peak. Changes in the behavior of the system from minute to minute are going to change the height of each peak, as the workload mix and cache hit rates change. New deployments of code and infrastructure, or unusual activity like failure modes, could change the characteristic positions of each peak.

If this works out, then a small amount of data (three numbers for each significant component in the mix) is going to act as a very compressed yet highly accurate representation of a response time distribution. It's much more useful from a modelling and analysis point of view than the mean response time or a percentile. In addition, distributions can be combined over time to summarize longer periods, and mathematically correct percentiles can be derived from them (sketched below). If we sort the list by latency, feed that list into the estimator for the next mix, and allow for small changes in the mean latency of each component, then any sudden large change in the structure is effectively a reset for the model, likely due to a code push or a failure mode.
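
As a sketch of the percentile claim (reusing the geyser fit from earlier rather than a response time model), any percentile of a fitted mixture can be recovered by numerically inverting its cumulative distribution function.

# Sketch: invert the mixture CDF to recover any percentile from the
# lambda/mu/sigma parameters of the fitted components
mixture_quantile <- function(p, lambda, mu, sigma) {
  cdf <- function(x) sum(lambda * pnorm(x, mu, sigma)) - p
  uniroot(cdf, c(min(mu) - 10 * max(sigma), max(mu) + 10 * max(sigma)))$root
}
mixture_quantile(0.99, wait$lambda, wait$mu, wait$sigma)  # 99th percentile of the fitted model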

The code is about 100 lines of R plus many comments (see this github repo), and it optionally prints a sequence of plots and debug output as shown below.

# Take a histogram h$Value h$Count and find the peaks in it, interpolating between bins
# approximate finite mixed model decomposition starting with histograms of data
# Intended for use on response time histograms with logarithmic bins like hdrhistogram
# so that each (approximately) lognormal component is interpreted as a symmetric normal peak
library(pracma) # for findpeaks

# h - is a data frame containing Values and Counts forming a histogram where Values for each bucket are exponential
# plots - optionally shown
# normalize - divides down the counts to a probability density
# peakcount=0 does a quick estimate of all the peaks, otherwise it does up to the specified number of dnorm fits
# epsilon=0.005 - sets a minimum interesting peak size to 0.5% of the largest peak, the ones that are visible in the plot.
# printdebug turns on helpful intermediate print output to see what's going on

# Returns a data frame with one row per peak, sorted biggest peak first
# row names show the order the peaks were discovered in
# PeakDensity is the histogram count or normalized density
# PeakBucket is the center bucket for the peak
# PeakMin and PeakMax are the start and end buckets for the peak
# if a gaussian fit was obtained, then PeakAmplitude is non-zero, along with updated PeakMean and PeakSD
# The Value of the peak is interpolated to estimate PeakLatency
as.peaks <- function(h, plots=FALSE, normalize=FALSE, epsilon=0.005, peakcount=0, printdebug=F)

The first plot (for the same histogram shown above) shows the actual distribution and the mean, on linear axes. This system has some responses in a fraction of a second, an average of just over a second, a few clear peaks and a long tail extending beyond 40 seconds.

The next plot starts with the same histogram I already showed, then plots the result. Each peak and the distribution it's being subtracted from is shown in the same color. The two largest, black and red, are removed first, then the left-most green one and the small one at the 18th bucket. This reveals additional peaks, which are processed in the second pass, only looking for peaks that are larger than the smallest we picked the first time, to avoid focusing on noise.

Trying this on my second sample histogram, the response times show a slightly lower mean latency.

The distribution shows more structure.

And the fitted peaks cover the sharp low latency ones in a plausible manner.

> as.peaks(hist1, normalize=T, peakcount=10)
  PeakDensity PeakBucket PeakMin PeakMax  PeakMean    PeakSD PeakAmplitude PeakLatency
1 0.115124638         54      46      70 54.241346 2.1974776   0.642060169  1.57129257
2 0.045773931         42      29      46 42.192548 1.9606526   0.228603641  0.49833906
5 0.014108909         49      44      53 48.614035 2.1417907   0.078241384  0.91902547
6 0.011062200         37      29      40 36.732429 1.3955436   0.040165053  0.29615288
3 0.007798191          7       2      14  7.647565 0.9801163   0.020434690  0.01851891
7 0.002867469         10       8      17 10.244946 0.8583217   0.005929608  0.02372072
8 0.002167271         55      54      56 55.000000 0.0000000   0.000000000  1.68911714
4 0.001277573         18      14      21 17.528146 1.4389079   0.004426705  0.04748966

This shows the output for the first histogram plotted above, with the counts normalized so that the area under the histogram is 1.0.

I'm still working on this, and I'm looking for additional histograms from other systems to try it out on. So if anyone has logarithmic response time histograms or logs of raw response times they could share, please let me know…

I will update this post with additional examples, the code is on GitHub, and I’ve submitted a talk proposal to present this work at the Monitorama conference in June.
