
Sequential A/B Testing Keeps the World Streaming Netflix Part 1: Continuous Data

The Netflix TechBlog

These observations come from a particular type of A/B test that Netflix runs, called a software canary or regression-driven experiment: a software A/B test between the current and a newer version of the software. Such tests must strictly control false positive (false alarm) probabilities.
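A quick way to see why that control matters: if you run an ordinary fixed-horizon test but peek at the p-value repeatedly as data arrives, stopping at the first p < 0.05, the false alarm rate climbs far above the nominal 5%. Below is a minimal simulation of this peeking problem (an illustration of the motivation, not Netflix's actual sequential procedure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_obs = 2000, 1000
false_alarms = 0

for _ in range(n_sims):
    # A/A test: both groups draw from the same distribution,
    # so any "significant" result is a false positive.
    a = rng.normal(0, 1, n_obs)
    b = rng.normal(0, 1, n_obs)
    # Peek at the running t-test after every 100 observations
    # and stop at the first nominally significant result.
    for n in range(100, n_obs + 1, 100):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            false_alarms += 1
            break

print(f"False alarm rate with peeking: {false_alarms / n_sims:.3f}")
# Roughly 0.15-0.20 with ten peeks -- far above the nominal 0.05.
# Sequential tests are designed so the rate stays at 5% no matter
# how often you look.
```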


Randomness in Software Estimates

Professor Beekums

(..)


Trending Sources


Experimentation is a major focus of Data Science across Netflix

The Netflix TechBlog

A Type-S error occurs when the estimated metric movement has the opposite sign of the true effect; a Type-M error occurs when, given that we observe a statistically significant result, the size of the estimated metric movement is magnified (or exaggerated) relative to the truth. A Type-M error means that we are over-estimating the impact of the treatment. Combined, these two effects reduce the risk of Type-S and Type-M errors.
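The exaggeration is easy to reproduce: in an underpowered experiment, the estimates that happen to clear the significance bar are precisely the ones that overshoot. A minimal simulation (illustrative values, not Netflix data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect, sigma, n = 0.1, 1.0, 50   # small effect, underpowered test
estimates = []

for _ in range(20_000):
    control = rng.normal(0.0, sigma, n)
    treatment = rng.normal(true_effect, sigma, n)
    # Keep only the effect estimates that reach significance.
    if stats.ttest_ind(treatment, control).pvalue < 0.05:
        estimates.append(treatment.mean() - control.mean())

sig = np.array(estimates)
print(f"True effect:                 {true_effect}")
print(f"Mean significant estimate:   {sig.mean():.3f}")  # well above 0.1
print(f"Exaggeration ratio (Type-M): {sig.mean() / true_effect:.1f}x")
print(f"Wrong-sign share (Type-S):   {(sig < 0).mean():.3f}")
```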


Percentiles don’t work: Analyzing the distribution of response times for web services

Adrian Cockcroft

[Plot: the final result of fitting multiple normal distributions to a response time curve]

Most people have figured out that the average response time for a web service is a very poor estimate of its behavior: responses are usually much faster than the average, but there is a long tail of much slower responses.
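A small simulation makes the point concrete. The workload below is hypothetical (a fast mode plus a slow mode from cache misses, GC pauses, or retries), but the pattern matches typical web services:

```python
import numpy as np

rng = np.random.default_rng(7)
# 95% of requests come from a fast mode (~20 ms), 5% from a
# slow mode (~400 ms); together they form a long-tailed mixture.
fast = rng.lognormal(mean=np.log(20), sigma=0.3, size=95_000)
slow = rng.lognormal(mean=np.log(400), sigma=0.5, size=5_000)
latencies = np.concatenate([fast, slow])

print(f"mean:   {latencies.mean():7.1f} ms")
print(f"median: {np.percentile(latencies, 50):7.1f} ms")
print(f"p99:    {np.percentile(latencies, 99):7.1f} ms")
# The mean lands roughly twice the median: most responses are much
# faster than "average", while the p99 is dominated by the slow mode.
```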


Building confidence in a decision

The Netflix TechBlog

Even if results are statistically significant (p-value < 0.05), the estimated metric movements may be so small that they are immaterial to the Netflix member experience, and we are better off investing our innovation efforts in other areas. Similar considerations are relevant when interpreting results: do the results repeat?
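One way to operationalize this distinction is to check the confidence interval against a minimum movement worth acting on, not just against zero. A sketch, where `min_effect` is a hypothetical smallest metric movement that would matter (not a Netflix-published threshold):

```python
from scipy import stats

def assess(diff, se, alpha=0.05, min_effect=0.5):
    """Check a metric movement for both statistical significance
    (CI excludes zero) and practical materiality (movement at
    least min_effect, in the metric's own units)."""
    z = stats.norm.ppf(1 - alpha / 2)
    lo, hi = diff - z * se, diff + z * se
    stat_sig = lo > 0 or hi < 0
    material = abs(diff) >= min_effect
    return lo, hi, stat_sig, material

# With a large sample, a tiny movement is statistically significant...
lo, hi, s, m = assess(diff=0.08, se=0.03)
print(f"CI=({lo:.2f}, {hi:.2f}) significant={s} material={m}")
# ...but immaterial, so the innovation effort may be better spent elsewhere.
```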


PlanAlyzer: assessing threats to the validity of online experiments

The Morning Paper

Our checks are based on well-known problems that arise in experimental design and causal inference… PlanAlyzer checks PlanOut programs for a variety of threats to internal validity, including failures of randomization, treatment assignment, and causal sufficiency. PlanOut itself has been ported to many programming languages at this point.
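For context, here is what a PlanOut program looks like in the Python reference implementation, adapted from the project's introductory examples (the experiment class and parameter names are illustrative):

```python
from planout.experiment import SimpleExperiment
from planout.ops.random import UniformChoice, WeightedChoice

class ButtonExperiment(SimpleExperiment):
    def assign(self, params, userid):
        # Assignment is a deterministic hash of the unit (userid),
        # so the same user always receives the same parameters.
        params.button_color = UniformChoice(
            choices=['#3c539a', '#5f9647'], unit=userid)
        params.button_text = WeightedChoice(
            choices=['Join now!', 'Sign up.'],
            weights=[0.3, 0.7], unit=userid)

exp = ButtonExperiment(userid=42)
print(exp.get('button_color'), exp.get('button_text'))
```

It is programs of this shape that PlanAlyzer analyzes statically for randomization and assignment failures.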


Ensuring Performance, Efficiency, and Scalability of Digital Transformation

Alex Podelko

So here is the list of 21 sessions on my “to attend” list, in the same random order they appear in the session list (check the full agenda, as you may be interested in other topics and technologies, and there are many more great sessions there). How is DevOps changing the Modern Software Development Landscape?