AB Testing — So You Know What Really Works

Adrian Cockcroft · Published in The Startup · Mar 21, 2020 · 5 min read

Castillo San Cristóbal, San Juan. Picture by Adrian.

When I joined Netflix in 2007 I was managing a team that built the personalized home page for the DVD-shipping website. The first thing I found was that every user-visible change we made went through an A/B test. This was institutionalized across Netflix from the beginning, and it’s part of the formation story told in Marc Randolph’s book That Will Never Work. As an experienced product manager and the founding CEO, he wanted to be able to measure what was really working. An oft-quoted study by Microsoft Bing found that of the changes they test, one-third prove effective, one-third have neutral results, and one-third have negative results. The problem is that you have to run all the tests to find out which third to keep.

Now, when I’m talking to companies about personalization, I find they are all excited about the latest machine learning algorithms, but very few seem to have the basics in place to tell what’s working and what isn’t.

There’s quite a lot of online information about A/B testing and tools, but I want to concentrate on the basics.

You need an “abtest” database service. In a small system this may be a table in a central relational database, but it’s best set up as a NoSQL data source using something like Amazon DynamoDB or Apache Cassandra, with a caching client library or, at large scale, a microservice data access layer. The table contains a row for every customer that records which tests they are in, which test cell they are in, a timestamp for when they were added, and a flag for whether the test is active. Whenever a customer connects to your service, you read the entire row to see which test experiences they should get. Each customer should be in a small number of tests. The abtest service should be used across your entire system, for all experiments that touch a customer.
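
As a rough sketch of what that lookup could look like, assuming a DynamoDB table named “abtest” keyed by customer_id with an assignments list attribute (the names are illustrative, not a prescribed schema):

```python
import boto3

# Illustrative schema: one item per customer, holding a list of test assignments.
# The table name and attribute names are assumptions for this sketch, not a standard.
dynamodb = boto3.resource("dynamodb")
abtest_table = dynamodb.Table("abtest")

def get_active_assignments(customer_id: str) -> list[dict]:
    """Read the customer's row and return only their active test/cell assignments."""
    item = abtest_table.get_item(Key={"customer_id": customer_id}).get("Item", {})
    # Each assignment records the test name, the test cell, when the customer
    # was added, and whether the test is still active.
    return [a for a in item.get("assignments", []) if a.get("active")]
```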

Customer service should be able to tell what tests a customer is in, and report problems that seem to be correlated with tests. Individual customers can be dropped from a test, and if that happens a lot the test should be shut down quickly.

There are three ways to populate the database. The best source of test data is new users who haven’t seen the product before, so tests should be allocated as part of the signup flow for new customers. Excluding customers who are re-joining the service is a good idea. New customers provide the most sensitive and unbiased results, and are often the most active as they explore the features of the product. The second way is feature-specific, activated when a customer hits a particular condition; for example, at Netflix, when a customer activates a specific device (e.g. an iPad) they may be allocated to a test specific to that device. The third way is to randomly select a group of existing customers. This helps make sure your changes aren’t confusing people who were used to the previous functionality.
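
A minimal sketch of signup-time allocation, under the assumption of a single highest-priority test; the test name, cell labels, and the save_assignment helper are all made up for illustration:

```python
import random
from datetime import datetime, timezone

# Hypothetical definition of the highest-priority test that needs new signups.
SIGNUP_TEST = {"name": "homepage_rows_v2", "cells": ["control", "A", "B", "C", "D"]}

def allocate_new_customer(customer_id: str) -> dict:
    """Randomly assign a brand-new customer to one cell of the signup test."""
    assignment = {
        "test": SIGNUP_TEST["name"],
        "cell": random.choice(SIGNUP_TEST["cells"]),  # uniform random split
        "allocated_at": datetime.now(timezone.utc).isoformat(),
        "active": True,
    }
    # save_assignment(customer_id, assignment)  # persist to the abtest table (not shown)
    return assignment
```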

Successful tests should be turned on for everyone by replacing the abtest with a feature flag, but if there are worries about the impact it’s also useful to run a hold-back test. In this case everyone gets the new experience except a test cell that keeps the old experience, to be sure that the change really is better for everyone.
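
One possible way to implement a hold-back, sketched here with a stable hash of the customer id for bucketing; the 5% hold-back size is just an example, not a recommendation from the post:

```python
import hashlib

def in_holdback(customer_id: str, holdback_pct: int = 5) -> bool:
    """Deterministically keep a small percentage of customers on the old experience."""
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 100
    return bucket < holdback_pct

def choose_experience(customer_id: str) -> str:
    # Everyone gets the new experience except the hold-back cell.
    return "old_experience" if in_holdback(customer_id) else "new_experience"
```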

The number of users to allocate to a test depends on several factors; without getting into the statistical theory, more users will detect a smaller difference more quickly and with higher confidence. The stream of new user signups to your service should be treated as a precious resource that is fully allocated to tests, based on discussions across product management about which tests are the highest priority for quick answers. If you think about the scale that Netflix operates at, and the number of new customers they get every day, they can run far more and bigger tests than their smaller competitors, so they have a scale advantage they can use to tune their product and maintain a higher rate of innovation.
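
To make that point concrete, here is a rough power calculation using statsmodels; the 10% baseline conversion rate and the lifts are invented numbers:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# How many users per cell are needed to detect a lift from a 10% baseline
# conversion rate, at alpha=0.05 and 80% power? Smaller lifts need more users.
for lift_to in (0.12, 0.11, 0.105):
    effect = proportion_effectsize(0.10, lift_to)
    n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
    print(f"detect 10% -> {lift_to:.1%}: ~{n:,.0f} users per cell")
```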

Let’s say you have several million subscribers to a service and are adding about a million new users per quarter; that’s about 10,000 new users per day. (For reference, Netflix added almost 100,000 new users per day in Q4 2019.) If you want to run a test that has four alternatives and a control group, you would split new users five ways at random and put 2,000 per day into each test cell until you reach 10,000 per cell. Then you wait several weeks for statistically significant differences to emerge, and shut down the test. At that scale you can test a handful of new things each month. Existing-user tests should have much larger allocations, perhaps 50,000 users per test cell, and should be selected from users who aren’t currently in a test at all, or are in tests that are very unlikely to interact with each other.
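
The arithmetic is simple enough to sanity-check in a couple of lines, using the same numbers as the example above:

```python
def days_to_fill(signups_per_day: int, num_cells: int, target_per_cell: int) -> float:
    """How long it takes to fill every cell when all new signups feed one test."""
    per_cell_per_day = signups_per_day / num_cells
    return target_per_cell / per_cell_per_day

# 10,000 signups/day, four alternatives plus a control, 10,000 users per cell
print(days_to_fill(10_000, 5, 10_000))  # -> 5.0 days of allocation, then wait weeks
```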

I think it’s good practice to have a big monthly A/B test meeting of all the product managers and engineering managers, including executives. That’s where test results are presented, and there will be a lot of counter-intuitive results, head-scratching, and useful suggestions from the audience. Decisions on what needs a follow-on test, what to kill, and what to turn on for all customers should be made there, and brand-new ideas for tests shared and prioritized.

The mark of a good product manager is that they have useful intuition and judgement about what would be a good thing to test. The engineering manager’s role is to find the fastest and easiest way to implement that test. The data scientist’s role is to make sure the test results make sense and were collected correctly, and to determine whether there is a significant difference: a non-overlapping confidence interval.
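
The non-overlapping confidence interval check can be sketched with statsmodels; the conversion counts below are invented for illustration:

```python
from statsmodels.stats.proportion import proportion_confint

# Invented results: 10,000 users per cell, counting how many converted.
control_ci = proportion_confint(count=1200, nobs=10_000, alpha=0.05)  # ~12.0% converted
test_ci = proportion_confint(count=1340, nobs=10_000, alpha=0.05)     # ~13.4% converted

# The rule of thumb from the text: call the difference significant only when
# the 95% confidence intervals do not overlap.
non_overlapping = control_ci[1] < test_ci[0] or test_ci[1] < control_ci[0]
print(control_ci, test_ci, non_overlapping)
```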

These practices are the basis for hypothesis driven development, as described in the book Lean Enterprise. Nowadays this should be a baseline capability if you are trying to figure out how to improve an online service, or tune a personalization algorithm.

Adrian Cockcroft is a technology strategy advisor and Partner at OrionX.net (ex Amazon Sustainability, AWS, Battery Ventures, Netflix, eBay, Sun Microsystems, CCL).