Summary of common mistakes made in A/B testing


On the official blog of PostHog, a service that offers a variety of analytics and research tools, including A/B testing, engineer Lior Neuner has compiled a list of typical mistakes that are often made when conducting A/B tests.

A/B testing mistakes I learned the hard way
https://newsletter.posthog.com/p/ab-testing-mistakes-i-learned-the

Neuner says he has run hundreds of A/B tests while working at Meta and on personal projects, and he lists six common mistakes he learned the hard way from that experience.

◆1: The hypothesis to be tested is unclear
When you run a test, your hypothesis should make clear what is being tested and why. If the hypothesis is unclear, you will not only waste time, you may also unknowingly ship changes that harm the product.

As an example of a bad hypothesis, Neuner cited “changing the color of the checkout button will increase purchases.” This hypothesis does not explain why the change should increase purchases, nor does it say whether only button clicks should be measured or which other metrics matter.

A good hypothesis, on the other hand, is something like: “User research has shown that users don’t know how to get to the purchase page. Changing the color of the button will make more users notice it and go to the purchase page, which will lead to more purchases.” From this hypothesis, it is clear that you only need to track the number of button clicks and the number of purchases.

To avoid this mistake, make sure your hypothesis answers three questions: “Why are you running the test?”, “What changes are you testing?” and “What is expected to happen?”
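As a lightweight way to make those three answers explicit, each hypothesis could be recorded as a small structured object before the test launches. This is a minimal sketch in Python; the field names and example values are illustrative, not taken from Neuner’s post.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """A test hypothesis that answers the three questions above."""
    why: str                 # Why are you running the test?
    change: str              # What change are you testing?
    expected_outcome: str    # What is expected to happen?
    metrics: list = field(default_factory=list)  # What you will measure


# Example: the checkout-button hypothesis from above, written out explicitly.
checkout_button_test = Hypothesis(
    why="User research showed users don't know how to get to the purchase page.",
    change="Change the checkout button color so it stands out.",
    expected_outcome="More users notice the button, reach the purchase page, and purchase.",
    metrics=["button_clicks", "purchases"],
)
```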

◆2: Looking only at the overall results
Suppose you run an A/B test with all users and, looking only at the overall results, it appears that the modified “test” version doubled the conversion rate compared with the unchanged “control” version.

However, when the results are broken down by device type, it turns out that on desktop the unchanged “control” version actually had the higher conversion rate.

Because of Simpson’s Paradox, in which the results for individual groups do not necessarily match the overall result, experiment results should be verified by breaking them down by user attributes such as device type, price range, new versus repeat users, and acquisition channel.
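The reversal is easier to see with concrete numbers. The figures below are made up purely for illustration (they are not the ones from the original post): the test variant doubles the overall conversion rate, yet control wins on desktop, simply because the two variants ended up with very different device mixes.

```python
# Hypothetical (visitors, conversions) per variant and device type.
results = {
    "control": {"desktop": (200, 40), "mobile": (800, 24)},
    "test":    {"desktop": (800, 120), "mobile": (200, 8)},
}

for variant, devices in results.items():
    visitors = sum(v for v, _ in devices.values())
    conversions = sum(c for _, c in devices.values())
    per_device = ", ".join(f"{d}: {c / v:.1%}" for d, (v, c) in devices.items())
    print(f"{variant}: overall {conversions / visitors:.1%} ({per_device})")

# control: overall 6.4% (desktop: 20.0%, mobile: 3.0%)
# test: overall 12.8% (desktop: 15.0%, mobile: 4.0%)
# The test variant doubles the overall rate, but control wins on desktop.
```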

◆3: Including unaffected users in the experiment
Including users who cannot access the feature you are testing, or who have already completed the goal you are measuring, can skew the results, change your conclusions, or lengthen the experiment.

For example, if you fetch the flag from the A/B testing tool before checking whether the user should participate in the experiment, as in the sketch below, users are assigned to a variant even when they never see the change, so the assignment recorded by the tool no longer matches what users are actually shown.
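A minimal sketch of that problematic ordering, assuming a hypothetical flag client `ab_tool` and illustrative helper methods on `user` (none of these names come from a specific tool’s API):

```python
def render_checkout_page(user, ab_tool):
    # BAD: the flag is fetched (and the user is counted in the experiment)
    # before we know whether the user can even see the feature being tested.
    variant = ab_tool.get_feature_flag("new-checkout", user.id)

    if not user.has_items_in_cart():   # ineligible, but already assigned a variant
        return "empty-cart-page"
    if user.has_already_purchased():   # goal already reached, but already counted
        return "order-confirmation-page"

    return "new-checkout-page" if variant == "test" else "old-checkout-page"
```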

To get reliable results, the flag from the A/B testing tool should be fetched only after all other eligibility checks have passed.
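The corrected version of the same sketch simply moves the flag call below the eligibility checks, so only users who can actually see the change are assigned a variant and counted:

```python
def render_checkout_page(user, ab_tool):
    # Run all eligibility checks first; these users never enter the experiment.
    if not user.has_items_in_cart():
        return "empty-cart-page"
    if user.has_already_purchased():
        return "order-confirmation-page"

    # GOOD: only eligible users are assigned a variant and counted.
    variant = ab_tool.get_feature_flag("new-checkout", user.id)
    return "new-checkout-page" if variant == "test" else "old-checkout-page"
```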

◆4: Ending the test too early
Neuner calls this mistake the “peeking problem”: looking at the results before the test is complete and making a decision based on incomplete data. Even if the initial results are statistically significant, there is no guarantee the final results will be.

When running an A/B test, you should calculate the required sample size and duration in advance and stick to them. Some tools will calculate the required duration for you automatically.
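For a rough sense of how that duration is derived, here is a back-of-the-envelope sample-size calculation using only the Python standard library; the baseline rate, expected lift, and daily traffic are placeholder numbers, and a real tool may use a more sophisticated method.

```python
from statistics import NormalDist


def required_sample_size(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect a change between
    two conversion rates (two-sided z-test for proportions)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate) + expected_rate * (1 - expected_rate)
    return (z_alpha + z_beta) ** 2 * variance / (expected_rate - baseline_rate) ** 2


# Placeholder numbers: 5% baseline conversion, hoping to detect a lift to 6%,
# with 2,000 eligible visitors per day split evenly across the two variants.
per_variant = required_sample_size(0.05, 0.06)
days = 2 * per_variant / 2000
print(f"~{per_variant:,.0f} visitors per variant, run for ~{days:.0f} days")
```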

◆5: Rolling out the experiment to all users at once
If you launch the experiment to all users from the start in order to get results quickly, a single problem can make the whole experiment unusable. For example, if you start an A/B test with all users and then discover that the change crashes the app, making it impossible to collect valid results, you cannot simply rerun the A/B test after fixing the bug, because many users have already seen the change.

It is recommended to start the experiment with a small number of users and gradually increase the number of participants while using various metrics to confirm that the app is working properly.
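One way to structure such a gradual rollout is a staged schedule in which each stage only proceeds if basic health checks pass. This is an illustrative sketch; `set_rollout_percentage` and the `health_checks_pass` callback are hypothetical placeholders, not a specific tool’s API.

```python
ROLLOUT_STAGES = [1, 5, 20, 50, 100]  # percent of users included in the experiment


def gradual_rollout(ab_tool, experiment, health_checks_pass):
    """Widen the experiment step by step, rolling back if health metrics
    (crash rate, error rate, load time, ...) look wrong at any stage."""
    for percent in ROLLOUT_STAGES:
        ab_tool.set_rollout_percentage(experiment, percent)
        # In practice you would wait for enough data before checking health.
        if not health_checks_pass(experiment):
            ab_tool.set_rollout_percentage(experiment, 0)  # roll back the change
            raise RuntimeError(f"Rollout halted at {percent}% after failed health checks")
```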

◆6: Ignoring counter metrics
A counter metric is a metric that can be indirectly affected by an experiment. For example, if you are testing a change to your account registration page, a counter metric could be the number of active users. If the registration rate increases but the number of active users stays flat, the change may be misleading users about what the app does, so the newly registered users churn right away.

By defining and tracking metrics that reflect product health, such as user retention rate, session duration, and the number of active users, you can confirm that the change has no unexpected, harmful side effects. “Hold-out tests,” in which a small group of users continues to see the pre-change version even after the experiment ends, are also effective for studying long-term effects.
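As a sketch of what tracking counter metrics could look like in practice, the check below flags any counter metric that drops more than an allowed threshold relative to control; the metric names, thresholds, and values are invented for the example.

```python
# Counter metrics and the largest relative drop tolerated for each (illustrative).
COUNTER_METRICS = {
    "user_retention_rate": 0.02,  # at most a 2% relative drop
    "session_duration": 0.05,
    "active_users": 0.02,
}


def counter_metric_violations(control, test):
    """Return counter metrics whose relative drop exceeds the allowed threshold."""
    violations = {}
    for metric, max_drop in COUNTER_METRICS.items():
        drop = (control[metric] - test[metric]) / control[metric]
        if drop > max_drop:
            violations[metric] = drop
    return violations


print(counter_metric_violations(
    control={"user_retention_rate": 0.40, "session_duration": 310, "active_users": 52000},
    test={"user_retention_rate": 0.37, "session_duration": 305, "active_users": 51800},
))  # flags "user_retention_rate" (a 7.5% relative drop)
```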

“A/B testing is powerful because it allows you to see if your product is improving as a result of the changes you make. But it’s also scary because there are so many ways to screw up the test,” says Neuner, emphasizing the importance of doing A/B testing correctly.

By Bronte
