HBX Business Blog

How to Minimize the Margin of Error in an A/B Test

Posted by Jenny Gutbezahl on May 23, 2017 at 4:07 PM

A-B Test showing different content on two computer screens

Often when you encounter statistics in the newspaper, in a report from your marketing team, or on social media, the statistics will include a "margin of error." For example, a political poll might estimate that one candidate will get 58% of the vote "plus or minus 2.8%." That margin of error is one of the most important – and least attended to – aspects of statistics.

In statistics, error is any variability that can't be explained by a model. In mathematical symbols, we would say Y = f(X) + error. In words, we'd say, the dependent variable (what we're interested in predicting) is some function of other variables we're measuring, plus error. 

The reason this is called "error" is that when we create a statistical model, we use it to predict our dependent variable. For example, Amazon might run an A/B test where they randomly show a subset of their customers one version of a product page and the remaining customers a different version. They are trying to see if specific aspects of the page affect how much people spend on the product. In this case, Y is the amount spent, and X is the version of the page that they see. 

Perhaps people who see the first page spend an average of $28, and people who see the second page spend an average of $35. If we know that someone saw the first page, and we know nothing else about them, our best guess would be that they spent $28. Any difference between what they actually spent and $28 is error (similarly, for people who see the second page, the difference between actual spending and $35 is error).
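To make this concrete, here is a small Python sketch using made-up spending figures chosen so the two group averages match the $28 and $35 above. Each visitor's "error" is simply the gap between what they actually spent and the mean for the page they saw:

```python
import statistics

# Hypothetical spending data (in dollars) for visitors who saw each page.
page_a = [22, 31, 28, 25, 34]   # averages to $28
page_b = [30, 41, 35, 29, 40]   # averages to $35

mean_a = statistics.mean(page_a)
mean_b = statistics.mean(page_b)

# The "error" for each visitor: actual spending minus the group mean.
errors_a = [spend - mean_a for spend in page_a]
errors_b = [spend - mean_b for spend in page_b]

print(mean_a, mean_b)   # the model's best guess for each group
print(errors_a)         # what the model can't explain
```

Note that the errors within each group sum to zero by construction; error here is not a mistake, just variability the model doesn't capture.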

We always expect some variability across the people in our sample, so we’d expect there to be SOME difference between the people who see the first page and the people who see the second, just by chance. If the errors are distributed in a predictable manner (usually in a bell-shaped curve, or normal distribution), we can estimate how much difference there should be between the two groups if the page had no effect. If the difference is greater than that estimate, we attribute the difference to which page they saw.
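One common way to estimate how large a difference could arise purely by chance is a permutation test: shuffle the page labels many times and see how often a random relabeling produces a gap as big as the one observed. A minimal sketch, reusing the same hypothetical spending figures:

```python
import random
import statistics

rng = random.Random(0)

page_a = [22, 31, 28, 25, 34]
page_b = [30, 41, 35, 29, 40]
observed_diff = statistics.mean(page_b) - statistics.mean(page_a)

# Shuffle the pooled data repeatedly: if the page truly had no effect,
# any split into two groups of five is equally plausible.
pooled = page_a + page_b
chance_diffs = []
for _ in range(10_000):
    rng.shuffle(pooled)
    fake_a, fake_b = pooled[:5], pooled[5:]
    chance_diffs.append(statistics.mean(fake_b) - statistics.mean(fake_a))

# Fraction of shuffles producing a gap at least as large as the real one.
p_value = sum(abs(d) >= observed_diff for d in chance_diffs) / len(chance_diffs)
print(observed_diff, p_value)
```

A small p_value suggests the observed $7 gap is unlikely to be chance alone; with samples this tiny, though, the test has little power either way.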

Here are some of the things that contribute to error:

Variables missing from our model

There are a large number of variables that could influence spending, including the time of year, the economic climate, individual information such as income, and computer-related factors such as how the visitor found the site and how fast their connection is. If these variables can be easily collected and added to the model, the model would still be Y = f(X) + error, but X would include not only the product page but all the other information we have, which would likely lead to a better prediction.

Actual mistakes

Maybe the person wants to buy two items, but accidentally types 22. Oops! Or maybe the analytics engine was configured incorrectly, or the dataset got corrupted somewhere along the way through human error or a technical problem. You can minimize the effect of mistakes by taking time to review and clean your data.
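As one illustration of that kind of cleaning, the sketch below flags order quantities that sit far from the rest of the data (such as a typo of 22 where 2 was intended) for manual review. The two-standard-deviation threshold is an arbitrary choice for this example, not a universal rule:

```python
import statistics

# Hypothetical order quantities with one likely typo (22 instead of 2).
quantities = [1, 2, 1, 3, 22, 1, 2]

mean_q = statistics.mean(quantities)
sd_q = statistics.stdev(quantities)

# Flag anything more than two standard deviations from the mean for review,
# rather than silently deleting it.
flagged = [q for q in quantities if abs(q - mean_q) > 2 * sd_q]
cleaned = [q for q in quantities if abs(q - mean_q) <= 2 * sd_q]
print(flagged)   # the suspicious entry
```

Flagging rather than auto-deleting matters: some extreme values are genuine, and only review can tell a real bulk order from a slipped keystroke.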

Misleading or false information

Maybe the person coming to the site is from a competing retailer, and has no intention of buying the product – they are just visiting the site to do research on the competition. While this source of error is relatively uncommon in behavioral data (such as purchasing a product), it is very common in self-report data. 

Respondents often lie about their behavior, their political beliefs, their age, their education, and so on. You may be able to correct for this somewhat by looking for strange or anomalous cases and doing the same sort of cleaning you'd do for mistakes. You could also use a self-report scale designed to estimate various types of misleading information.

Random or quasi-random factors

There are a number of factors that can lead to variability that are more or less random. Maybe the person is in a good mood, and so more likely to spend money. Maybe the model on one of the product pages looks like the shopper's 3rd grade teacher, who they hated, so they navigate away from the page quickly.

Maybe the person's operating system happens to update just as they are getting to the page, and by the time they reboot, they move on to other things. These things probably can't be built into the predictive model, and are difficult to control for, so they will almost always be part of the error.


So long as errors are basically randomly distributed, we can make a good estimate of how much money visitors will spend and how much this varies between versions. If we have a lot of random error, we may not be able to make a very precise prediction (our margin of error will be large), but there's no reason our estimate should be off in one direction rather than the other.
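Using the same hypothetical spending figures as above, the margin of error on the difference between the two page averages can be estimated from the standard error of that difference. The familiar 1.96 multiplier below assumes roughly normal errors; with samples this small, a t-multiplier would give a somewhat wider margin:

```python
import math
import statistics

page_a = [22, 31, 28, 25, 34]
page_b = [30, 41, 35, 29, 40]

mean_diff = statistics.mean(page_b) - statistics.mean(page_a)

# Standard error of the difference between two independent sample means.
se = math.sqrt(statistics.variance(page_a) / len(page_a)
               + statistics.variance(page_b) / len(page_b))

# Approximate 95% margin of error around the observed difference.
margin = 1.96 * se
print(f"{mean_diff:.1f} plus or minus {margin:.1f}")
```

This is the same "plus or minus" figure quoted in political polls: a $7 difference with a margin of error above $6 is much weaker evidence than the same difference with a margin of $1, which is why larger samples (which shrink the standard error) tighten the margin.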

However, systematic error leads to biased data, which will generally give us poor results. For example, if we decide to run one version of the product page for a month, and the other version the next month, the data may be biased based on time. If the first month is December and the second is January, or if there is a major change to the stock market toward the end of the first month, our comparison won't be valid. That's because the people who see the two pages differ systematically. 

Therefore, differences in spending between the pages are not due to random chance; some of that difference is due to bias. This makes it impossible to determine how much is due to the differences between the pages. The best way to address this is through good study design. Every single person who comes to the site should be equally likely to go to each page. 
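A simple way to satisfy that requirement is to randomize each visitor's page assignment as they arrive, rather than splitting by time period. A minimal sketch (the assign_version helper is illustrative, not from the original post):

```python
import random

rng = random.Random(42)

def assign_version():
    # Every visitor is equally likely to see either page, no matter
    # when they arrive -- avoiding time-based bias like the
    # December-versus-January problem described above.
    return rng.choice(["A", "B"])

assignments = [assign_version() for _ in range(10_000)]
share_a = assignments.count("A") / len(assignments)
print(share_a)   # close to 0.5
```

In a real system the assignment would typically be keyed to a visitor ID (e.g., by hashing it) so the same person sees the same version on every visit, but the principle is the same: assignment must be independent of everything else.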

It's never possible to completely eliminate error, but well-designed research keeps error as small as possible, and provides a good understanding of error, so we know how confident we can be of the results.

Interested in learning more about Business Analytics, Economics, and Financial Accounting? Our fundamentals of business program, HBX CORe, may be a good fit for you:

Learn more about HBX CORe

About the Author

Jenny G

Jenny is a member of the HBX Course Delivery Team and currently works on the Business Analytics course for the Credential of Readiness (CORe) program, and supports the development of a new course in Management for the HBX platform. Jenny holds a BFA in theater from New York University and a PhD in Social Psychology from University of Massachusetts at Amherst. She is active in the greater Boston arts and theater community, and she enjoys solving and creating diabolically difficult word puzzles.

Topics: Business Fundamentals, HBX CORe