The Basic Idea
If you run an online store and want to increase sales, you might think about redesigning your product page. But you’re not sure where to start. Do you rewrite the product description? Upload fresh product photos? Change the size and color of your “add to cart” button? Making all these changes at once is a lot like experimenting with a recipe without tasting it along the way. Perhaps it turns out great, or maybe it’s a total disaster. Either way, you’ll never know what caused the outcome.
With A/B testing, you can measure the impact of each individual change on user behavior. You will learn what works and, just as crucially, what doesn’t. For example, you might discover that changing your product photos improves conversion rates, but writing new product descriptions has the opposite effect.
A/B testing is a method for comparing two versions of a webpage, app, email, or advertisement to determine which one performs better. It’s also called “split testing” because it involves splitting your audience into two groups and showing each group a different version (called a variant) of whatever you’re testing. One group views variant A, and the other views variant B. By comparing the results from the two groups, you can see which variant performs better.
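To make the mechanics concrete, here is a minimal Python sketch of a 50/50 split. Everything here is illustrative: the user IDs, the traffic volume, and the underlying conversion rates (10% for A, 12% for B) are made-up numbers, not data from any real test.

```python
import hashlib
import random

def assign_variant(user_id: str) -> str:
    """Deterministically split users 50/50: the same user always sees the same variant."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Simulate 10,000 visitors; conversion tracking is faked with
# made-up underlying rates (10% for A, 12% for B) purely for illustration.
visits = {"A": 0, "B": 0}
conversions = {"A": 0, "B": 0}
random.seed(42)

for i in range(10_000):
    variant = assign_variant(f"user{i}")
    visits[variant] += 1
    if random.random() < (0.10 if variant == "A" else 0.12):
        conversions[variant] += 1

for v in ("A", "B"):
    print(f"Variant {v}: {conversions[v] / visits[v]:.1%} conversion rate")
```

One design note: hashing a stable user ID (rather than flipping a coin on every page load) keeps each visitor in the same group for the duration of the test, which is generally how real experimentation platforms bucket users.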
Theory, meet practice
TDL is an applied research consultancy. In our work, we leverage the insights of diverse fields—from psychology and economics to machine learning and behavioral data science—to sculpt targeted solutions to nuanced problems.
Variant: Each version being tested, often called variant A and variant B. Typically, variant A is the original version as it currently exists, while variant B is the version with the change(s) you want to test.1
Conversion Rate: The percentage of users who take a desired action, such as clicking a link or filling out a form. Often, A/B tests use conversion rates to compare the performance of the two variants being measured.
Control Group: The group of users exposed to the existing or original version (variant A). This group establishes the baseline for comparison, confirming or denying if variant B produces an effect.
Randomization: The process of randomly assigning users to the different variants. This minimizes the effect of individual factors outside of your control (which can interfere with the accuracy of the results).
Statistical Significance: The likelihood that the results of the A/B test reflect an actual difference between the two variants rather than random chance. For example, if an A/B test reaches 95% statistical significance, there is only a 5% chance that the observed difference is due to random variation alone.2
Segmentation: The process of dividing users into smaller groups based on shared characteristics such as gender, age, location, income level, device, preferences, and purchase history. This will give you greater insight into the distinct preferences of each audience group.
Multivariate Testing: While a standard A/B test compares two versions of a single variable, multivariate tests compare multiple variables at once to see how they interact with one another.
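The statistical significance of a difference in conversion rates is commonly assessed with a two-proportion z-test. The sketch below uses only Python's standard library; the conversion counts (120 of 1,000 visitors for variant A, 150 of 1,000 for variant B) are illustrative numbers, not real data.

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via the error function; two-sided p-value
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Illustrative numbers: 120/1000 conversions for A vs. 150/1000 for B
z, p = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p below 0.05 -> significant at the 95% level
```

With these particular numbers the p-value lands just under 0.05, which illustrates why sample size matters: the same 12% vs. 15% split over only 100 visitors per variant would not reach significance.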
Before A/B testing, marketers often made decisions based on their subjective opinions and assumptions. As such, they struggled to fine-tune advertisements and maximize the efficacy of their campaigns, leading to high costs and unpredictable returns on investment (ROI).
Claude C. Hopkins, an advertiser who introduced many new concepts to the world of marketing, was one of the first to address these problems.3 Working with various brands through the early 1900s, he effectively popularized the use of test campaigns. He tested different headlines, offers, and propositions against each other to determine what performed best, then used the results to improve his advertising. Clearly, this was the right move, as Hopkins achieved great success as an advertiser. He’s even credited with popularizing tooth brushing through his advertising campaigns for Pepsodent toothpaste!
Hopkins published a book titled Scientific Advertising in 1923, which has since been cited as one of the most important pieces of literature in the marketing world. In the book, he describes the split-testing process and outlines an advertising approach based on scientific testing and measuring.4 The goal: to minimize losses from unsuccessful campaigns and maximize profits from successful ads.
Hopkins asserted that the only way to answer questions about your audience is through the use of test campaigns. These arguments revolutionized the world of advertising and encouraged other brands to begin using controlled experiments to refine their marketing strategies. One of the first uses of A/B testing in UX design occurred in 1960, when Bell Systems experimented with different versions of the buttons on telephone sets.1
With the growth of the internet, new ways to conduct these advertising tests emerged. Suddenly, A/B testing became incredibly fast and scalable. It was now possible to collect data in real-time through clicks, sign-ups, and other user interactions rather than waiting around for people to show up with print coupons they received in the mail.
Google was one of the first to run digital A/B tests, conducting their first test in 2000 to determine the ideal number of results to display on their search engine results page.5 After this, A/B testing grew popular, becoming standard practice in the digital marketing industry. Today, companies like Google run over 10,000 A/B tests every year!
The popularization of A/B testing revolutionized the world of advertising, as these data-driven insights are critical to the success of modern-day marketing campaigns. In a business culture that values continuous improvement and optimization, A/B testing enables businesses to keep up with competitors and respond to ever-changing consumer trends. A/B testing also plays an important role in reducing advertising costs. By using A/B tests to identify the most effective strategies in advance, marketers can extract more value from their advertising budgets.
Beyond its use in marketing campaigns, A/B testing has become fundamental to product design and development. Instead of making assumptions about customer preferences, A/B tests enable designers to find user-based answers to common design questions, like which hyperlink colors will encourage more clicks or which form layout will drive more sign-ups.
Through A/B testing, brands often identify small design changes that produce significant results. For example, Amazon used an A/B test to discover that simply moving their credit card offers from the home page to the shopping cart page would boost their profits by tens of millions of dollars annually.5
This approach reduces the risks associated with making design changes, allowing organizations to assess the impact of these changes before committing to full implementation. For more costly upgrades, brands use A/B testing to determine if a change will produce enough value to justify the expense. Does increasing page load speed by 100ms increase revenue enough to cover the cost of this upgrade? In an A/B test conducted by Bing, the answer was a resounding yes!5
While A/B testing has become commonplace for marketers, it's not without its share of criticisms. During the first Practical Online Controlled Experiments Summit in 2018, several A/B testing experts—from organizations including Airbnb, Amazon, Netflix, Facebook, and Google—produced a paper addressing the top challenges of running A/B tests at such a large scale.6 According to these experts, one of the main issues is estimating the long-term effect of certain changes. For example, placing more ads at the top of a search engine results page might boost short-term revenue, but it could lead to reduced engagement in the long run as users learn to ignore these ads or become frustrated with their abundance. A short-term A/B test could overlook these long-term effects.
Another issue is the amount of time it takes to get statistically significant results from a test, especially for smaller companies. You have to wait for enough people to respond to the variants before you can be confident in the reliability of the results. By the time the A/B test is complete and you can act on the results, consumer trends may have changed. Because of this, A/B testing tends to be more effective for companies with large sample sizes.
Marketing experts have highlighted several other drawbacks of relying on A/B testing for product design. One clear problem is that you can only use the test to answer very specific design questions. You’re limited to testing small changes (the color of a button, the placement of an image, etc.), and you have to test these one at a time. If you don’t get a significant result from your test, these “small” changes can waste time and resources.
At the same time, A/B testing offers little insight into why customers behave the way they do. Because you won’t know why people prefer one option over the other, you can’t generalize the results to other design decisions. The test only tells you what works for the specific variants you’re testing.
In a similar vein, A/B testing doesn’t help you discover potential optimizations that you are not testing. If there is a problem with your website that you haven’t noticed, A/B testing will not reveal it. The results of these tests are not open-ended, such as those of usability tests where users attempt to complete tasks to reveal potential problems. Ideally, you would use both quantitative methods (such as A/B testing) and qualitative methods (such as usability testing) to learn about user preferences and behavior.
Real-world examples of A/B testing are abundant, producing widely successful results in various industries, from email marketing and social media to e-commerce and streaming platforms. A/B testing is even used in political campaigns.
Dan Siroker, co-founder of web-testing company Optimizely, worked as Director of Analytics for the Obama campaign in 2008. At the time, the campaign website struggled to turn visitors into email subscribers, which was necessary to drum up donations. Originally, people landing on the website would see a simple page with a color photo of Obama, a call to action, and a red “sign up” button.
Siroker decided to test different variants of the landing page in an effort to boost sign-ups.7 He created four different button variations labeled: “join us now”, “learn more”, “sign up now”, and “sign up”, as well as six different photo/video variants. Every visitor to the page was randomly shown one combination of these variants.
During the test, 310,382 people visited the page, and roughly 13,000 of these people saw each variant. The best-performing combination was the “learn more” button under a black-and-white image of the Obama family, which landed 40.6% more signups than the original version. Throughout the campaign, these simple changes brought in an additional 2,880,000 email addresses! Given that each email sign-up generated an average donation of $21 by the end of the campaign, the tests led to an additional $60 million in donations.
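As a quick sanity check on the arithmetic reported in the case study, the figures above can be multiplied out directly:

```python
# Figures from the Obama campaign case study
extra_emails = 2_880_000       # additional email sign-ups attributed to the test
avg_donation_per_signup = 21   # average donation per email address, in dollars

extra_donations = extra_emails * avg_donation_per_signup
print(f"${extra_donations:,}")  # $60,480,000 -- roughly the $60 million reported
```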
Beyond these impressive results, there’s another interesting thing to note about this case study—the test results went directly against the opinions of the campaign staff. The staff preferred a video of Obama speaking at a rally over the still photos and assumed the video would outperform any photo variants on the landing page. In reality, all of the video variants performed worse than the images of Obama. In this case, it would have been a huge mistake if the team had gone with their preferences instead of running the A/B test.
Related TDL Content
Cognitive biases and subjective opinions can cloud our decision-making abilities, leading us toward choices that are less than ideal. This is why data is key. Data-driven decision-making mitigates some of these biases so we can more accurately predict the outcomes of our choices. This article explores the concept of data science in detail.
Project managers are often prone to cognitive errors when interpreting user testing data, including data from A/B tests. This article explores these errors in the context of how project managers rely on statistical significance when determining whether or not to implement product changes. Give it a read to learn how you can avoid making a common mistake.
1. Young, S. W. H. (2014). Improving Library User Experience with A/B Testing: Principles and Process. Weave: Journal of Library User Experience, 1(1). https://doi.org/10.3998/weave.12535642.0001.101
2. Tenny, S., & Abdelgawad, I. (2023, November 23). Statistical Significance. StatPearls Publishing. https://www.ncbi.nlm.nih.gov/books/NBK459346/
3. Much, M. (2018, December 20). Claude Hopkins Turned Advertising Into A Science, Brands Into Household Names. Investor’s Business Daily. https://www.investors.com/news/management/leaders-and-success/claude-hopkins-scientific-advertising-bio/
4. Hopkins, C. C. (1923). Scientific Advertising. https://www.scientificadvertising.com/ScientificAdvertising.pdf
5. Kohavi, R., & Thomke, S. (2017, September). The Surprising Power of Online Experiments. Harvard Business Review, 74–82. https://hbr.org/2017/09/the-surprising-power-of-online-experiments
6. Gupta, S., Kohavi, R., Tang, D., Xu, Y., Andersen, R., Bakshy, E., ... Yashkov, I. (2019, June). Top Challenges from the First Practical Online Controlled Experiments Summit. SIGKDD Explorations, 21(1), 20. https://bit.ly/OCESummit1
7. Siroker, D. (2010, November 29). Obama's $60 Million Dollar Experiment. Optimizely. https://www.optimizely.com/insights/blog/how-obama-raised-60-million-by-running-a-simple-experiment/