
Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing


Authors: Ron Kohavi, Diane Tang, Ya Xu

Tags: A/B testing, data science, product management, statistics, innovation

Publication Year: 2020

Overview

In our book, we provide a practical guide to accelerating innovation by using trustworthy online controlled experiments, or A/B tests. Our central message is simple: getting numbers is easy, but getting numbers you can trust is hard. We’ve distilled decades of our collective experience leading experimentation at Microsoft, Google, and LinkedIn—companies that run tens of thousands of experiments annually—to create a definitive text for students, engineers, data scientists, product managers, and executives. We argue that while intuition is valuable for generating ideas, it is a poor tool for predicting outcomes; our data shows that only about a third of ideas actually improve the metrics they are designed to. Therefore, rigorous, large-scale experimentation is the most reliable engine for product improvement and business growth. We guide you through the entire lifecycle of an experiment, from defining key metrics and a singular [[Overall Evaluation Criterion (OEC)]] to understanding the statistical foundations and, crucially, avoiding common pitfalls. We address subtle but critical issues like [[Twyman’s Law]] (any surprising result is likely wrong), [[Simpson’s Paradox]], and [[Sample Ratio Mismatch (SRM)]], which can invalidate results if not properly checked. This book is for any organization that wants to build a data-driven culture. In an era dominated by AI and machine learning, the ability to rigorously test ideas and establish causality is no longer a niche skill but a core competency. We provide the principles, techniques, and cultural tenets necessary to build a robust experimentation platform and make consistently better, data-informed decisions.

Book Distillation

1. Introduction and Motivation

Controlled experiments are the scientific gold standard for establishing causality, far superior to relying on correlations or expert opinions, often called the HiPPO (Highest Paid Person’s Opinion). Most new ideas, even those from experts, fail to improve key metrics, which makes it crucial to test them rather than deploy them based on intuition. A successful experimentation culture requires three tenets: a desire to be data-driven, formalized in an [[Overall Evaluation Criterion (OEC)]]; a willingness to invest in the infrastructure to run trustworthy tests; and the organizational humility to recognize that we are poor at assessing the value of ideas.

Key Quote/Concept:

Tenet 3: The Organization Recognizes That It Is Poor at Assessing the Value of Ideas. This is a humbling but critical realization for any organization wanting to innovate effectively. Data from Microsoft, Google, and others show that only 10-30% of experiments generate positive results, meaning most well-intentioned ideas fail to improve the user experience or business metrics.

2. Running and Analyzing Experiments: An End-to-End Example

A successful experiment begins with a clear hypothesis and a well-defined OEC, such as revenue-per-user for users who trigger the change. Proper design involves choosing a randomization unit (usually users), determining the target population, and conducting a power analysis to set the experiment’s size and duration. After running the experiment and collecting data, it’s critical to run sanity checks on [[guardrail metrics]] before interpreting the results and making a launch decision based on both statistical and practical significance.

Key Quote/Concept:

Statistical vs. Practical Significance. A result can be statistically significant (unlikely to be due to chance) but not practically significant (the effect is too small to justify the cost of implementation). Defining a practical significance boundary before the experiment is crucial for making sound launch/no-launch decisions.
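
To make the sizing step concrete, here is a minimal sketch of a power calculation that uses the practical significance boundary as the minimum detectable effect; the baseline conversion rate, the boundary, and the common 16·σ²/δ² rule of thumb for α = 0.05 and 80% power are illustrative assumptions, not prescriptions from the book.

```python
from math import ceil

# Illustrative power calculation (assumed numbers): size the experiment so the
# minimum detectable effect equals the practical significance boundary.
baseline_rate = 0.05          # assumed baseline conversion rate (5%)
practical_boundary = 0.02     # smallest relative lift worth launching (2%)

delta = baseline_rate * practical_boundary       # absolute effect to detect
sigma_sq = baseline_rate * (1 - baseline_rate)   # Bernoulli variance

# Rule of thumb for alpha = 0.05 and 80% power (two-sided):
# n per variant ~ 16 * sigma^2 / delta^2
n_per_variant = ceil(16 * sigma_sq / delta ** 2)
print(f"~{n_per_variant:,} users needed per variant")   # -> ~760,000
```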

3. Twyman’s Law and Experimentation Trustworthiness

Twyman’s Law holds that any result that looks interesting or different is usually wrong. Extreme results often stem from instrumentation errors, data loss, or misinterpretation of statistics. Common threats to [[internal validity]] include violations of the Stable Unit Treatment Value Assumption (SUTVA), survivorship bias, and Sample Ratio Mismatch (SRM). Threats to [[external validity]] include primacy and novelty effects, where the short-term impact of a change differs from its long-term effect.

Key Quote/Concept:

Sample Ratio Mismatch (SRM). An SRM occurs when the observed ratio of users between variants significantly deviates from the designed ratio (e.g., 50/50). It’s a critical trust-related guardrail metric; an SRM indicates a fundamental bug in the experiment setup or data pipeline, invalidating the results.
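
A minimal sketch of an SRM check, assuming a designed 50/50 split; the user counts and the 0.001 alert threshold are illustrative, and a chi-squared goodness-of-fit test stands in for whatever test a real platform uses.

```python
from scipy.stats import chisquare

# Compare observed assignment counts against the designed 50/50 split
# with a chi-squared goodness-of-fit test (counts are illustrative).
control_users, treatment_users = 821_588, 815_482
total = control_users + treatment_users
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare([control_users, treatment_users], f_exp=expected)
if p_value < 0.001:   # a strict threshold commonly used for SRM alerts
    print(f"SRM detected (p={p_value:.2e}); results should not be trusted.")
else:
    print(f"No SRM detected (p={p_value:.3f}).")
```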

4. Experimentation Platform and Culture

Scaling experimentation requires both a technical platform and a supportive culture. Organizations mature through phases: Crawl, Walk, Run, and Fly, progressively increasing experiment velocity and complexity. A robust platform must handle experiment definition, deployment (variant assignment), instrumentation, and analysis. A culture of experimentation requires leadership buy-in, shared goals codified in an OEC, and processes that encourage learning from failures.

Key Quote/Concept:

Experimentation Maturity Models (Crawl, Walk, Run, Fly). This framework describes the phases an organization goes through to become data-driven. ‘Crawl’ is about building prerequisites, ‘Walk’ focuses on running more experiments and building trust, ‘Run’ is about scaling, and ‘Fly’ is when experimentation is the norm for every change, supported by automation and institutional memory.

5. Speed Matters: An End-to-End Case Study

Website and application performance is critical. Even millisecond-level changes in latency can have a significant impact on user engagement and revenue. The impact of latency can be quantified using [[slowdown experiments]], where the Treatment is intentionally slowed relative to the Control. Based on the key assumption of local linear approximation, the negative impact of the slowdown can be used to estimate the positive impact of a corresponding speedup.

Key Quote/Concept:

Slowdown Experiment. This is a powerful technique to quantify the return-on-investment (ROI) of performance improvements. By intentionally degrading performance for a Treatment group and measuring the negative impact on key metrics, you can estimate the value of speeding up the product, justifying investment in performance engineering.
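
A toy illustration of the linear-approximation logic; the latency and revenue numbers below are made up, not results from the book.

```python
# Illustrative sketch of the linear-approximation reasoning behind
# slowdown experiments (numbers are assumptions).
slowdown_ms = 100.0               # intentional delay added to Treatment
revenue_delta_pct = -1.4          # measured revenue impact of the slowdown

# Under a local linear approximation, impact scales with the latency change,
# so the value of a planned speedup can be estimated from the slowdown result.
impact_per_ms = revenue_delta_pct / slowdown_ms
planned_speedup_ms = 60.0
estimated_gain_pct = -impact_per_ms * planned_speedup_ms
print(f"Estimated revenue lift from a {planned_speedup_ms:.0f} ms speedup: "
      f"{estimated_gain_pct:+.2f}%")   # -> about +0.84%
```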

6. Organizational Metrics

Good metrics are essential for measuring progress and ensuring accountability. A useful taxonomy includes: [[Goal metrics]] (or true north metrics) that reflect ultimate success; [[Driver metrics]] (or surrogate metrics) that are shorter-term and causally linked to goals; and [[Guardrail metrics]] that protect against negative side effects. Good metrics should be simple, stable, sensitive, actionable, and resistant to gaming.

Key Quote/Concept:

Metrics Taxonomy: Goals, Drivers, and Guardrails. This framework helps organizations align on what to measure. Goal metrics define success, Driver metrics track the inputs that lead to success, and Guardrail metrics prevent success at an unacceptable cost (e.g., degrading performance or user trust).

7. Metrics for Experimentation and the Overall Evaluation Criterion

Not all business metrics are suitable for experimentation. Experiment metrics must be measurable, attributable, sensitive, and timely. To handle tradeoffs between multiple key metrics (e.g., engagement vs. revenue), they should be combined into a single [[Overall Evaluation Criterion (OEC)]]. A good OEC makes the definition of success explicit, aligns the organization, and empowers teams to make decisions without constant escalation.

Key Quote/Concept:

Overall Evaluation Criterion (OEC). The OEC is a quantitative measure of the experiment’s objective, often a weighted combination of key metrics. It is the single score that determines an experiment’s success, encoding the strategic tradeoffs the organization is willing to make and providing a clear alignment mechanism for innovation.

8. Institutional Memory and Meta-Analysis

As an organization scales experimentation, it builds a digital journal of all changes and their outcomes, known as [[Institutional Memory]]. This repository can be mined through meta-analysis to solidify the experiment culture (e.g., showing the low success rate of ideas), identify best practices, inspire future innovations by revealing patterns, and deepen the understanding of key metrics and their relationships.

Key Quote/Concept:

Meta-Analysis. This is the practice of mining data from the institutional memory of past experiments. It allows you to answer questions like: ‘What is the overall success rate of our ideas?’, ‘Which types of changes are most effective?’, and ‘How do our key metrics relate to each other in practice?’.

9. Ethics in Controlled Experiments

Experiments are conducted on real people, making ethical considerations critical. Key principles, adapted from the Belmont Report, are respect for persons (transparency, consent), beneficence (minimizing risk), and justice (fair distribution of risks/benefits). A useful litmus test is [[equipoise]]: if you are willing to ship a feature to 100% of users, you should be willing to test it on 50%. Deception experiments carry higher ethical risk than standard A/B tests.

Key Quote/Concept:

The A/B Illusion. This refers to the resistance some have to running an experiment when they would be perfectly willing to launch the change to 100% of users without testing. Shipping a feature to everyone is itself an experiment, just an uncontrolled and poorly measured one. If a change is acceptable for all users, it should be acceptable to test it first.

10. Complementary and Alternative Techniques

Online experiments are powerful but should be complemented by other techniques. An ‘ideas funnel’ can be populated using logs-based analysis, user experience research (UER), focus groups, surveys, and human evaluation. When randomized experiments are not possible, [[observational causal studies]] (e.g., Interrupted Time Series, Regression Discontinuity) can be used to assess causality, albeit with lower levels of trust and a high risk of being misled by confounding variables.

Key Quote/Concept:

Hierarchy of Evidence. Randomized controlled experiments are the gold standard for establishing causality. Observational studies and other qualitative methods are lower on the hierarchy but are essential for generating hypotheses and providing context. Using multiple methods to triangulate towards a more accurate measurement is a robust strategy.

11. Building an Experimentation Platform (Advanced Topics)

Building a scalable platform involves key technical decisions. [[Client-side experiments]] (e.g., on mobile apps) introduce challenges with release cycles and data communication not present in server-side tests. Reliable [[instrumentation]] is a critical prerequisite. Choosing a [[randomization unit]] (e.g., user vs. session) is a fundamental decision that impacts user experience and analysis validity. As you scale, a principled [[ramping framework]] (SQR: Speed, Quality, Risk) is needed to safely deploy experiments, and analysis pipelines must be scaled and automated.

Key Quote/Concept:

SQR Ramping Framework. This framework balances Speed, Quality, and Risk when rolling out experiments. It consists of four phases: Pre-MPR for risk mitigation, MPR (Maximum Power Ramp) for precise measurement, Post-MPR for operational scale-up, and optional Long-Term Holdouts for learning.

12. Analyzing Experiments (Advanced Topics)

Trustworthy analysis requires deep statistical understanding. Key concepts include the two-sample t-test, p-values, and confidence intervals. Improving sensitivity often involves [[variance reduction]] techniques like CUPED or triggered analysis. [[A/A tests]] are essential for validating the experimentation system itself. Advanced challenges include dealing with leakage and interference between variants (e.g., in social networks) and designing experiments to measure [[long-term treatment effects]], which may differ significantly from short-term results.
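
As a minimal sketch of one variance-reduction idea mentioned above, the CUPED adjustment can be written in a few lines, assuming a pre-experiment measurement of the same metric is available as a covariate; the synthetic data is purely illustrative.

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: adjust the in-experiment metric y using the pre-experiment
    covariate x. Returns values with the same mean but lower variance."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - np.mean(x))

# Illustrative usage with synthetic data: pre-period behaviour predicts
# in-experiment behaviour, so the adjustment removes much of the variance.
rng = np.random.default_rng(0)
pre = rng.gamma(shape=2.0, scale=5.0, size=10_000)        # pre-experiment metric
post = 0.8 * pre + rng.normal(0, 3, size=10_000)          # in-experiment metric
adjusted = cuped_adjust(post, pre)
print(f"variance before: {np.var(post):.1f}, after CUPED: {np.var(adjusted):.1f}")
```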

Key Quote/Concept:

The A/A Test. This is a test where both variants are identical (Control vs. Control). It is a powerful diagnostic tool for an experimentation platform. If an A/A test shows statistically significant differences more than 5% of the time (the expected Type I error rate), it indicates a fundamental bug in the system, such as incorrect variance calculation, bad randomization, or data pipeline issues.
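
A small simulation (illustrative, not from the book) shows why the A/A test is diagnostic: with a correctly behaving analysis pipeline, roughly 5% of A/A comparisons should come out significant at α = 0.05.

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated A/A tests: both "variants" are drawn from the same distribution,
# so about 5% of runs should be significant at alpha = 0.05.
rng = np.random.default_rng(42)
alpha, n_tests, n_users = 0.05, 1_000, 5_000

false_positives = 0
for _ in range(n_tests):
    control = rng.exponential(scale=10.0, size=n_users)
    treatment = rng.exponential(scale=10.0, size=n_users)   # identical distribution
    _, p = ttest_ind(control, treatment)
    false_positives += p < alpha

print(f"Significant A/A results: {false_positives / n_tests:.1%} (expect ~5%)")
```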


Generated using Google GenAI

Essential Questions

1. Why are trustworthy online controlled experiments considered the ‘gold standard’ for product innovation over intuition or correlational analysis?

We argue that online controlled experiments are the gold standard because they are the most reliable method for establishing [[causality]]. While intuition, often from the HiPPO (Highest Paid Person’s Opinion), is valuable for generating hypotheses, it is a poor predictor of outcomes. Our data from Microsoft and other tech giants consistently shows that only about a third of new ideas, even those from experts, actually improve the metrics they are designed to impact. Correlational studies are also frequently misleading; for example, one might observe that users who experience more software crashes also have lower churn rates. This doesn’t mean crashes are good; it means heavy users, who are inherently less likely to churn, are also more likely to encounter bugs. Controlled experiments, through [[randomization]], isolate the change and create statistically similar groups, ensuring that any observed difference in the [[Overall Evaluation Criterion (OEC)]] is, with high probability, caused by the change itself. This rigorous approach prevents organizations from investing in features that don’t add value or, worse, harm the user experience, making it the most effective engine for sustainable product improvement.

2. What is an Overall Evaluation Criterion (OEC), and why is its formalization a critical tenet for building a data-driven culture?

The [[Overall Evaluation Criterion (OEC)]] is a single, quantitative metric that measures the success of an experiment, encapsulating the strategic goals of an organization. It is often a weighted combination of several key metrics, representing the trade-offs the business is willing to make. For instance, an OEC for a search engine might balance user engagement (like sessions-per-user), relevance, and revenue. Formalizing an OEC is a critical tenet because it moves decision-making from subjective debate to objective evaluation. It forces leadership to explicitly define ‘what success looks like’ and aligns the entire organization around a shared goal. Without a clear OEC, teams can ‘cherry-pick’ metrics that make their feature look good, leading to conflicting priorities and shipping changes that may harm the overall business. A well-defined OEC empowers teams to innovate and make launch decisions autonomously, as the definition of a ‘win’ is clear and universally understood. It is the mechanism that translates high-level strategy into a measurable, actionable compass for day-to-day product development, which is the essence of a truly data-driven culture.

3. What are the most significant threats to an experiment’s trustworthiness, and how can they be mitigated?

The core message of our book is that getting numbers is easy, but getting trustworthy numbers is hard. The most significant threats are subtle issues that can invalidate results. One is captured by [[Twyman’s Law]]: any surprising or interesting result is likely wrong, often due to instrumentation bugs or data pipeline errors. A critical and common threat is [[Sample Ratio Mismatch (SRM)]], where the observed number of users in Control and Treatment groups deviates significantly from the intended split (e.g., 50/50). An SRM is a red flag indicating a fundamental bug in the assignment or logging process, rendering the results untrustworthy. Other threats include violations of SUTVA (e.g., interference between users in a social network), and failing to account for novelty or primacy effects, where short-term results don’t reflect long-term user behavior. Mitigation requires building a robust experimentation platform with automated sanity checks. We advocate for always checking for SRM as a primary guardrail metric, running A/A tests to validate the system’s integrity, and plotting results over time to detect novelty effects. Fostering a culture of healthy skepticism is paramount to ensuring these checks are taken seriously.

Key Takeaways

1. Most Ideas Fail: Embrace Humility and Test Everything

A fundamental and humbling realization from running tens of thousands of experiments is that most new ideas, even those from seasoned experts, do not improve the key metrics they are designed to. Our data across Microsoft, Google, and other companies shows a success rate of only 10-30%. This means that for every three features developed, two of them will likely have a neutral or negative impact on the user experience and business goals. This insight is critical because it dismantles the reliance on intuition or the ‘HiPPO’ and makes the case for a culture of experimentation. Without rigorous testing, organizations risk deploying a majority of changes that degrade their product. Recognizing this low success rate fosters organizational humility and justifies the investment in a platform that can test every significant change, ensuring that only value-adding features are shipped to users. It shifts the goal from ‘shipping features’ to ‘improving outcomes’.

Practical Application: An AI product engineer proposes a new, more complex recommendation model that is expected to improve user engagement. Despite the model’s superior offline evaluation metrics, this takeaway justifies allocating engineering resources to A/B test it against the simpler, existing model. The engineer should pre-register the hypothesis and the OEC (e.g., a combination of click-through rate, session duration, and diversity of recommendations). If the experiment shows the new model is flat or negative on the OEC, the team can avoid the long-term costs of maintaining a more complex system for no real user benefit, and instead iterate on a new hypothesis.

2. Trust is Paramount: Implement Rigorous Guardrails and Sanity Checks

The central theme of our work is that the trustworthiness of experimental results is paramount. A misleading experiment is worse than no experiment at all because it gives false confidence and leads to poor decisions. Trust is built not just on sound statistical theory but on a robust platform and process that actively seeks to find errors. We emphasize the importance of [[guardrail metrics]]—invariants that should not change during an experiment. The most critical of these is checking for [[Sample Ratio Mismatch (SRM)]]. An SRM is a powerful signal that something is fundamentally broken in the experiment setup, randomization, or data pipeline. Other essential checks include running A/A tests (Control vs. Control) to validate the system’s statistical integrity and monitoring for extreme results, which are often symptoms of bugs per [[Twyman’s Law]]. Building these checks into the platform and culture ensures that when a result is presented, the organization can be confident that it reflects a real change in user behavior, not a system artifact.

Practical Application: An AI product team is building an experimentation platform. They should programmatically enforce a pre-analysis checklist for every result. Before any OEC metrics are even displayed, the system must first run and show the results for an SRM test. If the p-value for the SRM test is below a strict threshold (e.g., 0.001), the scorecard should be hidden or flagged as ‘INVALID,’ forcing the experimenter to investigate the discrepancy. This prevents the team from being misled by a surprisingly positive result that was, in reality, caused by a bug that disproportionately dropped low-activity users from the Treatment group.
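
A minimal sketch of that gating logic, with hypothetical names and the 0.001 threshold carried over from the paragraph above; a real platform would compute the SRM p-value from its assignment logs, for example via the chi-squared check sketched earlier.

```python
from dataclasses import dataclass

# Hypothetical scorecard gating on the SRM guardrail (illustrative names).
SRM_ALERT_THRESHOLD = 0.001   # strict threshold, as assumed in the text above

@dataclass
class Scorecard:
    srm_p_value: float   # from a goodness-of-fit test on assignment counts
    oec_delta: float     # treatment-vs-control change in the OEC

def render(scorecard: Scorecard) -> str:
    if scorecard.srm_p_value < SRM_ALERT_THRESHOLD:
        return "INVALID: Sample Ratio Mismatch detected; investigate before reading metrics."
    return f"OEC delta: {scorecard.oec_delta:+.2%}"

print(render(Scorecard(srm_p_value=2e-6, oec_delta=0.031)))
```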

3. The OEC is Your Compass: Codify Strategy into a Single Metric

In any complex product, a change will affect multiple metrics, often in conflicting ways. A feature might increase engagement but decrease revenue, or improve one user segment’s experience at the expense of another’s. To make rational, scalable decisions, these trade-offs must be made explicit. The [[Overall Evaluation Criterion (OEC)]] is the mechanism for this. It is a single metric, often a weighted function of several driver and goal metrics, that defines success for the organization. Formulating an OEC is a strategic exercise that forces leadership to quantify trade-offs (e.g., ‘how much short-term revenue are we willing to sacrifice for a 1% gain in long-term user retention?’). Once established, the OEC serves as a unifying compass. It aligns teams, prevents cherry-picking of metrics, and empowers engineers and product managers to make data-informed decisions quickly and consistently, without needing to escalate every complex result for executive debate.

Practical Application: An AI team is developing a new generative AI feature for a search engine. The feature provides direct answers but might reduce clicks on traditional ad links. The team defines an OEC that combines metrics for user satisfaction (e.g., survey scores, low rate of follow-up queries), long-term engagement (sessions-per-user), and revenue, with a negative weight on computational cost. For example: OEC = (2 * ΔSatisfaction) + (1 * ΔEngagement) + (0.5 * ΔRevenue) - (1.5 * ΔComputeCost). This OEC explicitly states the strategic priority: user satisfaction and engagement are more important than short-term revenue, and efficiency is a major concern. An experiment is a ‘win’ only if this combined score is positive.
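
A small sketch of how such an OEC could be computed from per-metric deltas; the weights mirror the hypothetical formula above, and the deltas are made-up numbers.

```python
# Illustrative OEC computation (weights and deltas are assumptions,
# mirroring the hypothetical example above, not prescriptions from the book).
weights = {
    "satisfaction": 2.0,
    "engagement": 1.0,
    "revenue": 0.5,
    "compute_cost": -1.5,   # negative weight: higher cost lowers the OEC
}

# Normalized treatment-vs-control deltas for each component metric.
deltas = {
    "satisfaction": 0.012,
    "engagement": 0.004,
    "revenue": -0.010,
    "compute_cost": 0.008,
}

oec = sum(weights[m] * deltas[m] for m in weights)
print(f"OEC score: {oec:+.4f} -> {'win' if oec > 0 else 'no launch'}")
```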

Suggested Deep Dive

Chapter: Chapter 3: Twyman’s Law and Experimentation Trustworthiness

Reason: This chapter is the heart of the book’s central argument. While many resources discuss the basics of setting up an A/B test, this chapter details the many subtle and non-obvious ways that experiments can go wrong. Understanding concepts like Sample Ratio Mismatch (SRM), survivorship bias, and the difference between internal and external validity is what separates a novice from an expert practitioner. For an AI product engineer, where results can be counterintuitive and system interactions complex, internalizing the healthy skepticism and diagnostic tools from this chapter is critical for building trustworthy products.

Key Vignette

The $100 Million Ad Headline Change at Bing

In 2012, a simple idea to lengthen ad headlines on Bing by combining the title with the first line of text was suggested. The idea was deemed low-priority and sat in the backlog for over six months. A developer finally implemented it as an experiment, and within hours, alerts fired for abnormally high revenue. After verifying the result was not a bug, the experiment showed a staggering 12% increase in revenue, translating to over $100 million annually in the US alone, without harming user experience metrics. This vignette powerfully illustrates how difficult it is to predict the value of ideas and the immense upside of making it cheap and easy to test them.

Memorable Quotes

Getting numbers is easy; getting numbers you can trust is hard.

— Page 15, Preface

Any figure that looks interesting or different is usually wrong.

— Page 39, Chapter 3: Twyman’s Law and Experimentation Trustworthiness

The Organization Recognizes That It Is Poor at Assessing the Value of Ideas.

— Page 13, Chapter 1: Introduction and Motivation

The OEC is the perfect mechanism to make the strategy explicit and to align what features ship with the strategy.

— Page 22, Chapter 1: Introduction and Motivation

The resistance to an online controlled experiment when giving everyone either Control or Treatment would each be acceptable is sometimes referred to as the “A/B illusion”.

— Page 119, Chapter 9: Ethics in Controlled Experiments

Comparative Analysis

Our book, Trustworthy Online Controlled Experiments, serves as a practical, in-the-trenches guide that complements more theoretical works on statistics and more high-level business books on innovation. While a standard statistical textbook will explain the mathematics behind a t-test, our book explains why that test might give you a misleading p-value in a real-world system due to violations of assumptions, such as when the analysis unit differs from the randomization unit. Compared to a book like Eric Ries’s The Lean Startup, which champions the ‘Build-Measure-Learn’ feedback loop, our work provides the deep technical and organizational blueprint for the ‘Measure’ step at scale. We detail the specific pitfalls, [[guardrail metrics]], and platform architecture necessary to make that loop a trustworthy engine for growth rather than a generator of noise. Unlike many introductory A/B testing guides, we move beyond simple UI changes to address complex backend, AI/ML model, and infrastructure experiments. Our unique contribution is the codification of hard-won lessons from decades of experience at scale, formalizing concepts like [[Sample Ratio Mismatch (SRM)]] and the [[Overall Evaluation Criterion (OEC)]] as first-class citizens in the experimentation lifecycle, providing a definitive text for practitioners building and using large-scale experimentation systems.

Reflection

Our goal in writing this book was to create the definitive guide we wished we had when we started our careers. Its primary strength lies in its practicality and grounding in real-world, large-scale operations at Microsoft, Google, and LinkedIn. We intentionally focused on the ‘trustworthy’ aspect, as this is where most organizations falter. The detailed exploration of pitfalls like [[SRM]], [[Twyman’s Law]], and the nuances of metric definition is, we believe, its most significant contribution. However, the book’s focus on large, mature tech companies could be seen as a weakness. A small startup may find the advice on building a sophisticated platform with an institutional memory repository to be overkill for their immediate needs. Furthermore, while we advocate for the [[OEC]] as a mechanism for strategic alignment, one could argue that a rigid focus on a single metric might stifle radical innovation in favor of incremental hill-climbing. We acknowledge this tension by suggesting a portfolio approach to experimentation, but it remains a valid point of discussion. Our opinions are deeply rooted in empirical evidence from running hundreds of thousands of experiments, so they diverge little from facts in that context. The main area of opinion is in the strategic choices encoded in an OEC, which are, by definition, a reflection of a company’s specific values and goals, not universal truths.

Flashcards

Card 1

Front: What is an [[Overall Evaluation Criterion (OEC)]]?

Back: A single, quantitative metric of an experiment’s objective, often a weighted combination of key metrics. It makes the definition of success explicit and aligns the organization by encoding strategic trade-offs.

Card 2

Front: What is [[Twyman’s Law]] in the context of A/B testing?

Back: The principle that any figure or result that looks interesting or different is usually wrong. It encourages skepticism towards extreme results, which are often caused by instrumentation errors or bugs.

Card 3

Front: What is a [[Sample Ratio Mismatch (SRM)]]?

Back: An SRM occurs when the observed ratio of users between variants (e.g., Control and Treatment) significantly deviates from the designed ratio. It is a critical trust-related guardrail metric that indicates a fundamental bug in the experiment setup.

Card 4

Front: What is the typical success rate of new ideas in online experiments at companies like Microsoft?

Back: Only 10-30% of ideas show a statistically significant positive impact on the metrics they were designed to improve. The majority of ideas fail to add value.

Card 5

Front: What are the three key tenets of a successful experimentation culture?

Back:

  1. The organization wants to be data-driven and has a formalized OEC.
  2. It is willing to invest in the infrastructure for trustworthy tests.
  3. It recognizes that it is poor at assessing the value of ideas (humility).

Card 6

Front: What is the ‘A/B Illusion’?

Back: The ethical or practical resistance some have to running an A/B test on a feature, even though they would be perfectly willing to launch that same feature to 100% of users without any testing. It highlights that launching to everyone is also an experiment, just an uncontrolled one.

Card 7

Front: What is the key assumption that makes [[slowdown experiments]] a valid way to estimate the value of performance improvements?

Back: The principle of local linear approximation. It assumes that the relationship between performance and a key metric is roughly linear around the current operating point, so the measured negative impact of a small slowdown can be used to estimate the positive impact of a corresponding speedup.

Card 8

Front: What is an A/A test and why is it useful?

Back: An A/A test is an experiment where both variants are identical (Control vs. Control). It is a powerful diagnostic tool to validate an experimentation system. If it shows statistically significant differences more than the expected Type I error rate (e.g., 5% of the time), it indicates a bug in the system.


Generated using Google GenAI
