“We’ve learned that the best data scientists are skeptics and follow Twyman’s law: Any figure that looks interesting or different is usually wrong. Surprising results should be replicated—both to make sure they’re valid and to quell people’s doubts.” Harvard Business Review, Oct 2017
How many times do you replicate a statistically significant web marketing test before reporting it out, making recommendations, or otherwise acting on it as if it were true?
If the answer is “rarely or never”, your web experimentation process is likely missing one of the critical components of scientific reasoning and knowledge discovery: replication.
For the past decade or so, online controlled experiments (such as A/B testing) have become common on practically all major commercial websites.
Most organizations operate under the assumption that with the right tools they can easily and predictably improve their KPIs by adding 5 percent incremental improvement here, 8 percent improvement there, and so forth towards future scientifically proven optimal – or at least improved – states while eliminating or substantially reducing the risk involved in making a change.
We’ve all seen the reports that A/B testing works for sites like Amazon, Facebook, Bing, Google, and the Obama campaign. But for the preponderance of commercial websites that don’t get anywhere close to this level of traffic, and don’t have an 80-person Analysis and Experimentation team like Microsoft, is scientifically valid progress truly being made in the realm of web experiments towards optimal states of user and system behavior?
Attempting to reproduce the results of an experiment or test is a defining feature of the scientific method.
Claims of truth based on a single experiment are misguided. No claim (scientific or otherwise) should be considered an accurate representation of reality unless it can be replicated.
This is one of the reasons scientific and other journals exist – to allow critical peer review and enable third parties to verify methods and results through replication.
Web professionals often already rely on reproducibility in their daily work without even thinking about it, whether or not it’s part of their experimentation process.
Consider the case of a website editor who notices what at first glance appears to be an error in the loading and presentation of a webpage.
The foolish editor will immediately fire off a hasty email alerting colleagues to the problem.
The wise editor, though, will refresh the page, try a different page on the same website, try the same page in a different browser, try the same page on a cell phone, try the same page on a different web connection – all examples of replication before gaining a level of certainty that there is, in fact, a real issue worthy of alerting the QA or dev team.
Replication Is Critical to Knowledge
As Neil deGrasse Tyson told Joe Rogan, without replication there is no scientific objective truth.
Many people are shocked to learn that it’s the nature of science that most research in scientific journals will ultimately be shown to be wrong.
The same is true for experiments run on the web by non-scientists who have not had their results pass the scrutiny of a journal submission.
As Tyson puts it, replication of experiments is “the bleeding edge of science” because it deals with complete unknowns.
There are no facts in the back of a textbook for verification that you conducted the experiment correctly – to say nothing of your interpretation of the results.
Anyone who has taken high school chemistry should recall the wild variation in findings that you or your lab partner found when attempting to replicate the most basic of experiments.
The modern scientific establishment has been described as being in a reproducibility crisis because scientific experiments are not being reproduced before being reported or taken as fact by the public, and when experiments are re-run the findings are very often not confirmed.
That findings are invalidated through failed replication is a natural part of the process of science and not a reason for alarm, but if replication studies are never conducted, results are reported prior to replication, or worse yet failed replications are thrown out rather than accounted for, then it becomes an enormous problem.
People are making decisions and acting based on faulty claims of truth.
As Tyson points out in that interview, news reporters take new findings and report them as fact when it makes for a good story.
Similarly, many people conducting controlled experiments on the web wrongly consider preliminary findings as fact. It is tempting to do this without realizing you are committing an error because it feels like progress is being made.
The reality is that a single test is not a scientifically valid “fact” and shouldn’t be acted on as if it were. Tests must be repeated, with identical setup and with slight variations, to eventually demonstrate the truthfulness of a body of experiments.
It isn’t only scientific experiments that require replication to tease out insights and discover knowledge. Consider the “State of” report style that is increasingly common and is often the result of an annual survey of professionals in an industry.
The recently released DORA (DevOps Research & Assessment) State of DevOps 2018 report, for example, shows changes and trends over time due to replication of the survey.
On page 6 of the 2018 DORA report, replication is noted as critical to “confirm and revalidate prior years’ results”. A one-off survey is unable to show changes or verify previous findings. The data exists in a vacuum and as such isn’t as reliable as similar data from a multi-year survey.
This sort of long-term experiment is known as a longitudinal study, and although it isn’t identical to replication, the two approaches solve for many of the same problems.
Replication is rarely included in web experiment processes today. In a Harvard Business Review interview with Kaiser Fung, founder of the applied analytics program at Columbia University, it’s noted that:
“We tend to test it once and then we believe it. But even with a statistically significant result, there’s a quite large probability of false positive error.”
“False positives can occur for several reasons. For example, even though there may be little chance that any given A/B result is driven by random chance, if you do lots of A/B tests, the chances that at least one of your results is wrong grows rapidly.”
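To make the arithmetic behind that second quote concrete, here is a minimal sketch. It rests on assumptions that are mine, not Fung's: every test is independent, none of the tested changes has any real effect, and each test uses a significance threshold of 0.05. The function name is only illustrative.

```python
def prob_at_least_one_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` independent tests is a false positive.

    Assumes every tested change truly has no effect (all null hypotheses hold)
    and that each test is judged at significance level `alpha`.
    """
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # Probability that all tests correctly come back non-significant is (1 - alpha)^n.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 50):
        print(f"{n:>2} tests: {prob_at_least_one_false_positive(n):.1%}")
```

Under these assumptions, a program that runs 20 such tests has roughly a two-in-three chance of producing at least one "winner" that is pure noise, which is exactly the kind of result a replication run is meant to catch.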
It is vital to understand that statistical significance has no bearing on the validity of an experiment’s test of a hypothesis, and reaching statistical significance certainly doesn’t mean replication isn’t necessary.
Microsoft is one organization that publicly acknowledges its use of replication for web experiments. As they explain in their comprehensive report A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments (section 5.5):
“We ran an experiment in Bing.com where we observed a statistically significant positive increase for one of the key Bing.com OEC metrics…Very few experiments succeed in improving this metric…. whenever key metrics move in a positive direction we always run a certification flight which tries to replicate the results of the experiment by performing an independent run of the same experiment. In the above case, we reran the experiment with double the amount of traffic and observed that there were no statistically significant changes for the same metric.”
In this example from Bing, a finding that was initially thought to be positive and valid, and was statistically significant, was in fact shown to be inaccurate upon replication.
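A replication check of this kind can be sketched in a few lines. This is not Microsoft's method. It is a simple pooled two-proportion z-test applied separately to an original run and a rerun, and the conversion counts are hypothetical numbers I made up for illustration. A finding counts as replicated here only if both runs are significant at the chosen threshold and point in the same direction.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rate, B minus A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def replicated(original: tuple, rerun: tuple, alpha: float = 0.05) -> bool:
    """True only if both runs are significant and the effect has the same sign."""
    z1, p1 = two_proportion_z_test(*original)
    z2, p2 = two_proportion_z_test(*rerun)
    return p1 < alpha and p2 < alpha and (z1 > 0) == (z2 > 0)


if __name__ == "__main__":
    # Hypothetical data: (control conversions, control visitors,
    #                     variant conversions, variant visitors)
    original_run = (500, 10_000, 570, 10_000)   # looks like a win
    rerun = (1_000, 20_000, 1_020, 20_000)      # double the traffic
    print("original:", two_proportion_z_test(*original_run))
    print("rerun:   ", two_proportion_z_test(*rerun))
    print("replicated:", replicated(original_run, rerun))
```

With these made-up counts the original run clears the 0.05 threshold while the larger rerun does not, so the finding would not be treated as confirmed.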
Try telling your CMO that a recent A/B test showing a 20 percent lift in the desired behavior, despite the initial experiment reaching statistical significance, is most likely an anomaly and further testing is required before determining the truth value of the original findings. This is simply the nature of science.
Science often takes a long time and moves slowly, and it always requires meticulous documentation and replication of experiments before considering a claim to be objectively true.
Replication’s Role in the Scientific Method
Reproducibility and falsifiability (the capacity for a hypothesis to be proven wrong) are fundamental to the scientific method that has resulted in our awesome modern world.
For those readers who need a refresher on the scientific method, I recommend this handy guide published by the UC Museum of Paleontology of the University of California at Berkeley, in collaboration with the National Science Foundation and a diverse group of teachers and scientists.
Regarding the role of replication, the guide advises:
“If a finding can’t be replicated, it suggests that our current understanding of the study system or our methods of testing are insufficient…. In some fields, it is standard procedure for a scientist to replicate his or her own results before publication in order to ensure that the findings were not due to some fluke or factors outside the experimental design. The desire for replicability is part of the reason that scientific papers almost always include a methods section, which describes exactly how the researchers performed the study. That information allows other scientists to replicate the study and to evaluate its quality, helping ensure that occasional cases of fraud or sloppy scientific work are weeded out and corrected.”
Similar standards of replication must be a requirement for controlled experiments on the web. Marketers, like scientists, should replicate their own experiments before broadcasting the results or acting on them as if they are true.
Peer Review & the Role of Journals in Science
We’ve seen that replication is a critical step for valid scientific findings. After that has been completed, the next step to validate findings from an experiment is peer review from unbiased experts.
Science journals exist as a platform for peer review, allowing experts to provide feedback on experiment design, analysis, and findings.
Many scientific findings are rejected from inclusion in journals because they are not up to the standards of review committees responsible for maintaining quality and rigor. Only the small subset of overall scientific findings that withstand the scrutiny of unbiased experts are worthy of being considered valid enough to include in a journal.
The great deal of time spent on research that doesn’t make it into a journal isn’t a waste of time; it is a necessary condition for scientific progress.
Even with the added safety net of peer review scrutiny, poorly designed experiments still manage to get published.
As Understanding Science notes:
“Many fields outside of science use peer review to ensure quality. Philosophy journals, for example, make publication decisions based on the reviews of other philosophers, and the same is true of scholarly journals on topics as diverse as law, art, and ethics.”
The only platform known to this author that can solve for peer review today is GoodUI, which can provide ideas and added rigor for marketing organizations conducting experiments.
While this may initially seem untenable as organizations tend to be greedy with their data, the net effect of contributing to a cumulative body of trusted web experiment knowledge should prove to be worth the risk and rigor for those web experimenters who pursue this path.
If scientists worldwide began working in only small isolated pockets without peer review and published experiment methodologies, without a cumulative body of knowledge, scientific progress as we know it would cease.
While replication can help to solve issues with the methods and execution of an experiment, it will not solve for misinterpretation of results, which is shockingly common. This is where critical peer review becomes necessary.
Causes & Scope of the Reproducibility Crisis in Controlled Web Experiments
There is no way to be certain that most organizations are not reproducing experiments before acting on them, but there is evidence of why this would be the case – and anecdotally I’m sure most readers will agree with the Harvard Business Review interview cited earlier that replication isn’t a commonly accepted step in the web testing process.
Lack of reproducibility for web experiments on commercial websites, to the extent that it exists, is likely due to a combination of the following forces:
- Sincere desire for positive results and KPI improvement is probably the most common reason a finding isn’t retested multiple times.
- Commercial organizations are focused on speed – and waiting 6 months, a year, or longer to prove something sounds absurd to many businesses. This is the result of a misunderstanding of the nature of scientific experiments.
- A/B testing is often over-simplified and over-sold as an easy way to prove ideas, resulting in inflated expectations and incentivization to show success through positive results.
- Lack of oversight on experimental methods and analysis of results with no critical internal or external peer review.
- Current generation A/B testing tools typically fail to suggest (or require) replication as a necessary next-step after the conclusion of an initial experiment.
A Path Forward: Bringing Scientific Rigor to Web Experiments
Some organizations may need to adjust their culture of testing and re-set stakeholder expectations about the use cases appropriate for scientific experiments, rigor required for valid scientific experiments, and duration required for initial significance as well as repeated replication.
Scientific certitude doesn’t come quickly or easily just because a team has access to an A/B testing tool and the desire for improvement.
Here are a few suggestions for improving the validity of your web experiments:
- Run fewer tests, test longer, and replicate tests. Reject low-value experiment ideas based on low sample size and impractical duration during the initial analysis.
- Carefully determine sample size and test duration before launching (or even working on) a test. OK, you’ve got a great idea for a test, you know exactly how to set it up, and it’s on an important webpage that gets a fair amount of traffic. Time to set this test up! Not so fast. Your time is better spent designing a valid experiment and finding out whether findings are likely to be statistically significant in an acceptable duration of time (including replication) before setting up an A/B test. I’ve found it widely accepted, both from experts online and from my colleagues, that 250 conversion events (purchases, signups, leads, etc.) are a bare minimum for a test, and any test with fewer than this volume for both control and test versions should be ignored – yet this advice is often not followed. Do not look at results until the pre-determined sample size is achieved.
- Try to reproduce your experiments exactly as you ran them the first time. Did you get a different result or the same? If different, what was different about the experiment design or testing sample?
- Try to reproduce your experiments with slight variations from the original. Did you get a different result or the same? If different, was it most likely due to the known slight variation or was there a difference in the experiment design or testing sample?
- Carefully examine what happens after you implement the results of what is thought to be a valid test. Did the change result in the expected lift in KPI? Scientific validity is useful insofar as it can accurately represent – and predict – states of reality. Taking an action to try to influence or change behavior should be judged not on the level of statistical significance of the experiment that suggested such action, but on the resultant change in user or system behavior over the long term.
- Use post-mortem analysis to figure out what happened. Take an impartial look at the experiment design to determine how a replication study may have differed from the original and how the sample size & duration may have differed from the original.
- Seek internal peer review. If you’ve replicated a result, and think you may be on to something, explain the experiment to colleagues and seek criticism of the test methodology, analysis, and recommendations before deciding to act.
- Accept that A/B testing very rarely results in massive gains. We only see case studies and conference speeches about the ludicrously successful A/B tests, but the reality is that the vast majority of tests don’t result in improved metrics for an organization. Focusing only on the winners is a fallacy of sample selection based on outcomes. As mentioned earlier, it is the nature of science that even positive results from an initial experiment are likely to be proven wrong in the future.
- Calculate the potential ROI for your testing program and realistically determine if valid experiments are worth the effort. This may sound like blasphemy, but not all websites receive enough traffic and enough revenue from online interactions to make A/B testing worthwhile. If your organization cannot afford the staff for an experiments program, doesn’t have the traffic volume required for valid tests, or doesn’t have the patience to wait for statistical significance and replication, then your time and energy may be more prudently invested in other marketing tactics.
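The advice above to settle sample size and duration before launching can be roughed out with a short calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions. The defaults (two-sided significance level of 0.05, 80 percent power) and the example traffic figures are my assumptions for illustration, and the function names are hypothetical rather than taken from any particular testing tool.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed in EACH variant to detect the given relative lift.

    Normal-approximation formula for a two-sided, two-proportion test.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def estimated_days(n_per_variant: int, daily_visitors: int, runs: int = 2) -> int:
    """Days to finish `runs` two-variant experiments (original plus replications)."""
    return ceil(runs * 2 * n_per_variant / daily_visitors)


if __name__ == "__main__":
    # Hypothetical page: 5% baseline conversion, hoping to detect a 10% relative lift,
    # with 2,000 visitors per day entering the experiment.
    n = sample_size_per_variant(baseline_rate=0.05, relative_lift=0.10)
    print("visitors per variant:", n)
    print("days including one replication:", estimated_days(n, daily_visitors=2_000))
```

If the resulting duration is longer than your organization is willing to wait, that is a signal to reject the test idea during the initial analysis rather than to run it underpowered.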
If your experiment idea isn’t worth the time it takes to follow these steps carefully, it probably isn’t worth testing to begin with.
Bring mindfulness to your decisions about which experiments to run and don’t get caught up in the hype tool vendors generate.
Use your best judgment for changes in cases that really aren’t appropriate for an experiment, and monitor KPIs after these sorts of changes are made to detect impact.
Meanwhile, keep looking for something to test that is worth the effort necessary to achieve legitimate results.
I will conclude this lengthy post with a quote from Stephen Jay Gould, from Full House: The Spread of Excellence from Plato to Darwin:
“[O]ur strong desire to identify trends often leads us to detect a directionality that doesn’t exist, or to infer causes that cannot be sustained.”
Be wary of this inclination as it relates to experiments on the web.
Special thanks to the invaluable research published at https://exp-platform.com/.