p-Hacking and False Discovery in A/B Testing

Ron Berman, Leonid Pekelis, Aisling Scott, and Christophe Van den Bulte, 2018, 18-130-10

2019 Top Download Award Winner

Ron Berman, Leonid Pekelis, Aisling Scott, and Christophe Van den Bulte investigate whether online A/B experimenters “p-hack” by stopping their experiments based on the p-value of the treatment effect. If A/B testers p-hack, this behavior may inflate the number of experiments that appear significant but have no true effect (are null), resulting in incorrect decisions and foregone earnings for the company.

The Study

Their data contain 2,101 commercial experiments run on the Optimizely platform before the platform implemented a new method that alleviates the negative effects of p-hacking. In the data, experimenters can track the magnitude and significance level of the effect on every day of the experiment. The authors use a regression discontinuity design to detect p-hacking, i.e., to estimate the causal effect of reaching a particular p-value on the decision to stop the experiment.

They find that experimenters indeed p-hack, especially for positive effects. Specifically, about 57% of experimenters p-hack when the experiment reaches 90% confidence, and p-hacking increases the false discovery rate (FDR) from 33% to 42% among experiments p-hacked at 90% confidence. Furthermore, approximately 70% of the effects of all experiments are truly null, i.e., not expected to yield any improvement over the baseline.
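The mechanism behind this inflation is optional stopping: an experimenter who checks the p-value every day and stops the first time it crosses a significance threshold gives a truly null experiment many chances to look significant by luck. A minimal simulation sketches this (it is not from the paper; the horizon, daily traffic, and number of simulations are illustrative assumptions) by running A/A tests with zero true effect and comparing a daily "peeker" who stops at p < 0.10 against a tester who only looks once at the end:

```python
import math
import random

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def peeking_false_positive_rate(n_days=20, n_per_day=500, alpha=0.10,
                                n_sims=2000, seed=0):
    """Simulate null A/B tests (true effect is zero) and compare two testers:
    one who checks the p-value daily and declares success as soon as
    p < alpha, and one who only looks once after the final day."""
    rng = random.Random(seed)
    early_hits = 0  # null experiments ever flagged significant by the peeker
    fixed_hits = 0  # null experiments significant at the fixed horizon
    for _ in range(n_sims):
        diff, n = 0.0, 0
        peeked_significant = False
        for _ in range(n_days):
            # Each day adds n_per_day users per arm; under the null the
            # cumulative treatment-minus-control sum of unit-variance
            # outcomes gains variance 2 * n_per_day per day.
            diff += rng.gauss(0.0, math.sqrt(2 * n_per_day))
            n += n_per_day
            z = diff / math.sqrt(2 * n)
            p = 2 * (1 - phi(abs(z)))
            if not peeked_significant and p < alpha:
                peeked_significant = True
        if peeked_significant:
            early_hits += 1
        if p < alpha:  # final-day p-value only
            fixed_hits += 1
    return early_hits / n_sims, fixed_hits / n_sims
```

Under these assumptions, the fixed-horizon tester's false-positive rate stays near the nominal 10%, while the daily peeker's rate is well above it, which is why stopping on the p-value turns nominally significant results into false discoveries at an elevated rate.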

Assuming that false discoveries cause experimenters to stop exploring for more effective treatments, the authors estimate the expected cost of a false discovery to be a loss of 1.95% in lift (compared to a median lift of 11% in all experiments), which corresponds to the 76th percentile of observed lifts.

Put into Practice

These findings have implications for practitioners and platform designers.

  • For A/B testers, the authors estimate the costs of p-hacking and identify pitfalls that companies should avoid when designing and running experiments. For example, the finding that 70% of treatment effects are null underscores how difficult it is to design new and effective online innovations, and experimenters should be skeptical if they find significant effects too often.
  • Furthermore, because agency considerations may be at play, experimenters who hire a third-party company to design and run experiments for them should be vigilant when receiving reports about the outcomes of these experiments.
  • For platform designers, the results show how the behavior of experimenters affects the efficacy of their platform and, consequently, their customers' satisfaction.

The authors discuss several potential causes for this behavior and the potential of different methods to remedy these issues for experimenters.

Ron Berman is an Assistant Professor of Marketing at The Wharton School; Leonid Pekelis is a Statistician at OpenDoor; Aisling Scott is a Research Scientist at Facebook; and Christophe Van den Bulte is the Gayfryd Steinberg Professor and Professor of Marketing at The Wharton School.

The authors thank Optimizely for making the data available for this project. They also thank Eric Bradlow, Elea McDonnell Feit, Raghuram Iyengar, Pete Koomen, and participants at the 2018 INFORMS Marketing Science Conference for feedback.

Related links

 2018-2020 Research Priorities Working Paper Competition Winner

