Approaches to analyzing binary data for large-scale A/B testing

Wenru Zhou; Miranda Kroehl; Maxene Meier; Alexander Kaizer

doi:10.1016/j.conctc.2023.101091

Contemp Clin Trials Commun. 2023 Feb 16:32:101091. doi: 10.1016/j.conctc.2023.101091. eCollection 2023 Apr.

Authors

Wenru Zhou¹, Miranda Kroehl², Maxene Meier³, Alexander Kaizer¹

Affiliations

¹ Department of Biostatistics & Informatics, University of Colorado, United States.
² Digital Platforms Organization, Charter Communications, United States.
³ Charter Communications, United States.

Abstract

An industry-academic collaboration was established to evaluate the choice of statistical test and study design for A/B testing in large-scale industry experiments. The standard approach at the industry partner was to apply a t-test to all outcomes, both continuous and binary, and to use naïve interim monitoring strategies whose implications for operating characteristics such as power and type I error rates had not been evaluated. Although many papers have summarized the robustness of the t-test, an evaluation of its performance in the A/B testing context of large-scale proportion data, with or without interim analyses, is still needed. Investigating the effect of interim analyses on the robustness of the t-test is important because interim analyses rely on a fraction of the total sample size, and one should ensure that the desired properties are maintained when the t-test is used not only at the end of the study but also for interim decisions. Through simulation studies, the performance of the t-test, the Chi-squared test, and the Chi-squared test with Yates' correction is evaluated for binary outcome data. Further, interim monitoring through a naïve approach with no correction for multiple testing is compared with O'Brien-Fleming boundaries in designs that allow early termination for futility, difference, or both. Results indicate that, with the large sample sizes used in industrial A/B tests, the t-test achieves power and type I error rates similar to those of the Chi-squared tests for binary outcome data, with and without interim monitoring, and that naïve interim monitoring without corrections leads to poorly performing studies.
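The kind of simulation described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names, the 10% baseline rate, the sample sizes, and the five equally spaced looks are all assumptions chosen for the example. The first function compares the rejection rates of a pooled-variance t-test, a Pearson Chi-squared test, and a Chi-squared test with Yates' correction on simulated binary data. The second estimates the overall type I error when a two-sided z-test is checked at every look and the study stops early for a difference only (futility stopping is not modeled). The O'Brien-Fleming constant 2.040 is the commonly tabulated value for five equally spaced looks at a two-sided 0.05 level and should be checked against group sequential software before any real use.

```python
import numpy as np
from scipy import stats


def single_look_rejection_rates(p_a, p_b, n_per_arm, n_sims=20000, alpha=0.05, seed=0):
    """Estimate how often each test rejects H0 on simulated binary A/B data.

    Assumes equal arm sizes and counts large enough that no margin of the
    2x2 table is zero (true for the large samples discussed here).
    """
    rng = np.random.default_rng(seed)
    n = n_per_arm
    a = rng.binomial(n, p_a, size=n_sims).astype(float)  # successes, arm A
    c = rng.binomial(n, p_b, size=n_sims).astype(float)  # successes, arm B
    b, d = n - a, n - c                                   # failures per arm
    total = 2.0 * n

    # Pooled-variance two-sample t-test applied directly to the 0/1 outcomes.
    pa, pb = a / n, c / n
    var_a = pa * (1 - pa) * n / (n - 1)  # sample variance of 0/1 data
    var_b = pb * (1 - pb) * n / (n - 1)
    t_stat = (pa - pb) / np.sqrt((var_a + var_b) / n)
    p_t = 2 * stats.t.sf(np.abs(t_stat), df=2 * n - 2)

    # Pearson Chi-squared statistic for the 2x2 table, without and with
    # Yates' continuity correction.
    cross = np.abs(a * d - b * c)
    denom = n * n * (a + c) * (b + d)
    p_chi2 = stats.chi2.sf(total * cross**2 / denom, df=1)
    corrected = np.maximum(cross - total / 2, 0.0)
    p_yates = stats.chi2.sf(total * corrected**2 / denom, df=1)

    return {
        "t_test": float(np.mean(p_t < alpha)),
        "chi2": float(np.mean(p_chi2 < alpha)),
        "chi2_yates": float(np.mean(p_yates < alpha)),
    }


def interim_type1_error(p, n_per_look, critical_values, n_sims=20000, seed=1):
    """Overall type I error when |z| is compared with a boundary at each look.

    Both arms share the same true rate p, so any rejection is a false positive.
    """
    rng = np.random.default_rng(seed)
    k = len(critical_values)
    # Cumulative successes per arm after each look.
    x_a = rng.binomial(n_per_look, p, size=(n_sims, k)).cumsum(axis=1)
    x_b = rng.binomial(n_per_look, p, size=(n_sims, k)).cumsum(axis=1)
    n_cum = n_per_look * np.arange(1, k + 1)  # per-arm sample size at each look
    pooled = (x_a + x_b) / (2 * n_cum)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_cum)
    z = (x_a - x_b) / n_cum / se
    crossed = np.abs(z) >= np.asarray(critical_values)
    return float(np.mean(crossed.any(axis=1)))


# Boundaries for five equally spaced looks (hypothetical design).
NAIVE_BOUNDS = [1.96] * 5                               # no multiplicity correction
OBF_BOUNDS = list(2.040 * np.sqrt(5 / np.arange(1, 6)))  # tabulated OBF constant

if __name__ == "__main__":
    print("Type I error, single look:", single_look_rejection_rates(0.10, 0.10, 5000))
    print("Power, single look:", single_look_rejection_rates(0.10, 0.12, 5000))
    print("Naive interim looks:", interim_type1_error(0.10, 1000, NAIVE_BOUNDS))
    print("O'Brien-Fleming looks:", interim_type1_error(0.10, 1000, OBF_BOUNDS))
```

Under these assumed settings, the three single-look tests should reject at nearly the same rate, while repeated unadjusted looks should push the overall type I error well above the nominal 0.05; the exact figures depend on the seed and the number of simulated trials.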

Keywords: A/B testing; Academic-industry partnership; Interim monitoring; O'Brien-Fleming boundaries.

Grants and funding

K01 HL151754/HL/NHLBI NIH HHS/United States