Rarefaction is currently the best approach to control for uneven sequencing effort in amplicon sequence analyses

mSphere. 2024 Feb 28;9(2):e0035423. doi: 10.1128/msphere.00354-23. Epub 2024 Jan 22.

Abstract

Because it is common to find as much as 100-fold variation in the number of 16S rRNA gene sequences across samples in a study, researchers need to control for the effect of uneven sequencing effort. How to do this has become a contentious question. Some have argued that rarefying or rarefaction is "inadmissible" because it omits valid data. A number of alternative approaches have been developed to normalize and rescale the data that purport to be invariant to the number of observations. I generated community distributions based on 12 published data sets and used them to assess the ability of multiple methods to control for uneven sequencing effort. Rarefaction was the only method that could control for uneven sequencing effort when measuring commonly used alpha and beta diversity metrics. Next, I compared the false detection rate and the power to detect true differences between simulated communities with a known effect size using various alpha and beta diversity metrics. Although all methods of controlling for uneven sequencing effort had an acceptable false detection rate when samples were randomly assigned to two treatment groups, rarefaction was consistently able to control for differences in sequencing effort when sequencing depth was confounded with treatment group. Finally, the statistical power to detect differences in alpha and beta diversity metrics was consistently the highest when using rarefaction. These simulations underscore the importance of using rarefaction to normalize the number of sequences across samples in amplicon sequencing analyses.

Importance: Sequencing 16S rRNA gene fragments has become a fundamental tool for understanding the diversity of microbial communities and the factors that affect it. Due to technical challenges, it is common to observe wide variation in the number of sequences that are collected from different samples within the same study. However, the diversity metrics used by microbial ecologists are sensitive to differences in sequencing effort. Therefore, tools are needed to control for the uneven levels of sequencing. This simulation-based analysis shows that despite a longstanding controversy, rarefaction is the most robust approach to control for uneven sequencing effort. The controversy started because of confusion over the definition of rarefaction and violation of assumptions that are made by methods that have been borrowed from other fields. Microbial ecologists should use rarefaction.
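
As a rough illustration of why richness depends on sequencing effort, the sketch below treats rarefaction as repeated subsampling without replacement to a common depth, with the metric averaged across draws. That reading of the term is an assumption here, since the abstract does not spell out the procedure. The function names, the simulated community, and the read depths are all hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)


def subsample(counts, depth, rng):
    """Draw `depth` reads without replacement from a vector of taxon counts."""
    # Expand the count vector into one entry per read.
    pool = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    drawn = [0] * len(counts)
    for taxon in rng.sample(pool, depth):
        drawn[taxon] += 1
    return drawn


def rarefied_richness(counts, depth, iterations=100, seed=0):
    """Mean richness over repeated subsamples (assumed meaning of rarefaction)."""
    rng = random.Random(seed)
    return mean([richness(subsample(counts, depth, rng)) for _ in range(iterations)])


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical community: 200 taxa with a long tail of rare members.
    weights = [1 / (rank + 1) for rank in range(200)]

    def sequence(n_reads):
        """Simulate sequencing the same community to a given depth."""
        counts = [0] * len(weights)
        for taxon in rng.choices(range(len(weights)), weights=weights, k=n_reads):
            counts[taxon] += 1
        return counts

    shallow, deep = sequence(1_000), sequence(20_000)
    print("observed richness:", richness(shallow), richness(deep))
    print("rarefied to 1,000 reads:",
          rarefied_richness(shallow, 1_000), rarefied_richness(deep, 1_000))
```

Both simulated samples come from the same community, so any gap in observed richness reflects sequencing depth alone; comparing the samples at a shared depth is intended to remove that gap. The common depth is usually bounded by the smallest sample that is kept, which is the data loss that critics of the approach point to.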

Keywords: 16S rRNA gene sequencing; amplicon sequencing; bioinformatics; data science; microbial ecology; microbiome.

MeSH terms

  • Computational Biology / methods
  • High-Throughput Nucleotide Sequencing / methods
  • Microbiota* / genetics
  • RNA, Ribosomal, 16S / genetics
  • Sequence Analysis, DNA / methods

Substances

  • RNA, Ribosomal, 16S