Optimal sequencing strategies for identifying disease-associated singletons

Sara Rashkin; Goo Jun; Sai Chen; Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO); Goncalo R Abecasis

doi:10.1371/journal.pgen.1006811

PLoS Genet. 2017 Jun 22;13(6):e1006811. doi: 10.1371/journal.pgen.1006811. eCollection 2017 Jun.

Authors

Sara Rashkin¹,², Goo Jun¹,³, Sai Chen¹,⁴; Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO); Goncalo R Abecasis¹

Affiliations

¹ Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America.
² Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, California, United States of America.
³ Human Genetics Center, School of Public Health, University of Texas Health Science Center at Houston, Houston, Texas, United States of America.
⁴ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America.

Abstract

With the increasing focus of genetic association on the identification of trait-associated rare variants through sequencing, it is important to identify the most cost-effective sequencing strategies for these studies. Deep sequencing will accurately detect and genotype the most rare variants per individual, but may limit sample size. Low-pass sequencing will miss some variants in each individual but has been shown to provide a cost-effective alternative for studies of common variants. Here, we investigate the impact of sequencing depth on studies of rare variants, focusing on singletons, the variants that are sampled in a single individual and are hardest to detect at low sequencing depths. We first estimate the sensitivity to detect singleton variants in both simulated data and in down-sampled deep genome and exome sequence data. We then explore the power of association studies comparing burden of singleton variants in cases and controls under a variety of conditions. We show that the power to detect singletons increases with coverage, typically plateauing for coverage > ~25x. Next, we show that, when total sequencing capacity is fixed, the power of association studies focused on singletons is typically maximized for coverage of 15-20x, independent of relative risk, disease prevalence, singleton burden, and case-control ratio. Our results suggest sequencing depth of 15-20x as an appropriate compromise between singleton detection power and sample size for studies of rare variants in complex disease.
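The trade-off described in the abstract (deeper coverage finds more singletons per person, but a fixed sequencing capacity then covers fewer people) can be illustrated with a small toy calculation. The sketch below is not the authors' simulation: it assumes alternate-allele reads at a heterozygous site follow a Poisson distribution with mean depth/2, that a singleton is called when at least three such reads are seen, that singleton counts per individual are Poisson, and that power follows a normal approximation. All function names, parameters, and default values are hypothetical. Because it ignores false-positive calls and genotyping error, the depth at which this toy curve peaks should not be read as reproducing the 15-20x result.

```python
import math


def singleton_sensitivity(depth, min_alt_reads=3):
    """Toy detection probability for a heterozygous singleton.

    Assumption: alt-allele reads ~ Poisson(depth / 2), and the variant is
    called when at least `min_alt_reads` alt reads are observed.
    """
    mean_alt = depth / 2.0
    p_below = sum(
        math.exp(-mean_alt) * mean_alt ** k / math.factorial(k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def burden_power(depth, total_capacity=30000.0, burden_controls=1.0,
                 relative_burden=1.1, z_alpha=1.96):
    """Approximate power of a case-control singleton burden comparison.

    Assumptions (all hypothetical): total_capacity is measured in
    "x of coverage" summed over samples, cases and controls are equal in
    number, and per-person detected singleton counts are Poisson.
    """
    n_per_group = total_capacity / depth / 2.0
    sens = singleton_sensitivity(depth)
    mean_controls = burden_controls * sens
    mean_cases = burden_controls * relative_burden * sens
    # Standard error of the difference between two Poisson sample means.
    se = math.sqrt((mean_cases + mean_controls) / n_per_group)
    z = (mean_cases - mean_controls) / se
    # One-tail approximation to two-sided power at the z_alpha threshold.
    return normal_cdf(z - z_alpha)


for depth in (4, 8, 12, 16, 20, 30):
    print(depth, round(singleton_sensitivity(depth), 3),
          round(burden_power(depth), 3))
```

Changing `min_alt_reads`, `relative_burden`, or `total_capacity` shifts the curve, which is the kind of sensitivity analysis the abstract reports across relative risk, prevalence, singleton burden, and case-control ratio.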

MeSH terms

Exome / genetics*
Genetic Diseases, Inborn*
Genome, Human
Genome-Wide Association Study
Genotype
High-Throughput Nucleotide Sequencing*
Humans
Polymorphism, Single Nucleotide / genetics
Sequence Analysis, DNA / methods*
