Estimating genotype error rates from high-coverage next-generation sequence data

Genome Res. 2014 Nov;24(11):1734-9. doi: 10.1101/gr.168393.113. Epub 2014 Oct 10.

Abstract

Exome and whole-genome sequencing studies are becoming increasingly common, but little is known about the accuracy of the genotype calls made by the commonly used platforms. Here we use replicate high-coverage sequencing of blood and saliva DNA samples from four European-American individuals to estimate lower bounds on the error rates of Complete Genomics and Illumina HiSeq whole-genome and whole-exome sequencing. Error rates for nonreference genotype calls range from 0.1% to 0.6%, depending on the platform and the depth of coverage. Additionally, we found that (1) there was no difference in the error profiles or rates between blood and saliva samples; (2) Complete Genomics sequences had substantially higher error rates than Illumina sequences; (3) error rates were higher (up to 6%) for rare or unique variants; (4) error rates generally declined with genotype quality (GQ) score, but in a nonlinear fashion for the Illumina data, likely due to loss of specificity of GQ scores greater than 60; and (5) error rates increased with increasing depth of coverage for the Illumina data. These findings, especially (3)-(5), suggest that caution is warranted in interpreting the results of next-generation sequencing-based association studies, and even more so in clinical application of this technology in the absence of validation by other more robust sequencing or genotyping methods.
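The abstract does not spell out how the lower bounds were computed, so the following is only an illustrative sketch of the general idea of using replicate sequencing of the same individual: genotype calls that disagree between two replicates imply that at least one call is wrong. Everything here is an assumption made for illustration, not the authors' procedure: the function names, the genotype string encoding ("0/0", "0/1", "1/1"), the restriction to sites where at least one replicate is nonreference, and the simple model in which each call errs independently so that the per-call error rate is at least half the discordance rate. Errors that occur identically in both replicates go undetected, which is why such an estimate can only be a lower bound.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]  # (chromosome, position); hypothetical key type


def nonref_discordance(rep_a: Dict[Site, str], rep_b: Dict[Site, str]) -> float:
    """Fraction of discordant genotype calls among sites called in both
    replicates where at least one of the two calls is nonreference."""
    compared = 0
    discordant = 0
    for site, gt_a in rep_a.items():
        gt_b = rep_b.get(site)
        if gt_b is None:
            continue  # site was not called in both replicates
        if gt_a == "0/0" and gt_b == "0/0":
            continue  # both homozygous reference: outside the nonreference set
        compared += 1
        if gt_a != gt_b:
            discordant += 1
    if compared == 0:
        raise ValueError("no comparable nonreference sites")
    return discordant / compared


def per_call_error_lower_bound(discordance: float) -> float:
    """Assumed model: independent errors with per-call rate e give a
    discordance d of at most 2e, hence e >= d / 2."""
    return discordance / 2.0


# Toy data, invented purely to exercise the functions.
blood = {
    ("chr1", 100): "0/1",
    ("chr1", 200): "1/1",
    ("chr1", 300): "0/0",
    ("chr2", 50): "0/1",
    ("chr2", 75): "0/0",
}
saliva = {
    ("chr1", 100): "0/1",
    ("chr1", 200): "0/1",
    ("chr1", 300): "0/0",
    ("chr2", 50): "0/1",
    ("chr2", 75): "0/1",
    ("chr3", 1): "1/1",
}

d = nonref_discordance(blood, saliva)
print(f"discordance among nonreference sites: {d:.3f}")
print(f"per-call error lower bound: {per_call_error_lower_bound(d):.3f}")
```

In practice the genotypes would be parsed from each replicate's call set, and the same comparison could be stratified by allele frequency, GQ score, or depth of coverage to examine trends like those in findings (3)-(5).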

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Exome / genetics*
  • Gene Frequency
  • Genome, Human / genetics
  • Genomics / methods*
  • Genotype
  • Genotyping Techniques / methods*
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Polymorphism, Single Nucleotide
  • Reproducibility of Results
  • White People / genetics

Associated data

  • dbGaP/PHS000786.V1.P1