DNA sequence confidence estimation

R J Lipshutz; F Taverner; K Hennessy; G Hartzell; R Davis

doi:10.1006/geno.1994.1089

Genomics. 1994 Feb;19(3):417-24. doi: 10.1006/geno.1994.1089.

Authors

R J Lipshutz¹, F Taverner, K Hennessy, G Hartzell, R Davis

Affiliation

¹ Affymetrix, Santa Clara, California 95051.

PMID: 8188283

Abstract

A significant bottleneck in the current DNA sequencing process is the manual editing of trace data generated by automated DNA sequencers. This step is used to correct base calls and to assign a confidence level to each base call. The confidence levels are used during assembly to determine overlaps and to resolve discrepancies when determining the consensus sequence. This single step may cost as much as 4 to 8 cents per finished base. We report an approach to automated trace editing that uses classification trees to detect and exploit context-based patterns in trace peak heights. Local base composition and nearby peak heights account for 80% of the variation in peak heights. Classification algorithms were developed that identify 37% of the automated base calls that differ from the consensus sequence. With these algorithms, 12% of the base calls had confidence levels below 90%.
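The general idea of tree-based confidence estimation can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the features, tree-growing procedure, or training data, so everything below is an assumption made for illustration. The two features (a relative peak height and a ratio to neighboring peak heights), the Gini splitting criterion, the depth limit, and the toy training rows are all hypothetical. Each leaf reports the fraction of training calls in it that agreed with the consensus, and that fraction is used as the confidence level for new calls that fall in the same leaf.

```python
from collections import namedtuple

# Hypothetical feature layout per base call (an assumption, not from the paper):
#   (relative_peak_height, neighbor_height_ratio)
# Label: 1 if the automated call agreed with the consensus, 0 if it differed.
Leaf = namedtuple("Leaf", "confidence")
Node = namedtuple("Node", "feature threshold left right")

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels):
    """Return (weighted_impurity, feature, threshold), or None if no split exists."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # candidate threshold between adjacent values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth):
    """Grow a small classification tree; leaves store the fraction of correct calls."""
    frac_correct = sum(labels) / len(labels)
    if depth == 0 or frac_correct in (0.0, 1.0):
        return Leaf(frac_correct)
    split = best_split(rows, labels)
    if split is None or split[0] >= gini(labels):
        return Leaf(frac_correct)  # no split reduces impurity
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return Node(
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left], depth - 1),
        build_tree([r for r, _ in right], [y for _, y in right], depth - 1),
    )

def confidence(tree, row):
    """Walk the tree to a leaf and return its stored confidence level."""
    while isinstance(tree, Node):
        tree = tree.left if row[tree.feature] <= tree.threshold else tree.right
    return tree.confidence

# Toy training data, invented for illustration only.
train_rows = [
    (1.00, 0.9), (0.90, 1.1), (0.80, 1.0), (0.95, 0.8),
    (0.30, 2.5), (0.25, 3.0), (0.35, 2.0), (0.40, 1.2), (0.28, 2.8),
]
train_labels = [1, 1, 1, 1, 0, 0, 1, 1, 1]

tree = build_tree(train_rows, train_labels, depth=1)

# Flag new calls whose estimated confidence falls below 90%.
new_calls = [(0.90, 1.0), (0.27, 2.9), (0.85, 0.95)]
flagged = [i for i, row in enumerate(new_calls) if confidence(tree, row) < 0.90]
print(flagged)
```

In a setting like the one the abstract describes, calls flagged this way would be the candidates for review or down-weighting during assembly. Raw leaf fractions are overconfident on small samples (a pure leaf reports 100%), so a practical version would need far more training data and some form of smoothing or pruning.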

Publication types

Comparative Study

MeSH terms

Algorithms*
Analysis of Variance
Artifacts
Automation
Consensus Sequence
Cosmids / genetics
Decision Trees
Sequence Alignment
Sequence Analysis, DNA* / economics
Sequence Analysis, DNA* / methods
Sequence Analysis, DNA* / statistics & numerical data