Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data

J S Aaronson; B Eckman; R A Blevins; J A Borkowski; J Myerson; S Imran; K O Elliston

doi:10.1101/gr.6.9.829

Genome Res. 1996 Sep;6(9):829-45. doi: 10.1101/gr.6.9.829.

Authors

J S Aaronson¹, B Eckman, R A Blevins, J A Borkowski, J Myerson, S Imran, K O Elliston

Affiliation

¹ Merck Research Laboratories, Department of Bioinformatics, Rahway, New Jersey 07065, USA. aaronson@merck.com

PMID: 8889550

Abstract

A rigorous analysis of the Merck-sponsored EST data with respect to known gene sequences increases the utility of the data set and helps refine methods for building a gene index. A highly curated human transcript data base was used as a reference data set of known genes. A detailed analysis of EST sequences derived from known genes was performed to assess the accuracy of EST sequence annotation. The EST data was screened to remove low-quality and low-complexity sequences. A set of high-quality ESTs similar to the transcript data base was identified using BLAST; this subset of ESTs was compared with the set of known genes using the Smith-Waterman algorithm. Error rates of several types were assessed based on a flexible match criterion defining sequence identity. The rate of lane-tracking errors is very low, approximately 0.5%. Insert size data is accurate within approximately 20%. Reversed clone and internal priming error rates are approximately 5% and 2.5%, respectively, contributing to the incorrect identification of reads as 3' ends of genes. Follow-up investigation reveals that a significant number of clones, miscategorized as reversed, represent overlapping genes on the opposite strand of entries in the transcript data base. Relevance of these results to the creation of a high-quality index to the human genome capable of supporting diverse genomic investigations is discussed.
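
For readers unfamiliar with the Smith-Waterman comparison step mentioned in the abstract, the following is a minimal sketch of local alignment scoring. It is an illustration only, not the authors' implementation: the function name `smith_waterman_score` and the scoring parameters (match, mismatch, linear gap penalty) are assumptions chosen for the example, and the abstract does not state which parameters or match criterion were actually used.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions, not taken from the paper.
    Uses a linear gap penalty and keeps only two rows of the score matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Hypothetical EST read and reference transcript fragment
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Because every cell is floored at zero, the score reflects the best-matching local region only, which suits comparing a short EST read against a longer known transcript.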

Publication types

Case Reports

MeSH terms

Algorithms
Base Sequence*
Chimera
Chromosome Mapping*
Cloning, Molecular
Databases, Factual*
Female
Genome, Human*
Humans
Infant
Reproducibility of Results
Sequence Tagged Sites*
Transcription, Genetic