Benchmarking PSI-BLAST in genome annotation

J Mol Biol. 1999 Nov 12;293(5):1257-71. doi: 10.1006/jmbi.1999.3233.

Abstract

The recognition of remote protein homologies is a major aspect of the structural and functional annotation of newly determined genomes. Here we benchmark the coverage and error rate of genome annotation using the widely used homology-searching program PSI-BLAST (position-specific iterated basic local alignment search tool). This study evaluates the one-to-many success rate for recognition, as often there are several homologues in the database and only one needs to be identified for annotating the sequence. In contrast, previous benchmarks considered one-to-one recognition, in which a single query was required to find a particular target. The benchmark constructs a model genome from the full sequences of the Structural Classification of Proteins (SCOP) database and searches against a target library of remote homologous domains (<20% identity). The structural benchmark provides a reliable list of correct and false homology assignments. PSI-BLAST successfully annotated 40% of the domains in the model genome that had at least one homologue in the target library. This coverage is more than three times that obtained when one-to-one recognition is evaluated (11% coverage of domains). Although a structural benchmark was used, the results apply equally to purely sequence-based homology searches. Accordingly, structural and sequence assignments were made to the sequences of Mycoplasma genitalium and Mycobacterium tuberculosis (see http://www.bmm.icnet.uk). The extent of missed assignments and of new superfamilies can be estimated for these genomes for both structural and functional annotations.
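
The difference between the two success rates described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `coverage`, the pair-based definition of one-to-one coverage, and the toy identifiers are all assumptions made for illustration. One-to-one coverage is taken here as the fraction of true query-target homologous pairs that the search recovers, and one-to-many coverage as the fraction of queries with at least one true homologue for which at least one is recovered.

```python
def coverage(true_pairs, detected_pairs):
    """Return (one_to_one, one_to_many) coverage for a homology search.

    true_pairs:     set of (query, target) pairs that are genuine homologues
                    (e.g. taken from a structural classification).
    detected_pairs: set of (query, target) pairs reported by the search.
    Both definitions below are illustrative assumptions, not the paper's code.
    """
    if not true_pairs:
        raise ValueError("true_pairs must contain at least one pair")

    # Correct assignments: reported pairs that are genuine homologues.
    correct = true_pairs & detected_pairs

    # One-to-one: every true pair must be found individually.
    one_to_one = len(correct) / len(true_pairs)

    # One-to-many: a query counts as annotated if any one of its
    # true homologues is found.
    queries_with_homologue = {query for query, _ in true_pairs}
    annotated_queries = {query for query, _ in correct}
    one_to_many = len(annotated_queries) / len(queries_with_homologue)

    return one_to_one, one_to_many


# Hypothetical toy data: two query domains with five true homologues in total.
true_pairs = {
    ("d1", "t1"), ("d1", "t2"), ("d1", "t3"),
    ("d2", "t4"), ("d2", "t5"),
}
# The search finds one true homologue of d1 and one false hit for d3.
detected_pairs = {("d1", "t2"), ("d3", "t9")}

one_to_one, one_to_many = coverage(true_pairs, detected_pairs)
print(f"one-to-one: {one_to_one:.2f}, one-to-many: {one_to_many:.2f}")
```

In this toy case only one of five true pairs is recovered, yet one of the two query domains is annotated, which shows why one-to-many coverage can be considerably higher than one-to-one coverage on the same search results.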

MeSH terms

  • Algorithms
  • Bacterial Proteins / chemistry*
  • Bacterial Proteins / classification
  • Bacterial Proteins / genetics
  • Bacterial Proteins / metabolism*
  • Benchmarking
  • Computational Biology*
  • Conserved Sequence
  • Databases, Factual
  • False Positive Reactions
  • Genome, Bacterial*
  • Internet
  • Multigene Family
  • Mycobacterium tuberculosis / chemistry
  • Mycobacterium tuberculosis / enzymology
  • Mycobacterium tuberculosis / genetics
  • Mycoplasma / chemistry
  • Mycoplasma / enzymology
  • Mycoplasma / genetics
  • Open Reading Frames / genetics
  • Sensitivity and Specificity
  • Sequence Alignment
  • Sequence Homology, Amino Acid*
  • Software*
  • Structure-Activity Relationship

Substances

  • Bacterial Proteins