BLAST and FASTA similarity searching for multiple sequence alignment

William R Pearson

doi:10.1007/978-1-62703-646-7_5

Methods Mol Biol. 2014:1079:75-101. doi: 10.1007/978-1-62703-646-7_5.

Author

William R Pearson¹

Affiliation

¹ Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, VA, USA.

PMID: 24170396
DOI: 10.1007/978-1-62703-646-7_5

Abstract

BLAST, FASTA, and other similarity searching programs seek to identify homologous proteins and DNA sequences based on excess sequence similarity. If two sequences share much more similarity than expected by chance, the simplest explanation for the excess similarity is common ancestry, that is, homology. The most effective similarity searches compare protein sequences, rather than DNA sequences, for sequences that encode proteins, and use expectation values, rather than percent identity, to infer homology. The BLAST and FASTA packages provide programs for comparing protein and DNA sequences to protein databases (the most sensitive searches). Protein and translated-DNA comparisons to protein databases routinely allow evolutionary look-back times of 1 to 2 billion years; DNA:DNA searches are 5- to 10-fold less sensitive. BLAST and FASTA can be run on popular web sites, but can also be downloaded and installed on local computers. With local installation, target databases can be customized for the sequence data being characterized. With today's very large protein databases, search sensitivity can also be improved by searching smaller comprehensive databases, for example, a complete protein set from an evolutionarily neighboring model organism. By default, BLAST and FASTA use scoring strategies targeted at distant evolutionary relationships; for comparisons involving short domains or queries, or searches that seek relatively close homologs (e.g., mouse-human), shallower scoring matrices will be more effective. Both BLAST and FASTA provide very accurate statistical estimates, which can be used to reliably identify protein sequences that diverged more than 2 billion years ago.
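
The link between expectation values and database size can be illustrated with a minimal sketch. It assumes the standard Karlin-Altschul relation for a normalized bit score, E = m * n * 2^(-S'), where m is the query length and n is the total number of residues in the database. The function name, the query length, and both database sizes below are hypothetical values chosen for illustration; real BLAST and FASTA runs apply further corrections (such as effective lengths) and will not reproduce these numbers exactly.

```python
def expect_value(bit_score: float, query_len: int, db_residues: int) -> float:
    """Simplified expectation value: E = m * n * 2**(-S').

    bit_score   -- normalized alignment score S' in bits
    query_len   -- query length m, in residues
    db_residues -- total database size n, in residues
    This ignores edge-effect (effective length) corrections.
    """
    return query_len * db_residues * 2.0 ** (-bit_score)


# Hypothetical sizes, for illustration only.
QUERY_LEN = 300
LARGE_DB = 20_000_000_000  # a very large comprehensive protein database
SMALL_DB = 10_000_000      # a complete protein set from a single organism

# The same alignment (same bit score) evaluated against both databases.
e_large = expect_value(50.0, QUERY_LEN, LARGE_DB)
e_small = expect_value(50.0, QUERY_LEN, SMALL_DB)

print(f"E-value, large database: {e_large:.2e}")
print(f"E-value, small database: {e_small:.2e}")
```

Under this simplified model the expectation value scales linearly with database size, so the same alignment becomes more significant when the search is restricted to a smaller database. Percent identity carries no such information about the size of the search.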

MeSH terms

Amino Acid Sequence
Computational Biology / methods*
Data Mining
Databases, Protein
Humans
Molecular Sequence Data
Sequence Alignment / methods*
Sequence Homology, Amino Acid
Software*