Bill Pearson's FASTA Web site: [http://www.people.Virginia.EDU/~wrp/pearson.html]
This tutorial assumes that local copies of the GenBank and GenPept databases are installed. |
This tutorial will introduce the several types
of sequence database searches, using the antifreeze
protein from the sea raven, Hemitripterus
americanus. The cDNA sequence for this gene is found in SRAAFP.gen.
Create a directory called 'db' and save this
sequence to the db directory.
To read this sequence into bldna, choose 'File --> Open', and select SRAAFP.gen.

|
Since this tutorial will demonstrate database searches using
both
DNA and protein sequences as query sequences, we need to generate the
amino acid sequence. This is done in two steps. First, we need to
extract the protein coding sequence from the cDNA. (Remember that mRNAs
contain both 5' and 3' untranslated sequences. If you look at SRAAFP.gen,
you'll see that the CDS (protein coding sequence) runs from positions
10 to 597. We will use the FEATURES program to extract the CDS, and
then translate the CDS using RIBOSOME. |
![]() |
Click on 'Run'. The CDS will appear in a new bldna window.

|
Note that the CDS begins with a start codon, indicating that
this cDNA contains the entire protein coding region. Next translate the CDS into protein. Select SRAAFP:CDS1, and choose DNARNA --> Ribosome. By default, Ribosome sends the output to a new blprotein window. |
![]() |


| First, choose a database. All GenBank divisions are listed, along with a choice to search ALL divisions, or a User-created file in Pearson/FASTA format. Most of the time, it is best to search the taxonomic group from which your query sequence is derived. This is especially true for DNA/DNA searches, since only closely-related DNA sequences are likely to give statistically-significant matches. If you do need to search all of GenBank, this can take more than an hour, depending on the length of the sequence and the K-tuple value (each K-tuple speeds up the search by a factor of 4, but decreases sensitivity). | ![]() |
| Hint: Send
output from long-running
searches to a file.
|
| Hint: For very
closely-related
sequences, speed up the search with K-tuple = 5 or 6.
In some cases, you are specifically looking for sequences that share very high similarity with your query. Suppose you had a cDNA and wanted to find the corresponding genomic clone. In that case, the similarity is likely to be greater than 90%. A K-tuple value of 5 or 6 will give a very fast search at low sensitivity, which is all you need. |

For DNA searches a
very simple scoring scheme is used, in which +1 is added to the score
for a match, and -1 is added to the score for a mismatch. The greater
diversity of amino acids lends itself to more complex scoring schemes.
The menu shows a variety of protein scoring matrices. The Blosum
matrices
were constructed using distantly-related sequences. If you need a
highly
sensitive search, use a low-numbered Blosum matrix. The PAM matrices
were
constructed from a set of closely-related sequences. For a
high-sensitivity
search, use PAM250, or PAM250 Gonnet (based on a more recent dataset.).
For closely-related sequences, use PAM120.
More information of scoring matrices can be found in
Hugh B. Nicholas Jr., David W. Deerfield II., and Alexander J.
Ropelewski, (1998) A Tutorial on Searching Sequence Databases and
Sequence
Scoring Methods Developed by the Biomedical Supercomputing
Initiative
of the Pittsburgh Supercomputing Center.[http://www.nrbsc.org/education/tutorials/sequence/db/].
The results of this protein search are found in SRAAFP.genpept.fasta
and SRAAFP.genpept.fasta.nam.
Hint: Oligopeptides need logarithmic score weightingFor very short query sequences such as oligopeptides, unweighted scores would not exceed the minimum cutoff values needed to appear in the output. The FASTA programs allow you to specify a logarithmic weighting ratio that gives short query sequences higher matches. The score is weighted by the natural log of the query length divided by the natural log of the database size. |

Two versions of
TFASTA are available. TFASTX
allows
3-base insertions (ie. 1 codon), while TFASTY allows 1 or 2 base
insertions.
TFASTY is therefore the most sensitive program, but it is also
slower. TFASTY
would be especially good at detecting similarities between a query
protein
and ESTs, since ESTs typically have more frameshifts than most
sequences.
The results of this protein search are found in SRAAFP.gbvrt.tfasta and SRAAFP.gbvrt.tfasta.nam.
| Note on automatic translation: Programs that translate sequences on the fly (eg. TFASTX, TBLASTN, TBLASTX) have no knowledge whatsoever about gene structure (ie. exons, introns, 5' UTR, 3'UTR). All these programs do is take every group of 3 nucleotides and assign a codon to it. Even stop codons are represented by an asterisk (*). Consequently, non-coding sequences and non-coding open reading frames are translated into meaningless amino acid sequences. |

FASTX and FASTY are the converse of TFASTX and TFASTY, in that they translate a DNA query sequence, allowing either codon-sized gaps (TFASTX) or 1 or 2 base gaps (TFASTY). TFASTY is therefore the slower, but more sensitive. These programs are particularly well-suited for comparing a sequence that may have frameshifts (eg. an EST) with a protein database.
The result of this protein search is found in SRAAFP.genpept.fasty.