BIO4320 Lecture Materials, Prepared by Dr. Hon-Ming Lam

Basic Principles of BLAST Analysis
Additional information can be obtained from the information
pages at www.ncbi.nlm.nih.gov/Blast

Analyzing the Sequenced Genes


• Structure prediction
– Secondary structure of DNA and RNA
– Possible 3-D structure of proteins
• Identity of the encoded gene/gene product
– Prediction of general physical properties (e.g. M.W., pI; may be
important for proteomic analysis)
– Database (e.g. GenBank) search based on sequence homology
• Possible function of the encoded gene product
– Search for signature domains or function motifs using consensus
patterns (based on statistics)
• Possible location of the encoded gene product
– Prediction of subcellular localization by consensus patterns
• Prediction of evolutionary relationship
– Multiple alignment, clustering, etc.
• Gene prediction from genomic sequences
– Prediction for coding regions and location of introns
– Prediction for promoter regions
• Prediction of regulatory sites
– Prediction of consensus cis-acting regulatory elements

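blastn (nucleotide query vs. nucleotide database): good for
finding high-scoring matches; not suited to comparing
distantly related sequences
blastp (protein query vs. protein database): uses a
substitution matrix to find distant relationships; can use
SEG to filter low-complexity regions
blastx (translated nucleotide query vs. protein database):
used for new DNA sequences and analysis of ESTs
tblastn (protein query vs. translated nucleotide database):
searches for coding regions that are not annotated in the
database
tblastx (translated nucleotide query vs. translated
nucleotide database): used for analysis of ESTs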
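The translated programs (blastx, tblastn, tblastx) compare sequences at the protein level by translating nucleotide sequence in all six reading frames. The translation step can be sketched in Python as follows, assuming the standard genetic code; the function names are illustrative and are not part of BLAST itself.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Map frame number (+1..+3, -1..-3) to its protein translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames[1])   # MA*  (forward strand, frame 1)
print(frames[-1])  # LGH  (reverse strand, frame 1)
```

Each of the six translations would then be searched as a protein sequence, which is why these programs can find similarity that is no longer visible at the nucleotide level.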

BLAST Search
• www.ncbi.nlm.nih.gov/Blast
• Basic Local Alignment Search Tool
• Uses a heuristic algorithm that seeks local
(instead of global) alignments; able to detect
relationships among sequences that share
similarity only in isolated regions
• The initial search is done for a word of length
“W” that scores at least “T” when compared to
the query using a substitution matrix
• Word hits are then extended in either
direction in an attempt to generate an
alignment with a score exceeding the
threshold of “S”
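The word-hit and extension steps above can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the actual BLAST implementation: a match/mismatch score stands in for a real substitution matrix, words are compared by brute force rather than through a lookup table, extension is ungapped, and the `x_drop` cutoff and all default values are hypothetical.

```python
def score_pair(a, b, match=5, mismatch=-4):
    """Stand-in for a substitution matrix lookup (assumed scores)."""
    return match if a == b else mismatch


def extend(query, subject, qi, si, step, x_drop):
    """Ungapped extension in one direction; returns (best score, length)."""
    best = running = 0
    best_len = n = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += score_pair(query[qi], subject[si])
        n += 1
        if running > best:
            best, best_len = running, n
        elif best - running > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        si += step
    return best, best_len


def seed_and_extend(query, subject, W=3, T=15, S=30, x_drop=10):
    """Return (score, query_start, subject_start, length) for hits >= S."""
    hits = []
    for q in range(len(query) - W + 1):
        for s in range(len(subject) - W + 1):
            word_score = sum(score_pair(query[q + k], subject[s + k])
                             for k in range(W))
            if word_score < T:
                continue  # not a word hit
            right, rlen = extend(query, subject, q + W, s + W, +1, x_drop)
            left, llen = extend(query, subject, q - 1, s - 1, -1, x_drop)
            total = word_score + left + right
            if total >= S:
                hits.append((total, q - llen, s - llen, W + llen + rlen))
    return sorted(set(hits), reverse=True)


print(seed_and_extend("GATTACAGGC", "CCCCGATTACAGGCTTTT"))
# [(50, 0, 4, 10)]
```

Every word hit along the same diagonal extends to the same alignment here, so the duplicates collapse into a single reported hit.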
Word Size = Word Length = 11

Expect = the statistical significance threshold for reporting
matches against database sequences; the default value is 10,
meaning that 10 matches are expected to be found merely by
chance

$\text{Expect} = K\,m\,n\,e^{-\lambda S}$ [S: raw alignment score]
Bit Score
The bit score S' is derived from the raw alignment score S
by taking into account the statistical properties of the
scoring system used. Because bit scores have been
normalized with respect to the scoring system, they can be
used to compare alignment scores from different searches.
$S' = (\lambda S - \ln K)/\ln 2$ [λ and K are normalizing parameters]
E Value
Expectation value. The number of different
alignments with scores equivalent to or better than S’
that are expected to occur in a database search by
chance. The lower the E value, the more significant
the score.
$E = m\,n\,2^{-S'}$ [m: effective length of the query;
n: total number of bases of the database]
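The two definitions fit together: substituting S' = (λS − ln K)/ln 2 into E = mn·2^(−S') gives E = K·m·n·e^(−λS). A short Python sketch makes this concrete. The values of λ, K, m and n below are illustrative assumptions only; in practice they depend on the scoring system, the query and the database.

```python
import math


def bit_score(raw_score, lam, K):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def e_value_from_raw(raw_score, lam, K, m, n):
    """E = K * m * n * e^(-lambda * S), equivalent to the bit-score form."""
    return K * m * n * math.exp(-lam * raw_score)


# Hypothetical parameters, chosen only for illustration.
lam, K = 0.267, 0.041
m, n = 250, 1_000_000_000

for raw in (60, 100, 140):
    bits = bit_score(raw, lam, K)
    print(raw, round(bits, 1), e_value_from_bits(bits, m, n))
```

Running the loop shows the E value dropping sharply as the raw score rises, which is the sense in which a lower E value means a more significant score.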

CDD Search
Compares protein sequences to the
Conserved Domain Database. The CDD
is a database containing a collection of
functional and/or structural domains
derived from two popular collections,
SMART and Pfam, plus contributions from
colleagues at NCBI.

PSI-BLAST
Position-specific iterative BLAST refers to a feature
of BLAST 2.0 in which a profile (or position-specific
scoring matrix, PSSM) is constructed automatically
from a multiple alignment of the highest-scoring hits
in an initial BLAST search. The PSSM is generated by
calculating position-specific scores for each position
in the alignment. Highly conserved positions receive
high scores and weakly conserved positions receive
scores near zero. The profile is used to perform a
second (etc.) BLAST search, and the results of each
"iteration" are used to refine the profile. This iterative
searching strategy results in increased sensitivity.

PSSM
Position-specific scoring matrix. Based
on a profile (a table that lists the
frequencies of each amino acid at each
position of a protein sequence; the frequencies
are calculated from multiple alignments of
sequences containing a domain of
interest). The PSSM gives the log-odds
score for finding a particular matching
amino acid in a target sequence.
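As an illustration of how a profile becomes log-odds scores, the Python sketch below builds a PSSM from a toy ungapped alignment. It is a minimal sketch under simplifying assumptions: a uniform background frequency for all 20 amino acids and a flat pseudocount, neither of which is what PSI-BLAST actually uses; the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, background=None, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(background)
        column = {}
        for aa, bg in background.items():
            freq = (counts[aa] + pseudocount) / total  # profile frequency
            column[aa] = math.log2(freq / bg)          # log-odds score
        pssm.append(column)
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of seq against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


alignment = ["MKV", "MKI", "MRV", "MKV"]  # toy alignment, equal lengths
pssm = build_pssm(alignment)
print(round(pssm[0]["M"], 2), round(pssm[0]["A"], 2))
print(score_sequence(pssm, "MKV") > score_sequence(pssm, "AAA"))  # True
```

The fully conserved first column gives M a clearly positive score and every other residue a slightly negative one, while the variable third column yields scores closer to zero.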