This article is about Bioinformatics. For the disease in horses known by the acronym
"PSSM", see Equine polysaccharide storage myopathy.
Contents
• 1 Basic PWM with log-likelihoods
• 2 Incorporating background distribution
• 4 Using PWMs
• 5 References
• 6 External links
$-\log(p_{i,j})$
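Finally, the IC of the PWM is the sum of the expected self-information of every
element, where the expected self-information of an element is its probability
multiplied by its self-information:
$$\mathrm{IC} = -\sum_{i,j} p_{i,j} \log(p_{i,j})$$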
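As a minimal sketch of the sum just described, the following Python function adds up -p·log(p) over every element of a probability matrix. The base-2 logarithm (giving bits), the function name and the two-column example matrix are assumptions made for illustration; the text does not fix a logarithm base.

```python
import math

def information_content(pwm):
    """Sum of -p * log2(p) over every element of a probability matrix.

    `pwm` is a list of columns; each column maps a letter to its probability.
    Elements with probability zero are skipped, since p * log(p) tends to 0
    as p tends to 0.
    """
    total = 0.0
    for column in pwm:
        for p in column.values():
            if p > 0:
                total -= p * math.log2(p)
    return total

# Hypothetical two-column matrix, for illustration only.
example = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uniform column
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},      # fully conserved column
]

ic = information_content(example)
print(ic)
```

With this definition the uniform column contributes the most and the fully conserved column contributes nothing.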
Often, it is more useful to calculate the information content using the background
letter frequencies of the sequences under study rather than assuming equal
probabilities for each letter. For example, the GC-content of the DNA of thermophilic
bacteria ranges from 65.3% to 70.8% [2], so a motif of ATAT would contain much more
information than a motif of CCGG. The equation for information content thus
becomes
Using PWMs
There are various algorithms for scanning sequences for hits of PWMs. One example is
the MATCH™ algorithm [3], which has been implemented in ModuleMaster [4]. More
sophisticated algorithms for fast database searching with nucleotide as well as
amino acid PWMs/PSSMs are implemented in the PoSSuMsearch software and are
described in [5].
In bioinformatics, scoring matrices for computing alignment scores are often based
on observed substitution rates, derived from the substitution frequencies seen in
multiple alignments of sequences. Every possible identity and substitution is
assigned a score based on the observed frequency of such occurrences in
alignments of related proteins. The score is calculated from how often the two
individual amino acids are matched in evolutionarily related sequences, and it
measures how likely the two amino acids are to have been aligned by chance.
The score also reflects how frequently a particular amino acid occurs in
nature, as some amino acids are more abundant than others. Higher scores indicate
that the probability that the two amino acids aligned by chance is very small;
lower scores indicate a high probability that the two amino acids aligned by chance
and are evolutionarily unrelated. Thus, identities are assigned the most positive scores
and frequently observed substitutions also receive positive scores, whereas matches that are
unlikely to have been a result of evolution, and are more likely indicative of
unrelatedness at that position, are given negative scores. Matrices with scoring
schemes based on observed substitution rates are superior to simple identity
scores, or scores based solely on sidechain moiety similarity.
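One common way to obtain scores with the behaviour described above is the log-odds form: the logarithm of the observed frequency of a pair divided by the frequency expected if the two amino acids were paired by chance. The sketch below assumes that form and a base-2 logarithm; the function name and all frequencies are hypothetical and are not taken from any published matrix.

```python
import math

def log_odds_score(pair_freq, freq_a, freq_b):
    """log2 of the observed pair frequency over the chance expectation.

    pair_freq: how often the two residues are aligned in related sequences.
    freq_a, freq_b: overall (background) frequencies of each residue.
    """
    return math.log2(pair_freq / (freq_a * freq_b))

# Hypothetical frequencies, for illustration only.
# A pair seen more often than chance predicts gets a positive score.
frequent_pair = log_odds_score(pair_freq=0.02, freq_a=0.05, freq_b=0.05)
# A pair seen less often than chance predicts gets a negative score.
rare_pair = log_odds_score(pair_freq=0.000625, freq_a=0.05, freq_b=0.05)

print(frequent_pair, rare_pair)
```

Dividing by the product of the background frequencies is what makes the score account for some amino acids being more abundant than others.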
An Introduction to Position Specific Scoring Matrices
by Roderic Guigo, IMIM/UPF/CRG, Barcelona
DISCLAIMER: This document is only an exercise in JavaScript. There are bugs. It
has only been tested on Netscape clients (version 3 and higher) running on Silicon
Graphics.
A Profile or Position Weight Matrix (the two terms are used synonymously here) is a
motif descriptor. It attempts to capture the intrinsic variability characteristic of
sequence patterns. A Profile is usually derived from a set of aligned, functionally
related sequences. For instance, below we have the sequences of ten vertebrate donor
sites, aligned at the exon/intron boundary.
sequence 1: GAGGTAA
sequence 2: TCCGTAAG
sequence 3: CAGGTTGG
sequence 4: ACAGTCA
sequence 5: TAGGTCAT
sequence 6: TAGGTACT
sequence 7: ATGGTAA
sequence 8: CAGGTATA
sequence 9: TGTGTGAG
sequence 10: AAGGTAA
Position Weight Matrix
We derive a Profile from the above set of sequences by tabulating the frequency with
which each nucleotide is observed at each position. Click on "Calculate Matrix" above
to obtain these observed frequencies.
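Formally, from a set S of n aligned sequences of length l, $s_1, \ldots, s_n$, where
$s_k = s_{k1}, \ldots, s_{kl}$ (each $s_{kj}$ being one of {A, C, G, T} in the case of DNA
sequences), a Position Weight Matrix $M_{4 \times l}$ is derived as
$$M_{ij} = \sum_{k=1}^{n} \delta(s_{kj}, i), \qquad i \in \{A, C, G, T\},\quad j = 1, \ldots, l$$
where $\delta(a, b)$ is 1 if $a = b$ and 0 otherwise.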
Each coefficient in this matrix indicates the number of times that a given nucleotide
has been observed at a given position. For instance, the nucleotide "A" has been
observed in three of the aligned sequences at position 1, and this is what the
matrix shows. Note also that in this case two positions are absolutely conserved: positions 4
and 5, corresponding to the mandatory dinucleotide GT at the beginning of the intron.
Of course, different sets of aligned sequences result in different profiles. You can play
with the input sequences and see how the profile changes when the aligned sequences
change.
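The tabulation described above can be sketched in a few lines of Python. The sequences are copied as they are listed in this document; because they are listed with differing lengths, the sketch counts only the first seven columns, which every listed sequence covers. The function and variable names are illustrative assumptions.

```python
# The ten donor sites as listed in this document.
SEQUENCES = [
    "GAGGTAA", "TCCGTAAG", "CAGGTTGG", "ACAGTCA", "TAGGTCAT",
    "TAGGTACT", "ATGGTAA", "CAGGTATA", "TGTGTGAG", "AAGGTAA",
]
ALPHABET = "ACGT"

def count_matrix(sequences, length):
    """Return {nucleotide: [count at position 1, ..., count at position length]}."""
    counts = {nt: [0] * length for nt in ALPHABET}
    for seq in sequences:
        for j in range(length):
            counts[seq[j]][j] += 1
    return counts

# Only the first seven columns are present in every listed sequence.
counts = count_matrix(SEQUENCES, length=7)
for nt in ALPHABET:
    print(nt, counts[nt])
```

The row for "A" starts with 3, and positions 4 and 5 hold the full count of ten for G and T respectively, matching the observations in the text.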
More often than the absolute frequencies, the relative frequencies are tabulated in a
profile. In such a case, the coefficients of the matrix can be interpreted as probabilities
of a given nucleotide occurring at a given position in a functional site. Then, given a
sequence of length l, the product of the coefficients of the matrix corresponding
to each nucleotide at each position of the sequence is the probability of finding that
sequence in a true functional site. For instance, the probability of finding the sequence
CAGGTTGGA in the functional site described by the matrix above (assuming that
you have not changed the original input sequences) is
0.20 × 0.59 × 0.69 × 1 × 1 × 0.10 × 0.10 × 0.5 × 0.10. The probability of finding the
same sequence in a random site, in contrast, is the product of the "a priori" probabilities of the
corresponding nucleotides. For instance, if we assume that all nucleotides are equally
probable, this probability is simply $0.25^9$. The ratio between the probability of a
sequence in a functional site and the probability of the sequence in a random site is a
likelihood ratio, and its logarithm is a log-likelihood ratio. The log-likelihood ratio is equal to zero if
the sequence is as likely to appear in a functional site as in a random
site, greater than zero if the sequence is more likely to be found in a functional site
than in a random site, and smaller than zero in the opposite case.
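The arithmetic for the example sequence CAGGTTGGA can be checked with a short Python sketch. The nine probabilities are the ones quoted in the text; the base-2 logarithm is an assumption, since the text does not specify a base (the sign of the result does not depend on it).

```python
import math

# Per-position probabilities for CAGGTTGGA, as quoted in the text.
site_probs = [0.20, 0.59, 0.69, 1, 1, 0.10, 0.10, 0.5, 0.10]

# Probability of the sequence in a functional site: product of the coefficients.
p_site = math.prod(site_probs)

# Probability in a random site, assuming all four nucleotides are equiprobable.
p_random = 0.25 ** len(site_probs)

# Log-likelihood ratio (base 2 is an assumption of this sketch).
log_ratio = math.log2(p_site / p_random)

print(p_site, p_random, log_ratio)
```

The log-likelihood ratio comes out positive, so under these numbers the sequence is more likely to be found in a functional site than in a random one.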
Often, therefore, the coefficients in a Position Weight Matrix are directly computed as
log-likelihood values according to the transformation $\log(M_{ij}/p_i)$, where $M_{ij}$ is
the probability of nucleotide i at position j in the matrix M, and $p_i$ is the background
probability of nucleotide i. The background probability of a nucleotide can be taken
to be an "a priori" probability, the frequency of the nucleotide in the whole sequences
used to derive the matrix, or the frequency in the aligned region from which the matrix
is actually derived. Then, given a sequence of length l, the above log-likelihood ratio can
be computed by summing the coefficients of the log-likelihood matrix corresponding
to each nucleotide at each position of the sequence.
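A minimal sketch of this transformation, and of scoring a sequence by summing the transformed coefficients, might look as follows. The three-position relative-frequency matrix is hypothetical, the logarithm is taken in base 2, and mapping a zero frequency to negative infinity is a choice made for this sketch; the text does not say how zero frequencies should be handled.

```python
import math

def log_likelihood_matrix(freqs, background):
    """Transform each M[i][j] into log2(M[i][j] / p[i]).

    A zero frequency is mapped to -inf (an assumption of this sketch).
    """
    return {
        nt: [math.log2(f / background[nt]) if f > 0 else float("-inf")
             for f in row]
        for nt, row in freqs.items()
    }

def log_likelihood_ratio(llm, sequence):
    """Sum the coefficients for each nucleotide at each position."""
    return sum(llm[nt][j] for j, nt in enumerate(sequence))

# Hypothetical relative-frequency matrix with three positions.
freqs = {
    "A": [0.5, 0.25, 0.0],
    "C": [0.25, 0.25, 0.0],
    "G": [0.125, 0.25, 1.0],
    "T": [0.125, 0.25, 0.0],
}
# Equiprobable background, as assumed in the text's example.
background = {nt: 0.25 for nt in "ACGT"}

llm = log_likelihood_matrix(freqs, background)
print(log_likelihood_ratio(llm, "ACG"))
```

Summing logarithms is equivalent to taking the logarithm of the product of ratios, which is why the sum over the matrix gives the log-likelihood ratio of the whole sequence.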
Below you can see how the absolute frequencies matrix that you have originally
derived is transformed to a relative frequencies matrix, or a log likelihood matrix
(assuming all nucleotides equiprobable).
Relative Frequencies
Once a Profile has been derived from a set of functionally related sites, the Profile can
be used to scan a query sequence for the presence of potential sites. Usually you slide a
window the length of the matrix along the sequence, and sum the coefficients of the
matrix corresponding to each nucleotide at each position of the window sequence.
Formally, the score of a matrix M for a site s of length l ($s = s_1, \ldots, s_l$, with each $s_k$ being one
of {A, C, G, T}) is computed as
$$\mathrm{score}(M, s) = \sum_{k=1}^{l} M_{s_k, k}$$
You can use any form of the above matrix to search for occurrences of the motif in a
given sequence, but if you use the log-likelihood matrix, the scores that you
obtain are log-likelihood ratios. You can use the sequence below or your own
sequence, and see how the scores at each position in the sequence are calculated.
As a result of scanning the sequence with the matrix, you obtain a score at each
position. Click on 'ScanSequence' for the whole list of scores along the sequence.
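The sliding-window scan can be sketched in Python as follows. The matrix holds the absolute counts tabulated from the first seven columns of the ten donor sites listed earlier; the query sequence and the function name are hypothetical and chosen only for illustration.

```python
def scan(matrix, sequence):
    """Score every window of the matrix width along the sequence.

    Returns a list of (start, window, score) tuples; start is 1-based.
    """
    width = len(next(iter(matrix.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the coefficient for each nucleotide at each window position.
        score = sum(matrix[nt][j] for j, nt in enumerate(window))
        hits.append((start + 1, window, score))
    return hits

# Absolute counts from the first seven columns of the ten donor sites.
COUNTS = {
    "A": [3, 6, 1, 0, 0, 6, 7],
    "C": [2, 2, 1, 0, 0, 2, 1],
    "G": [1, 1, 7, 10, 0, 1, 1],
    "T": [4, 1, 1, 0, 10, 1, 1],
}

# Hypothetical query sequence.
hits = scan(COUNTS, "ACTAGGTAAGC")
for start, window, score in hits:
    print(start, window, score)

best = max(hits, key=lambda hit: hit[2])
```

Passing a log-likelihood matrix instead of the count matrix would make each score a log-likelihood ratio, as noted in the text; the scanning loop itself would not change.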