
Position-specific scoring matrix

From Wikipedia, the free encyclopedia

This article is about Bioinformatics. For the disease in horses known by the acronym
"PSSM", see Equine polysaccharide storage myopathy.

A position weight matrix (PWM), also called position-specific weight matrix
(PSWM) or position-specific scoring matrix (PSSM), is a commonly used
representation of motifs (patterns) in biological sequences.[1]

A PWM is a matrix of score values that gives a weighted match to any
given substring of fixed length. It has one row for each symbol of the alphabet, and
one column for each position in the pattern. The score assigned by a PWM to
a substring $s = s_1, \dots, s_N$ is defined as $\sum_{j=1}^{N} m_{s_j,j}$, where $j$
represents position in the substring, $s_j$ is the symbol at position $j$ in the
substring, and $m_{\alpha,j}$ is the score in row $\alpha$, column $j$ of the matrix. In
other words, a PWM score is the sum of position-specific scores for each symbol in
the substring.
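
The definition above can be sketched in a few lines of Python. The matrix values below are made up purely for illustration, and the names `pwm` and `pwm_score` are hypothetical; only the summation itself comes from the definition.

```python
# Hypothetical 4 x 3 PWM over the DNA alphabet: one row per symbol,
# one column per pattern position. The numbers are illustrative only.
pwm = {
    "A": [1.0, -0.5, 0.2],
    "C": [-1.0, 0.8, -0.3],
    "G": [0.3, -0.2, 1.1],
    "T": [-0.4, 0.1, -0.9],
}


def pwm_score(pwm, substring):
    """Sum the score in row s_j, column j for every position j of the substring."""
    width = len(next(iter(pwm.values())))
    if len(substring) != width:
        raise ValueError("substring length must equal the number of PWM columns")
    return sum(pwm[symbol][j] for j, symbol in enumerate(substring))


# Score of "ACG": row A at column 0, row C at column 1, row G at column 2.
print(pwm_score(pwm, "ACG"))
```

Storing the matrix as one list per symbol keeps the row and column lookup close to the notation used in the text.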

Contents

• 1 Basic PWM with log-likelihoods

• 2 Incorporating background distribution

• 3 Information content of a PWM

• 4 Using PWMs

• 5 References

• 6 External links

Basic PWM with log-likelihoods

A PWM assumes independence between positions in the pattern, as it calculates
scores at each position independently from the symbols at other positions. The
score of a substring aligned with a PWM can be interpreted as the log-likelihood of
the substring under a product multinomial distribution. Each column defines
log-likelihoods for the different symbols, and the likelihoods in a column sum to
one, so each column corresponds to a multinomial distribution. A PWM's score is
the sum of log-likelihoods, which corresponds to the product of likelihoods, so the
score as a whole corresponds to a product-multinomial distribution. The PWM
scores can also be interpreted in a physical framework as the sum of binding
energies for all nucleotides (symbols of the substring) aligned with the PWM.

Incorporating background distribution

Instead of using log-likelihood values in the PWM, as described in the previous
paragraph, several methods use log-odds scores in the PWMs. An element in a PWM
is then calculated as $m_{i,j} = \log(p_{i,j} / b_i)$, where $p_{i,j}$ is the probability of
observing symbol $i$ at position $j$ of the motif, and $b_i$ is the probability of
observing the symbol $i$ in a background model. The PWM score then corresponds to
the log-odds of the substring being generated by the motif versus being generated
by the background, in a generative model of the sequence.
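
As a minimal sketch of this transformation, the snippet below converts a table of motif probabilities into log-odds scores. The probabilities, the uniform background, the base-2 logarithm, and the function name `log_odds_matrix` are all assumptions made for illustration; the text does not fix a logarithm base.

```python
import math


def log_odds_matrix(probabilities, background):
    """Compute m[i][j] = log2(p[i][j] / b[i]) for every symbol i and position j."""
    matrix = {}
    for symbol, row in probabilities.items():
        b = background[symbol]
        # log(0) is undefined, so a symbol never seen at a position gets -inf here.
        matrix[symbol] = [math.log2(p / b) if p > 0 else float("-inf") for p in row]
    return matrix


# Hypothetical motif probabilities for two positions (each column sums to 1).
probabilities = {"A": [0.7, 0.1], "C": [0.1, 0.1], "G": [0.1, 0.7], "T": [0.1, 0.1]}
# Assumed uniform background model.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

m = log_odds_matrix(probabilities, background)
# Log-odds of the substring "AG" under the motif versus the background.
score_ag = m["A"][0] + m["G"][1]
print(score_ag)
```

A positive entry marks a symbol that is more probable in the motif than in the background at that position, and a negative entry marks the opposite.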

Information content of a PWM

The information content (IC) of a PWM is sometimes of interest, as it says something
about how different a given PWM is from a uniform distribution.

The self-information of observing a particular symbol at a particular position of the
motif is:

$-\log(p_{i,j})$

The expected (average) self-information of a particular element in the PWM is then:

$-p_{i,j} \cdot \log(p_{i,j})$

Finally, the IC of the PWM is then the sum of the expected self-information of every
element:

$-\sum_{i,j} p_{i,j} \cdot \log(p_{i,j})$

Often, it is more useful to calculate the information content with the background
letter frequencies of the sequences you are studying rather than assuming equal
probabilities of each letter (e.g., the GC-content of DNA of thermophilic bacteria
ranges from 65.3% to 70.8%[2], thus a motif of ATAT would contain much more
information than a motif of CCGG). The equation for information content thus
becomes

$\sum_{i,j} p_{i,j} \cdot \log(p_{i,j} / p_b)$

where $p_b$ is the background frequency for that letter.
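
The ATAT versus CCGG comparison can be checked numerically. The sketch below assumes the background-corrected information content takes the relative-entropy form, that is, the sum over all matrix elements of p · log2(p / p_b). The background frequencies (70% GC, split evenly) and the helper names are hypothetical.

```python
import math


def one_hot_motif(motif, alphabet="ACGT"):
    """Probabilities for a motif in which each position always shows one fixed letter."""
    return {a: [1.0 if letter == a else 0.0 for letter in motif] for a in alphabet}


def information_content(probabilities, background):
    """Sum of p * log2(p / p_b) over every matrix element, in bits."""
    total = 0.0
    for symbol, row in probabilities.items():
        for p in row:
            if p > 0:  # by convention, an element with p = 0 contributes nothing
                total += p * math.log2(p / background[symbol])
    return total


# Assumed GC-rich background: 70% GC, split evenly between the two letters.
gc_rich = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}

ic_atat = information_content(one_hot_motif("ATAT"), gc_rich)
ic_ccgg = information_content(one_hot_motif("CCGG"), gc_rich)
print(ic_atat, ic_ccgg)
```

Under these assumptions the ATAT motif scores higher than CCGG, because A and T are the rarer letters in the assumed background.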

Using PWMs

There are various algorithms to scan for hits of PWMs in sequences. One example is
the MATCH™ algorithm,[3] which has been implemented in ModuleMaster.[4] More
sophisticated algorithms for fast database searching with nucleotide as well as
amino acid PWMs/PSSMs are implemented in the possumsearch software and are
described in [5].

In bioinformatics, scoring matrices for computing alignment scores are often based
on observed substitution rates, derived from the substitution frequencies seen in
multiple alignments of sequences. Every possible identity and substitution is
assigned a score based on the observed frequencies of such occurrences in
alignments of related proteins. The score is calculated from the frequency with
which the two amino acids are matched in evolutionarily related sequences, and
provides a measure of how likely the two amino acids are to align by chance.
The score also reflects how frequently a particular amino acid occurs in
nature, as some amino acids are more abundant than others. Higher scores indicate
that the probability that the two amino acids aligned by chance is very small, and
lower scores indicate a high probability that the two amino acids aligned by chance
and are evolutionarily unrelated. Thus, identities are assigned the most positive
scores, and frequently observed substitutions also receive positive scores, but
matches that are unlikely to have resulted from evolution, and more likely indicate
unrelatedness at that position, are given negative scores. Matrices with scoring
schemes based on observed substitution rates are superior to simple identity
scores, or scores based solely on side-chain similarity.

An Introduction to Position Specific Scoring Matrices
by Roderic Guigo, IMIM/UPF/CRG, Barcelona
DISCLAIMER: This document is only an exercise in JavaScript. There are bugs. It
has only been tested on Netscape clients (version 3 and higher) running on Silicon
Graphics.
A Profile or Position Weight Matrix (the two terms are used synonymously here) is a
motif descriptor. It attempts to capture the intrinsic variability characteristic of
sequence patterns. A Profile is usually derived from a set of aligned, functionally
related sequences. For instance, below we have the sequences of ten vertebrate donor
sites, aligned at the exon/intron boundary.

sequence 1: GAGGTAA
sequence 2: TCCGTAAG
sequence 3: CAGGTTGG
sequence 4: ACAGTCA
sequence 5: TAGGTCAT
sequence 6: TAGGTACT
sequence 7: ATGGTAA
sequence 8: CAGGTATA
sequence 9: TGTGTGAG
sequence 10: AAGGTAA

We derive a Profile from the above set of sequences by tabulating the frequency with
which each nucleotide is observed at each position. Click on "Calculate Matrix" above
to obtain such observed frequencies.
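Formally, from a set $S$ of $n$ aligned sequences of length $l$, $s_1, \dots, s_n$, where
$s_k = s_{k1}, \dots, s_{kl}$ (the $s_{kj}$ being one of {A, C, G, T} in the case of DNA
sequences), a Position Weight Matrix $M_{4 \times l}$ is derived as

$M_{ij} = \sum_{k=1}^{n} I(s_{kj} = i), \quad i \in \{A, C, G, T\},\ j = 1, \dots, l$

where $I(\cdot)$ equals 1 if its argument is true and 0 otherwise.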
Each coefficient in this matrix indicates the number of times that a given nucleotide
has been observed at a given position. For instance, the nucleotide "A" has been
observed in three of the aligned sequences in position 1, and this is indicated in the
matrix. Note also that in this case two positions are absolutely conserved: positions 4
and 5, corresponding to the mandatory dinucleotide GT at the beginning of the intron.
Of course, different sets of aligned sequences result in different profiles. You can play
with the input sequences, and see how the profile changes when the aligned sequences
change.
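
The tabulation can be sketched as follows. The sequences are taken as they appear in the list above, and, because some entries show seven letters and others eight, this illustration restricts itself to the first seven positions of each. The function name `count_matrix` is hypothetical.

```python
# Donor-site sequences as listed in the text; only the first seven positions
# of each are used here so that all sites have equal length (an assumption
# made for this sketch).
listed = [
    "GAGGTAA", "TCCGTAAG", "CAGGTTGG", "ACAGTCA", "TAGGTCAT",
    "TAGGTACT", "ATGGTAA", "CAGGTATA", "TGTGTGAG", "AAGGTAA",
]
sites = [s[:7] for s in listed]


def count_matrix(sites, alphabet="ACGT"):
    """Count how often each nucleotide is observed at each aligned position."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    counts = {a: [0] * length for a in alphabet}
    for site in sites:
        for j, nucleotide in enumerate(site):
            counts[nucleotide][j] += 1
    return counts


counts = count_matrix(sites)
for nucleotide, row in counts.items():
    print(nucleotide, row)
```

With these inputs, "A" is counted three times at position 1, and positions 4 and 5 contain only G and T respectively, in line with the observations in the text.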
More often than the absolute frequencies, the relative frequencies are tabulated in a
profile. In such a case, the coefficients of the matrix can be interpreted as probabilities
of a given nucleotide occurring at a given position in a functional site. Then, given a
sequence of length l, the product of the coefficients from such a matrix corresponding
to each nucleotide in each position of the sequence is the probability of finding such a
sequence in a true functional site. For instance, the probability of finding the sequence
CAGGTTGGA in the functional site described by the matrix above (assuming that
you have not changed the original input sequences) is
0.20 × 0.60 × 0.70 × 1 × 1 × 0.10 × 0.10 × 0.50 × 0.10. The probability of finding such a
sequence in a random site, in turn, is the product of the "a priori" probabilities of the
corresponding nucleotides. For instance, if we assume that all nucleotides are equally
probable, such a probability is simply 0.25^9. The ratio between the probability of a
sequence in a functional site and the probability of a sequence in a random site is a
likelihood ratio, and its logarithm a log-likelihood ratio. The log-likelihood ratio is
equal to zero if a sequence has the same probability of appearing in a functional site
as in a random site, is greater than zero if the sequence is more likely to be found in a
functional site than in a random site, and smaller than zero the other way around.
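
The arithmetic for CAGGTTGGA can be worked through as a short sketch. The per-position counts below are the ones implied by the probabilities quoted in the text (out of ten aligned sequences), the background is the equiprobable one assumed in the text, and the base-2 logarithm is an arbitrary choice made here.

```python
import math

# Number of the ten aligned sequences that agree with CAGGTTGGA at each of its
# nine positions, as implied by the probabilities quoted in the text.
matching_counts = [2, 6, 7, 10, 10, 1, 1, 5, 1]
n_sequences = 10

# Probability of the sequence in a functional site: product of relative frequencies.
p_site = math.prod(c / n_sequences for c in matching_counts)

# Probability in a random site, assuming all four nucleotides are equally probable.
p_random = 0.25 ** len(matching_counts)

likelihood_ratio = p_site / p_random
log_likelihood_ratio = math.log2(likelihood_ratio)  # base 2 chosen for this sketch

print(p_site, p_random, likelihood_ratio, log_likelihood_ratio)
```

The ratio comes out greater than one, so its logarithm is positive, which the text reads as the sequence being more likely in a functional site than in a random one.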
Often, thus, the coefficients in a Position Weight Matrix are directly computed as
log-likelihood values according to the transformation $\log(M_{ij}/p_i)$, where $M_{ij}$ is
the probability of nucleotide $i$ at position $j$ in the matrix $M$, and $p_i$ is the
background probability of nucleotide $i$. The background probability of a nucleotide
can be assumed to be an "a priori" probability, the frequency of the nucleotide in the
whole sequences used to derive the matrix, or the frequency in the aligned region from
which the matrix is actually derived. Then, given a sequence of length l, the above
log-likelihood ratio can be computed by summing the coefficients of the log-likelihood
matrix corresponding to each nucleotide in each position of the sequence.
Below you can see how the absolute frequencies matrix that you have originally
derived is transformed to a relative frequencies matrix, or a log likelihood matrix
(assuming all nucleotides equiprobable).


Once a Profile has been derived from a set of functionally related sites, the Profile can
be used to scan a query sequence for the presence of potential sites. Usually you run a
window the length of the matrix along the sequence, and sum the coefficients from the
matrix corresponding to each nucleotide in each position of the window sequence.
Formally, the score of a matrix $M$ for a site $s$ of length $l$ ($s = s_1, \dots, s_l$, with
each $s_k$ being one of {A, C, G, T}) is computed as

$\mathrm{score}(M, s) = \sum_{j=1}^{l} M_{s_j, j}$

You can use any form of the above matrix to search for occurrences of the motif in a
given sequence, but if you use the log-likelihood matrix, the scores that you will
obtain are log-likelihood ratios. You can use the sequence below or your own
sequence, and see how the scores along each position in the sequence are calculated.
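
A sliding-window scan of this kind might look like the sketch below. The three-column matrix and the query sequence are invented for illustration, and the name `scan` is hypothetical; only the windowing and summation follow the description above.

```python
# Hypothetical 3-column scoring matrix that favours the site "ACG".
matrix = {
    "A": [2, -1, -1],
    "C": [-1, 2, -1],
    "G": [-1, -1, 2],
    "T": [-1, -1, -1],
}


def scan(matrix, sequence):
    """Score every window of matrix width along the sequence, left to right."""
    width = len(next(iter(matrix.values())))
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the coefficient for each nucleotide at its position in the window.
        scores.append(sum(matrix[nt][j] for j, nt in enumerate(window)))
    return scores


print(scan(matrix, "TACGAC"))  # prints [-3, 6, -3, -3]
```

The list has one score per window start, so a sequence of length L scanned with a matrix of width l yields L - l + 1 scores.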


As a result of scanning the sequence with the matrix, you obtain a score at each
position. Click on 'ScanSequence' for the whole list of scores along the sequence.

Sometimes you may want to plot the scores graphically.

But usually you are only interested in the scores over a given threshold. Often, you set
such a threshold at the minimum value scored by the sequences from which the profile
has been derived. Using this threshold, you obtain a reduced list of matches and a
clearer plot.

Of course, you can play with the threshold and increase or decrease the number of
potential matches, with a corresponding trade-off between sensitivity and specificity.
(Change the value of the threshold and click on "ScanSequence" afterwards.)
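
The thresholding rule described above can be sketched as follows. The matrix, the two "training" sites, and the query sequence are all hypothetical, as are the function names; the only element taken from the text is the choice of threshold as the minimum score among the sites from which the profile was derived.

```python
# Hypothetical 3-column scoring matrix and the sites it is assumed to derive from.
matrix = {
    "A": [2, -1, -1],
    "C": [-1, 2, -1],
    "G": [-1, -1, 2],
    "T": [-1, -1, -1],
}
training_sites = ["ACG", "ACT"]


def site_score(matrix, site):
    """Sum the matrix coefficient for each nucleotide at its position in the site."""
    return sum(matrix[nt][j] for j, nt in enumerate(site))


def hits_above(matrix, sequence, threshold):
    """Return (start, score) for every window scoring at or above the threshold."""
    width = len(next(iter(matrix.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        score = site_score(matrix, sequence[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits


# Threshold: the minimum score among the sites used to build the profile.
threshold = min(site_score(matrix, site) for site in training_sites)
print(threshold)                                  # prints 3
print(hits_above(matrix, "TACGACTT", threshold))  # prints [(1, 6), (4, 3)]
```

Start positions here are 0-based. Raising the threshold above this minimum would drop the weaker hit, and lowering it would admit more windows.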
