Basics in Bioinformatics
Heping Zhang
9/24/2007
Sequence alignment
Sequencing and sequence assembly
Microarray data analysis
Sequence Alignment
[Figure: alignment of V and W, with columns labeled as matches, mismatches, and indels (insertions and deletions); counts shown: 4, 1, 2, 2]
V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)
[Figure: matrix M comparing the two sequences, with entries 1 and 0]
$$M_{ij} = M_{ij} + \max_{k > i,\; l > j}\left(M_{k,\,j+1},\; M_{i+1,\,l}\right)$$
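The recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's code: it assumes the starting matrix holds 1 where two letters agree and 0 otherwise, and it fills the matrix from the lower right corner upward. The function names are hypothetical; the two sequences are the ones that appear in the alignment fragments of this example.

```python
def fill_match_matrix(a, b):
    """Fill M bottom-up: M[i][j] += max over column j+1 (rows > i) and row i+1 (columns > j)."""
    n, m = len(a), len(b)
    # Assumed starting values: 1 where the letters agree, 0 otherwise.
    M = [[1 if a[i] == b[j] else 0 for j in range(m)] for i in range(n)]
    # The last row and last column keep their starting values.
    for i in range(n - 2, -1, -1):
        for j in range(m - 2, -1, -1):
            best_in_column = max(M[k][j + 1] for k in range(i + 1, n))
            best_in_row = max(M[i + 1][l] for l in range(j + 1, m))
            M[i][j] += max(best_in_column, best_in_row)
    return M


def maximum_match(a, b):
    """Largest entry of the filled matrix, i.e. the best number of matched pairs."""
    return max(max(row) for row in fill_match_matrix(a, b))


print(maximum_match("ABCNJRQCLCRPM", "AJCJNRCKCRBP"))
```

The triple loop is cubic in the sequence lengths, which is acceptable for short examples like this one; it keeps the code close to the formula rather than optimizing it.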
http://www2.cs.uh.edu/~fofanov/
Results
Traceback: at each step, move one row and one column to the lower right and pick the maximum in this row or column.
Sequences: ABCNJRQCLCRPM and AJCJNRCKCRBP
[Figure: resulting alignments of the two sequences]
Sequences: HEAGAWGHEE (columns) and PAWHEAE (rows)
Scoring system
Match: 1
Mismatch: -1/3
Gap penalty: -(1 + k/3) for a gap of length k
Multiple Alignments
A multiple sequence alignment is a
sequence alignment of three or more
biological sequences
Score matrix (columns: HEAGAWGHEE, rows: PAWHEAE)
      H     E     A     G     A     W     G     H     E     E
P  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
A  0.00  0.00  1.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00
W  0.00  0.00  0.00  0.00  0.00  2.00  0.67  0.33  0.00  0.00
H  0.00  0.00  0.00  0.00  0.33  0.67  1.67  1.67  0.33  0.00
E  0.00  1.00  0.00  0.00  0.00  0.33  0.33  1.33  2.67  1.33
A  0.00  0.00  2.00  0.67  1.00  0.00  0.00  0.00  1.33  2.33
E  0.00  1.00  0.67  1.67  0.33  0.67  0.00  0.00  1.00  2.33
Resulting alignment:
AWGHE
AW-HE
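As a check on the scoring system (match 1, mismatch -1/3, gap of length k penalized by 1 + k/3), the short sketch below scores a gapped alignment directly. It is an illustration under those stated scores only, with hypothetical function names; it does not fill the matrix or perform the traceback. Exact fractions are used so that thirds do not accumulate rounding error.

```python
from fractions import Fraction

MATCH = Fraction(1)
MISMATCH = Fraction(-1, 3)


def gap_penalty(k):
    """Penalty for one gap of length k: -(1 + k/3)."""
    return -(1 + Fraction(k, 3))


def score_alignment(top, bottom):
    """Score two gapped strings of equal length, using '-' for gap positions."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = Fraction(0)
    run_len, run_side = 0, None  # current gap run and the row it sits in
    for x, y in zip(top, bottom):
        side = "top" if x == "-" else "bottom" if y == "-" else None
        if side != run_side and run_len:
            score += gap_penalty(run_len)  # close the gap run that just ended
            run_len = 0
        run_side = side
        if side is not None:
            run_len += 1
        elif x == y:
            score += MATCH
        else:
            score += MISMATCH
    if run_len:
        score += gap_penalty(run_len)
    return score


print(float(score_alignment("AWGHE", "AW-HE")))
```

Four matches and one gap of length 1 give 4 - 4/3 = 8/3, which rounds to the 2.67 found in the matrix.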
DNA Sequencing
Sanger Method
Techniques: chain-termination methods
1. Start at primer (restriction site)
2. Grow DNA chain
3. Include ddNTPs
5. Separate products by length, using gel electrophoresis
Shotgun Sequencing
Fragment Assembly
DNA is broken up
randomly into numerous
small segments, which
are sequenced using the
chain termination method
to obtain reads.
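The random fragmentation step can be imitated in software. The sketch below is only a toy model under stated assumptions: reads have a fixed length, start positions are drawn uniformly, and there are no sequencing errors. The function name, the read length, and the number of reads are all hypothetical; the example sequence is one that appears in the assembly figures of these slides.

```python
import random


def sample_reads(genome, read_len, n_reads, seed=0):
    """Draw n_reads substrings of length read_len at uniformly random positions."""
    if read_len > len(genome):
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads


GENOME = "TAGATTACACAGATTACTGACTTGATGGCGTAA"
for read in sample_reads(GENOME, read_len=8, n_reads=5):
    print(read)
```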
Layout
A variation of the shortest common
superstring problem
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
[Figure: base calls with quality values in selected columns of the aligned reads]
C: 20, C: 35, T: 30, C: 35, C: 40
C: 20, C: 35, C: 0, C: 35, C: 40
A: 15, A: 25, A: 40, A: 25
A: 15, A: 25, A: 0, A: 40, A: 25
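The slides do not say how the quality values are combined. One common approach, assumed here purely for illustration, is quality-weighted voting: sum the quality values for each base in a column and report the base with the largest total. The function name is hypothetical.

```python
from collections import defaultdict


def consensus_base(calls):
    """Pick the base with the largest summed quality (assumed voting rule)."""
    weight = defaultdict(int)
    for base, quality in calls:
        weight[base] += quality
    return max(weight, key=weight.get)


# Column taken from the figure: four reads call C, one read calls T.
column = [("C", 20), ("C", 35), ("T", 30), ("C", 35), ("C", 40)]
print(consensus_base(column))
```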
Problem: Given a set of strings, find a shortest string that contains all of them
Complexity: NP-complete
Score alignments
Accept alignments with good scores
SSP to TSP
[Figure: graph on the strings ATC, CCA, CAG, TCC, AGT with edge labels 2, 2, 2, 1; resulting superstring ATCCAGT]
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
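Because the shortest common superstring problem is NP-complete, practical assemblers rely on heuristics. The sketch below shows one standard heuristic, greedy merging of the pair with the largest suffix-prefix overlap. It is offered as an illustration with hypothetical function names, not as the method used in the slides, and it gives no guarantee of a shortest result in general.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Repeatedly merge the two strings with the largest overlap."""
    strings = list(dict.fromkeys(reads))  # remove exact duplicates, keep order
    # Discard reads that are already contained in another read.
    strings = [s for s in strings if not any(s != t and s in t for t in strings)]
    while len(strings) > 1:
        best = (-1, 0, 1)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (i, j)] + [merged]
    return strings[0]


print(greedy_superstring(["ATC", "CCA", "CAG", "TCC", "AGT"]))
```

On the five strings from the figure, the overlaps of length 2 form a single chain, so the greedy merge recovers ATCCAGT.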
Sequence
Aims
Dissect
body
Celera Genomics
History
First microarray prototype (1989)
First commercial DNA microarray prototype with 16,000 features (1994)
500,000 features per chip (2002)
l-mer composition
{ATG, AGG, TGC, TCC, GTC, GGT, GCA, CAG}
ATGCAGGTCC
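The l-mer composition of a string is simply the set of all its substrings of length l. A minimal sketch (the function name is hypothetical) confirms that the eight 3-mers listed here are exactly the composition of ATGCAGGTCC:

```python
def spectrum(sequence, l):
    """Set of all substrings of length l (the l-mer composition)."""
    return {sequence[i:i + l] for i in range(len(sequence) - l + 1)}


listed = {"ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"}
print(spectrum("ATGCAGGTCC", 3) == listed)
```

A set discards multiplicities, so an l-mer that occurs twice in the sequence is recorded once.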
Positional
Shotgun
Affymetrix GeneChip
Hybridization
SBH
Microarray Images
http://www-microarrays.u-strasbg.fr/base.php?page=affyExpressionE.php
RNA Degradation
Image Inspection
[Figure: RNA degradation plot; x-axis: Probe Number, 5' <-----> 3']
Detection Calls
Diagnostics plots
Image plot
PM/MM intensity plot
Histogram plot
Density plot
Scatter Plot
Boxplot
Discrimination score
$R = (PM - MM) / (PM + MM)$
User-definable threshold $\tau$ (default 0.015)
$R$ is compared with $\tau$ for all probe pairs of a given probe set
One-sided Wilcoxon signed rank test assigns each probe pair a rank based on how far the probe pair's discrimination score is from $\tau$
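The two ingredients of the detection call can be sketched as follows. This is a simplified illustration, not the vendor's implementation: it computes the discrimination score for each probe pair and the signed rank statistic (the sum of ranks of the pairs whose score lies above the threshold), but it stops short of converting that statistic into a p-value. The function names and the intensity values are made up for the example.

```python
def discrimination_scores(pm, mm):
    """R = (PM - MM) / (PM + MM) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def signed_rank_statistic(scores, tau=0.015):
    """Sum of the ranks of |R - tau| over probe pairs with R > tau."""
    diffs = [r - tau for r in scores if r != tau]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Tied absolute differences share the average of their ranks.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    return sum(rank for rank, d in zip(ranks, diffs) if d > 0)


# Hypothetical intensities for one probe set with four probe pairs.
pm = [300, 200, 150, 120]
mm = [100, 100, 100, 130]
scores = discrimination_scores(pm, mm)
print(scores, signed_rank_statistic(scores))
```

A large statistic means most probe pairs sit well above the threshold, which is the situation a one-sided test would flag as a present call.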
Affymetrix
[Figure: Density plot; axis labels: log intensity, density]
[Figure: Scatter plot]
Boxplot
Pre-analysis
Background correction
Normalization
Background correction
Different methods:
Negative control
Mismatch probes
Normalization
Why normalization?
Different methods:
Rank-invariant gene
Quantile: giving each chip the same empirical distribution; reducing variance without introducing drastic bias effects
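Quantile normalization, as described above, forces every chip to share one empirical distribution. A minimal sketch of the usual recipe follows: sort each chip, average the sorted values across chips rank by rank, then write those averages back in each chip's original rank order. The function name and the toy intensities are hypothetical, and ties are broken by position rather than averaged, which a production implementation would likely handle differently.

```python
def quantile_normalize(chips):
    """chips: list of equal-length intensity lists, one list per chip."""
    n_probes = len(chips[0])
    if any(len(chip) != n_probes for chip in chips):
        raise ValueError("all chips must have the same number of probes")
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean across chips at each rank.
    reference = [sum(values) / len(chips) for values in zip(*sorted_chips)]
    normalized = []
    for chip in chips:
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


# Two made-up chips on very different intensity scales.
print(quantile_normalize([[5, 2, 3, 4], [40, 10, 30, 20]]))
```

After normalization both chips contain the same four values, each placed according to the probe's original rank on its chip.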
http://microarray.cancer.emory.edu/analysis/normal.vmsr
Low-level analysis
Network analysis
Inferring and building regulatory networks.
Clustering
Two categories
Hierarchical methods
Divisive: successively splitting larger clusters, top-down
Agglomerative: successively merging smaller clusters, bottom-up
Partitional methods
Determines all clusters at once
e.g., K-means, SOM, etc.
Classification
k-nearest neighbors (KNN)
Classification Tree
Linear discriminant analysis (LDA)
Bayesian Regression
Support vector machines (SVM)
Artificial neural networks (ANN)
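Of the classification methods listed above, k-nearest neighbors is the simplest to write down: a new sample receives the majority label among its k closest training samples. The sketch below assumes Euclidean distance and simple majority voting; the function name and the two-gene expression profiles are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, label) pairs; returns the majority label."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical expression profiles (two genes) for two sample classes.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
print(knn_predict(train, (1.1, 1.0)), knn_predict(train, (5.0, 5.1)))
```

An odd k avoids tied votes when there are two classes; with real expression data the features would normally be scaled before distances are computed.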
References
Pathway analysis
Bayesian networks
Probabilistic relational model
[Figure: network diagram with nodes z1, z2, ..., zK; x1, x2, x3, ..., xM; y1, y2, y3, ..., yn-1, yn]