October 21, 2003
Computational Techniques in
Comparative Genomics
Outline
Fundamental concepts of comparative
genomics
Alignment and visualization tools
Information available through genome
browsers
Gene prediction and identification
Identifying regulatory sequences
Insights from Human-Mouse sequence
comparisons
Multi-species sequence analysis
Genome Wide Sequence Data
‘Finished’
Human
Tetraodon
Pufferfish
Mouse
Zebrafish
Chimpanzee
Macaque
Xenopus
What can be more curious than that
the hand of a man, formed for grasping, that
of a mole for digging, the leg of the horse, the
paddle of the porpoise, and the wing of the
bat, should all be constructed on the same
pattern, and should include similar bones, in
the same relative positions?
Charles Darwin, The Origin of Species (1859)
Similarity in Morphology
Similarity in Genetics
Comparative Genomics
Find sequences that have diverged less than we
expect
These sequences are likely to have a functional role
Our expectation is related to the time since the
last common ancestor
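The idea of "diverged less than we expect" can be sketched numerically. The example below assumes the Jukes-Cantor substitution model, which the slides do not name, to turn a neutral distance into an expected fraction of identical sites. The function names and the `margin` parameter are hypothetical and chosen only for illustration.

```python
import math

def expected_identity(d):
    """Expected fraction of identical sites between two sequences
    separated by d neutral substitutions per site (Jukes-Cantor model,
    an assumption made for this sketch)."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

def diverged_less_than_expected(observed_identity, d, margin=0.10):
    """True when the observed identity exceeds the neutral expectation
    by more than `margin` (an arbitrary, illustrative cutoff)."""
    return observed_identity > expected_identity(d) + margin

# The larger the neutral distance, the lower the identity expected by chance,
# so the same observed identity is more striking in a distant comparison.
for d in (0.01, 0.5, 2.0):
    print(d, round(expected_identity(d), 3))
```

The distance d grows with the time since the last common ancestor, which is why the expectation differs between a human-chimpanzee and a human-zebrafish comparison.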
Evolutionary Distance
Human
Chimpanzee
Horse
Rat
Platypus
Zebrafish
Comparative Genomics
Similarity in Identity
Similarity in Function
French: Mon nom est Elliott. Je vous présente un séminaire à propos de génomique comparative.
German: Mein Name ist Elliott. Ich präsentiere ein Seminar über Comparativ Genomics.
English: My name is Elliott. I am presenting a seminar on comparative genomics.
Sequence Alignments
100% Identical
Species 1 CATGGGCAAATTGGCCCATTGGCCATGGGGGCCCACCGTA
          ||||||||||||||||||||||||||||||||||||||||
Species 2 CATGGGCAAATTGGCCCATTGGCCATGGGGGCCCACCGTA
80% Identical
Species 1 CATGGGCAAATTGGCCCATTGGCCATGGGGGCCCACCGTA
          || |||| |||  ||| |||||| |||||| |||| ||||
Species 2 CACGGGCTAATCCGCCAATTGGCTATGGGG-CCCAGCGTA
30% Identical
Species 1 CATGGGCAAATTGGCCCATTGGCCATGGGGGCCCACCGTA
| | | | || | | | | |
Species 2 CACGAACTAATCCGCCAATAGCCTATAGCG-CACAGCGAA
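Percent identity can be computed directly from a pairwise alignment. The sketch below applies one common convention (matching columns divided by all alignment columns, with gap columns counted as mismatches) to the first two pairs shown above. The function name is hypothetical, and other conventions, such as excluding gap columns, give slightly different values.

```python
def percent_identity(a, b, gap="-"):
    """Percent of alignment columns where both sequences carry the same base.
    Gap columns count as mismatches (one convention among several)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return 100.0 * matches / len(a)

species1 = "CATGGGCAAATTGGCCCATTGGCCATGGGGGCCCACCGTA"
identical = "CATGGGCAAATTGGCCCATTGGCCATGGGGGCCCACCGTA"
diverged = "CACGGGCTAATCCGCCAATTGGCTATGGGG-CCCAGCGTA"

print(percent_identity(species1, identical))  # 100.0
print(percent_identity(species1, diverged))   # 80.0
```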
PipMaker vs. VISTA
Visualization
PipMaker
http://bio.cse.psu.edu/pipmaker/
Percent Identity Plot
http://bio.cse.psu.edu/pipmaker/
. : . : . : . : . :
1 GACATCTAAATTGCCTATTTT ATGCCTTTATGTATTGTAGAAATCTGCC
||||:|||::| :|||||||-|||::|||:|||||| ||||||:|||||
575 GACACCTAGGTATTCTATTTTTATGTTTTTGTGTATTCTAGAAACCTGCC
50 . : . : . : . : . :
50 TTACTGTTTTGTGTAGCCACAGAACAGAAATAGACTAACTTTTTTTT
|||| ::| |:||: :: | :||||||||::||| || | ||:||---
625 TTACAACTGTATGCCATGAAGGAACAGAAGCAGAAACACATATTCTTTTA
100 . : . : . : . : . :
97 AGTAAACTCTCTGGAAACAAAAATCTTCCCAGATATTTATTGTTA
-----|:|||||:||--||:| ||||| |||||||||||||| |||| |
675 AAAAGAATAAACCCT GGGACCAAAATTCTTCCCAGATATTGATTGATC
150 . : . : . : . : . :
142 GGAAAATATAATCTAAAAATTCTTCTGCCCAACCCCTTGGCTGCATCCCA
||||||| ||||||| |||:|||||||||| ::|||||:||::||:||||
723 GGAAAATCTAATCTACAAACTCTTCTGCCCTGTCCCTTAGCCACACCCCA
200 . :
192 GTCTTCCATC
||||| ||||
773 GTCTTGCATC
PipMaker
http://bio.cse.psu.edu/pipmaker/
Percent Identity Plot
Available in 3 Flavors:
– Regular
• No Additional Options
– Advanced
• Different Alignment Strategies
• Additional Output Summaries
– MultiPipMaker
• Multi-species display
MultiPipMaker
PipTools
“Show me all alignments that are at least X%
identity and Y% length”
– strong-hits
Coordinate Conversion
– transform-pos
– shift-pos
– where-hit
Coordinate Extraction from GenBank Files
– genbank2exons
– genbank2repeats
Laj – Interactive Alignment Viewer
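The kind of query quoted above for strong-hits can be illustrated with a small filter. This is only a sketch of the idea: it does not call PipTools or reproduce its options. The `Hit` record, its field names, and the example values are hypothetical, and the length threshold is interpreted here as a minimum length in bases on the first sequence, which is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One gap-free aligned segment (hypothetical record layout)."""
    start1: int      # start in sequence 1 (1-based, inclusive)
    end1: int        # end in sequence 1 (inclusive)
    start2: int      # start in sequence 2
    end2: int        # end in sequence 2
    identity: float  # percent identity of the segment

def strong_hits(hits, min_identity, min_length):
    """Keep segments with at least min_identity percent identity
    and at least min_length bases on sequence 1."""
    return [
        h for h in hits
        if h.identity >= min_identity and (h.end1 - h.start1 + 1) >= min_length
    ]

# Made-up segments, for illustration only.
example_hits = [
    Hit(1, 200, 575, 774, 82.0),
    Hit(300, 340, 900, 940, 95.0),
    Hit(500, 900, 1200, 1600, 61.0),
]
for hit in strong_hits(example_hits, min_identity=70, min_length=100):
    print(hit)
```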
VISTA
http://www-gsd.lbl.gov/vista/
[Figure: VISTA plot with percent identity labels ranging from 52% to 88%]
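VISTA-style plots are usually described as percent identity computed in a window that slides along the alignment. A minimal sketch of that calculation follows, assuming a fixed window and step; the function name and the tiny window size are illustrative choices, not VISTA's own settings.

```python
def sliding_identity(a, b, window, step, gap="-"):
    """Return (window start, percent identity) pairs along an alignment.
    Starts are 0-based alignment columns; gap columns count as mismatches."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    match = [1 if x == y and x != gap else 0 for x, y in zip(a, b)]
    points = []
    for start in range(0, len(a) - window + 1, step):
        points.append((start, 100.0 * sum(match[start:start + window]) / window))
    return points

seq1 = "CATGGGCAAATTGGCCCATTGGCCATGGGGGCCCACCGTA"
seq2 = "CACGGGCTAATCCGCCAATTGGCTATGGGG-CCCAGCGTA"

# A 10-column window is far smaller than anything used in practice;
# it is chosen here only so the short example yields several points.
for start, pid in sliding_identity(seq1, seq2, window=10, step=10):
    print(start, pid)
```

A PIP, by contrast, draws each gap-free aligned segment at its own percent identity, so the two displays summarize the same alignment in different ways.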
What’s Your Preference?
PipMaker
VISTA
Genome Browsers
http://genome.ucsc.edu
http://www.ensembl.org
http://www.ncbi.nlm.nih.gov/mapview/
Comparative Sequence Tracks
Neutral Evolution
Types of Neutrally Evolving DNA
Ancestral Repeats
– Ancient Relics of Transposons Inserted Prior
to the Eutherian Radiation
Berkeley Genome Pipeline
http://pipeline.lbl.gov/
Additional Gene Prediction Resources
SLAM – http://baboon.math.berkeley.edu/~syntenic/slam.html
– Cawley et al. (2003) Nucleic Acids Research 31:3507-3509
Box 1 from:
– Ureta-Vidal et al. (2003) Nature Reviews Genetics 4:251-262
Motif Finding
Identify Over-Represented
Patterns
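One simple way to look for over-represented patterns is to count every k-mer in a set of sequences and compare each count with what base composition alone would predict. The sketch below does that with a zero-order background. It is not the algorithm of any particular motif finder, and the sequences, function names, and thresholds are made up for illustration.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every k-mer (overlapping) across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def over_represented(seqs, k, min_count):
    """Return (kmer, observed count, observed/expected ratio), sorted by ratio.
    The expected count assumes independent bases at their observed frequencies."""
    counts = kmer_counts(seqs, k)
    positions = sum(counts.values())
    base_counts = Counter("".join(seqs))
    total_bases = sum(base_counts.values())
    freq = {base: n / total_bases for base, n in base_counts.items()}
    scored = []
    for kmer, n in counts.items():
        if n < min_count:
            continue
        p = 1.0
        for base in kmer:
            p *= freq[base]
        scored.append((kmer, n, n / (positions * p)))
    return sorted(scored, key=lambda item: item[2], reverse=True)

# Hypothetical upstream sequences sharing one pattern.
upstream = ["AATGACGTCA", "CCTGACGTCAGG", "TGACGTCATT"]
for kmer, n, ratio in over_represented(upstream, k=6, min_count=3):
    print(kmer, n, round(ratio, 1))
```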
Phylogenetic Footprinting
FootPrinter – http://bio.cs.washington.edu/software.html
Takes the phylogeny into account
Coordinately Regulated Genes
Orthologous Genes
Human
Additional Species
Identify Conserved Motifs
Prioritize Non-Coding Sequence Conservation
Science (2000) 288:136-140
Insights from Human-Mouse Sequence
Comparisons
Similar gene content and
linear organization
– ~340 syntenic blocks
Difference in genome size
– Mouse genome is 14% smaller
Sequence Conservation
– ~40% in Alignments
– ~5% Under Selection
• ~1.5% Protein Coding
• ~3.5% Non-Coding
[Figure: genome-wide distribution of conservation scores, showing neutrally evolving and actively conserved sequence]
Phylogenetic Shadowing
Boffelli et al. (2003) Science 299:1391-1394.
Nature 424:788-793
http://genome.ucsc.edu/cgi-bin/hgGateway?org=Zoo
Human
Mouse
Rat
Zebrafish
Chicken
Chimpanzee
Pufferfish
Multiple Other Species
Multi-Species Weighted Conservation Score
Takes into Account the Different
Divergence Rates of Each Species
– “A Chicken Alignment Will Contribute More
Than a Baboon Alignment”
Based On the Substitution Rates at Bases
under Neutral Selection
– Calculated from 4-Fold Degenerate Positions
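The two points above can be sketched in code: estimate each species' neutral divergence from 4-fold degenerate positions, then let more diverged species contribute more to a per-base score. This is an illustrative weighting only, not the published scoring scheme. The function names and the example rates are hypothetical, and the divergence estimate is a raw fraction with no correction for multiple substitutions.

```python
# First two bases of the eight codon families in the standard genetic code
# whose third position is 4-fold degenerate (any base gives the same amino acid).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}

def fourfold_divergence(ref_cds, other_cds):
    """Fraction of 4-fold degenerate third positions that differ between two
    aligned, in-frame coding sequences. Codons with gaps are skipped."""
    sites = 0
    diffs = 0
    for i in range(0, min(len(ref_cds), len(other_cds)) - 2, 3):
        ref_codon = ref_cds[i:i + 3]
        other_codon = other_cds[i:i + 3]
        if "-" in ref_codon or "-" in other_codon:
            continue
        if ref_codon[:2] == other_codon[:2] and ref_codon[:2] in FOURFOLD_PREFIXES:
            sites += 1
            if ref_codon[2] != other_codon[2]:
                diffs += 1
    return diffs / sites if sites else 0.0

def weighted_conservation(matches, neutral_rate):
    """Score one reference base from 0 to 1.
    matches: species -> True if its aligned base equals the reference base.
    neutral_rate: species -> neutral divergence used as that species' weight."""
    total = sum(neutral_rate[s] for s in matches)
    agree = sum(neutral_rate[s] for s, same in matches.items() if same)
    return agree / total if total else 0.0

# Made-up rates: a distant species carries more weight than a close one.
rates = {"baboon": 0.05, "chicken": 0.5}
print(weighted_conservation({"baboon": True, "chicken": False}, rates))
print(weighted_conservation({"baboon": False, "chicken": True}, rates))
```

With these weights a match in chicken raises the score far more than a match in baboon, which is the behaviour the quoted slogan describes.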
Multi-Species Conservation Score Distribution
[Figure: histogram of conservation scores (x-axis: Conservation Score, -5 to 9; y-axis: fraction of total coding or non-coding bases, 0.00 to 0.30), plotted separately for coding and non-coding bases]
MCSs: Multi-Species Conserved Sequences
1.8 Mb Sequence Data
[Figure: pie charts with labels 95% and 5%; Coding 22%, UTRs (3.8%), Ancient Repeats (2.6%), Unknown ~72%]
[Figure: CFTR region with Conservation Score and MCS tracks, and rows for Cat, Dog, Cow, Pig, Horse, Rabbit, Hedgehog, Rat, Mouse, Platypus, Opossum, Chicken, Fugu]
[Figure: bases (0 to 240,000) versus Percent Identity Threshold (70 to 96); series: Total MCS Bases, Detected, Missed, ‘False Positives’]
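The three quantities in this plot can be computed with set operations once every base has a pairwise percent identity and the reference MCS bases are known. The sketch below assumes that setup; the function name, the input layout, and the toy numbers are hypothetical.

```python
def evaluate_threshold(identity_by_pos, reference_mcs, threshold):
    """Count MCS bases detected, MCS bases missed, and 'false positive' bases
    when calling every base whose pairwise identity is >= threshold.
    identity_by_pos: position -> percent identity of the window covering it.
    reference_mcs: set of positions that lie in multi-species conserved sequences."""
    called = {pos for pos, pid in identity_by_pos.items() if pid >= threshold}
    detected = called & reference_mcs
    missed = reference_mcs - called
    false_positives = called - reference_mcs
    return len(detected), len(missed), len(false_positives)

# Toy data: five positions, three of them in the reference MCS set.
identity = {1: 95, 2: 85, 3: 75, 4: 65, 5: 90}
reference = {1, 2, 4}
for threshold in (60, 70, 80):
    print(threshold, evaluate_threshold(identity, reference, threshold))
```

Raising the threshold trades false positives for missed bases, which is the tension the plot illustrates for a single pairwise comparison.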
Subsets of Species Perform Better
[Figure: bases (0 to 80,000) for Coding, UTRs, Ancient Repeats, Non-Coding, and Reference Set MCSs; x-axis: number of species, from ALL and 12 down to 2, then Hedgehog]