
Genome sequencing programs
1st yr Cells and Genes, Jane Langdale

lecture outline
1. human genome project
2. hierarchical genome sequencing
3. shotgun genome sequencing
4. genome annotation
5. sequencing technology

genome sizes
                              CLASSIFICATION   BASE PAIRS
E. coli                       Bacterium        4,000,000
Saccharomyces cerevisiae      Yeast            14,000,000
Arabidopsis thaliana          Plant            100,000,000
Caenorhabditis elegans        Nematode         100,000,000
Drosophila melanogaster       Insect           165,000,000
Mus musculus                  Mammal           3,000,000,000
Homo sapiens                  Mammal           3,500,000,000

human genome project
Why: what new to learn? what effect on 'real' science?
How: 'head to tail' or 'cream skimming'?
Funding: who should pay?
whose genome?

chromosome maps
[Figure: four levels of chromosome map: (a) karyotypic map, (b) linkage map (scale in cM), (c) physical map (scale in kb), (d) sequence map]

physical map of chromosome
(a) Identify an ordered series of overlapping genomic clones.
(b) Analyze each clone for restriction sites and gene locations.
(c) Create maps of overlapping genomic clones.
(d) Combine information into a single continuous physical map (a contig) that spans the length of the chromosome.

hierarchical genome sequencing
construct a BAC library from genomic DNA
work out which BAC clones are overlapping
subclone each BAC into plasmids
sequence plasmid clones

BAC fingerprinting
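
The step "work out which clones are overlapping" can be illustrated with a minimal sketch. It assumes each clone's fingerprint is simply the set of restriction fragment sizes it produces, and that two clones overlap if they share enough fragment sizes. The clone names, fragment sizes and the `min_shared` threshold are hypothetical, and real fingerprinting software allows for measurement error in fragment sizes, which this sketch does not.

```python
from itertools import combinations

# Hypothetical fingerprints: restriction fragment sizes (bp) seen for each clone.
fingerprints = {
    "BAC_1": {1200, 3400, 5100, 800, 2600},
    "BAC_2": {5100, 800, 2600, 4300, 950},
    "BAC_3": {4300, 950, 7000, 1500},
    "BAC_4": {9900, 640, 3100},
}


def overlapping_pairs(fingerprints, min_shared=2):
    """Return (clone_a, clone_b, shared_sizes) for every pair of clones
    that have at least `min_shared` fragment sizes in common."""
    pairs = []
    for (name_a, frags_a), (name_b, frags_b) in combinations(sorted(fingerprints.items()), 2):
        shared = frags_a & frags_b  # fragment sizes present in both clones
        if len(shared) >= min_shared:
            pairs.append((name_a, name_b, sorted(shared)))
    return pairs


for name_a, name_b, shared in overlapping_pairs(fingerprints):
    print(name_a, "overlaps", name_b, "via fragments", shared)
```

In this toy data BAC_1, BAC_2 and BAC_3 chain together into one contig, while BAC_4 shares no fragments with the others and would sit on the far side of a gap.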

contig gaps
[Figure: sequenced DNA aligned against cloned DNA, showing a sequence gap (region present in cloned DNA but not yet sequenced) and a physical gap (region not covered by cloned DNA)]

shotgun genome sequencing
[Figure: 2, 10 and 50 kb fragments cloned in plasmids]
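
Shotgun sequencing depends on reassembling random reads by their overlaps. The following is a minimal sketch of that idea using a greedy merge of the best-overlapping pair of reads. It is an illustration only, not the assembler used by any genome project: the toy genome, the read positions and the `min_len` threshold are all assumptions, and the sketch ignores sequencing errors, repeats and reverse-strand reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b`
    (0 if no overlap of at least `min_len` bases exists)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap.
    Returns the remaining contigs; more than one contig means gaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy example: four overlapping reads taken from a made-up 25 bp "genome".
genome = "ATGGCGTACGTTAGCCTAGGATCCA"
reads = [genome[12:22], genome[0:10], genome[18:25], genome[6:16]]
print(greedy_assemble(reads))
```

If two reads share no overlap, the function returns them as separate contigs, which is the situation the gaps in a shotgun assembly correspond to.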

the race
                        public          Celera
start date              1987            1998
amount seq by 1997      5%              0%
method                  hierarchical    shotgun
completion of draft     2001            2001
people                  ?               65
bp per day              ?               90 million
cost                    $330 million    $300 million

'polishing' the human genome
                              Jun 2002                         Jul 2003
Estimated size                3041.74 Mb                       3069.43 Mb
Total mapped                  2810.22 Mb (92.39%)              2843.41 Mb (92.64%)
No. of supercontigs           18567                            350
In supercontigs > 10 Mb       627.56 Mb (29 supercontigs)      2307.65 Mb (76 supercontigs)
In supercontigs > 1 Mb        1868.00 Mb (550 supercontigs)    2789.20 Mb (199 supercontigs)
In supercontigs > 100 kb      2836.36 Mb (2643 supercontigs)   2842.38 Mb (332 supercontigs)

human genome summary
Assembly: GRCh37, Feb 2009
Base Pairs: 3,272,480,989
Genebuild by: Ensembl
Genebuild started: Mar 2009
Genebuild released: May 2009
Genebuild last updated/patched: Jul 2009

Gene counts
Known protein-coding genes: 23,438
Novel protein-coding genes: 183
Pseudogenes: 12,346
RNA genes: 6,407
Immunoglobulin/T-cell receptor gene segments: 122
Gene exons: 528,281
Gene transcripts: 140,426

Other
Genscan gene predictions: 43,887
SNPs: 17,999,182

problem ...
[Figure: a page of raw, unannotated nucleotide sequence]

where are the data?


- GenBank (National Center for Biotechnology Information, NCBI)
- Nucleotide Sequence Database (European Molecular Biology Laboratory, EMBL)
- DNA Databank of Japan


annotation
All sequences in the database are annotated:
origin - species, tissue, cell line, clone
background information - literature, researcher
important regions of the sequence - promoter, introns, coding sequence, motifs
links to protein sequence, and other information
BUT: dependent on researchers for entry so can contain errors

how to annotate
identify open reading frames > 100 codons
search amino acid and nucleotide databases with ORFs
identify repeat sequences
identify known targeting sequences
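
The first annotation step, finding open reading frames longer than 100 codons, can be sketched as a scan of all three frames on both strands for an ATG followed in frame by a stop codon. This is a simplified illustration under stated assumptions: the function name `find_orfs`, the parameter `longer_than` and the coordinate convention are hypothetical, and the sketch ignores introns, so it would only suit intron-free sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, longer_than=100):
    """Return (strand, frame, start, end) for every ATG...stop stretch with
    more than `longer_than` codons before the stop codon.
    Coordinates are 0-based on the scanned strand; `end` is just past the stop."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos  # first in-frame start codon
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (pos - start) // 3  # codons before the stop
                    if n_codons > longer_than:
                        orfs.append((strand, frame, start, pos + 3))
                    start = None
    return orfs


# Toy sequence with a 6-codon ORF; a low threshold is used so it is reported.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, longer_than=3))
```

With the default threshold of 100 codons the toy sequence yields no ORFs, which is the point of the cut-off: short ATG-to-stop stretches occur by chance and are not treated as candidate genes.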

how to find related genes


Basic Local Alignment Search Tool (BLAST)
http://www.ncbi.nlm.nih.gov:80/BLAST/
compares DNA and protein sequences - different varieties:

BLASTP = amino acid query sequence against a protein sequence database
BLASTN = nucleotide query sequence against a nucleotide sequence database
BLASTX = nucleotide query sequence translated in all 6 reading frames against a protein sequence database
TBLASTN = protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
TBLASTX = 6-frame translations of a nucleotide query sequence against the 6-frame translations of a nucleotide sequence database
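
BLASTX, TBLASTN and TBLASTX all rely on translating nucleotide sequence in six reading frames: three offsets on the given strand and three on its reverse complement. The sketch below shows that translation step only, using the standard genetic code. It is not BLAST itself and performs no database search; the function names and the frame labels (`+1` to `-3`) are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate codon by codon; '*' marks a stop, 'X' an unrecognised codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return the six conceptual translations of a DNA sequence, keyed by frame."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Translating both query and database in this way is what lets a protein-level comparison detect related genes even when the nucleotide sequences have diverged.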

how to determine gene function
[Figure: approaches to determining gene function; legible labels include protein expression and two-hybrid screen]

automated dideoxy sequencing

[Figure: automated dideoxy sequencing]

new generation sequencing
                  Roche (454)       Illumina               SOLiD
Chemistry         Pyrosequencing    Polymerase-based       Ligation-based
Amplification     Emulsion PCR      Bridge amplification   Emulsion PCR
Mb/run            100 Mb            1300 Mb                3000 Mb
Time/run          7 h               4 days                 5 days
Read length       250 bp            32-40 bp               35 bp
Cost per run      $8439             $8950                  $17447
Cost per Mb       $84.39            $5.97                  $5.81

'454' sequencing

'Illumina' sequencing

'SOLiD' sequencing

[Figure: sequencing workflow; legible labels are sample preparation, cluster growth, sequencing, image acquisition and base calling]
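
The per-run figures in the new generation sequencing table can be turned into rough throughput numbers. The sketch below computes Mb per hour and an approximate number of reads per run for each platform. The dictionary layout and function name are hypothetical, days are converted to 24-hour days, and the Illumina read length of 36 bp is an assumed midpoint of the 32-40 bp range in the table.

```python
# Values taken from the table; Illumina read length is an assumed midpoint.
platforms = {
    "Roche (454)": {"mb_per_run": 100, "hours_per_run": 7, "read_length_bp": 250},
    "Illumina": {"mb_per_run": 1300, "hours_per_run": 4 * 24, "read_length_bp": 36},
    "SOLiD": {"mb_per_run": 3000, "hours_per_run": 5 * 24, "read_length_bp": 35},
}


def summarise(platforms):
    """Return Mb sequenced per hour and approximate reads per run for each platform."""
    summary = {}
    for name, p in platforms.items():
        summary[name] = {
            "mb_per_hour": p["mb_per_run"] / p["hours_per_run"],
            "reads_per_run": round(p["mb_per_run"] * 1_000_000 / p["read_length_bp"]),
        }
    return summary


for name, stats in summarise(platforms).items():
    print(f"{name}: {stats['mb_per_hour']:.1f} Mb/h, about {stats['reads_per_run']:,} reads per run")
```

The comparison makes the trade-off in the table explicit: the short-read platforms produce far more reads per run, while the 454 platform produces fewer but longer reads.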

SUMMARY
genome sequence programs can adopt either a hierarchical or random shotgun approach
random shotgun sequences often have gaps and < 2 x coverage
finished sequences provide telomere to telomere sequence for each chromosome
genome annotation is dependent on bioinformatics
second generation sequencing technology revolutionized the speed with which new sequence data can be obtained
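
The link between coverage and gaps in the summary can be made concrete with a small calculation. Coverage is the total number of bases sequenced divided by the genome size. Under an idealised model in which reads land uniformly at random (the Lander-Waterman model, which is an assumption added here and not part of the lecture), the expected fraction of the genome left unsequenced is about e^(-coverage). The read length of 500 bp is also a hypothetical value.

```python
import math


def coverage(n_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return n_reads * read_length / genome_size


def expected_uncovered_fraction(c):
    """Expected fraction of bases with no read, assuming uniformly random reads."""
    return math.exp(-c)


genome_size = 3_500_000_000  # Homo sapiens, from the genome sizes table
read_length = 500            # hypothetical read length in bp

for n_reads in (7_000_000, 14_000_000, 35_000_000, 70_000_000):
    c = coverage(n_reads, read_length, genome_size)
    gap = expected_uncovered_fraction(c)
    print(f"{n_reads:>11,} reads: {c:.0f}x coverage, about {gap:.2%} of bases unsequenced")
```

In this model a substantial fraction of the genome is still missing at 2x coverage, which is consistent with low-coverage shotgun sequences having gaps.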

essential reading
Micklos et al. (2002)
DNA Science: A first course

Chapters 6 CSHL Press

see also

http://www.sanger.ac.uk/HGP
