
Genome Organization - overview

Eukaryotic genomes are complex, and DNA amounts and organization vary widely between species.
C-value paradox: the amount of DNA in the haploid cell of an organism is not related to its evolutionary complexity or number of genes.

C-value Paradox

Drosophila has a genome about 20x smaller than the human genome and about 2x fewer genes.
Newt and lungfish genomes are ~5x and ~50x larger than the human genome, respectively.

Number of genes does increase in higher organisms

Re-association Kinetics

Cot = C0 x t (initial DNA concentration x time)

Complexity = sum length of all single copy (unique) sequence in a genome

There are different classes of eukaryotic DNA based on sequence complexity revealed by re-association kinetics
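
The link between Cot and the kinetic classes can be sketched with ideal second-order kinetics, where the fraction of DNA still single-stranded is C/C0 = 1/(1 + k x C0t) and half of the DNA has re-associated at C0t = 1/k. The sketch below assumes that idealized model; the rate constants are hypothetical illustration values, not measurements.

```python
def fraction_single_stranded(cot, k):
    """Fraction of DNA still single-stranded under ideal second-order kinetics.

    cot: initial DNA concentration x time
    k:   re-association rate constant (hypothetical values are used below)
    """
    return 1.0 / (1.0 + k * cot)


def cot_half(k):
    """Cot value at which half of the DNA has re-associated."""
    return 1.0 / k


# Hypothetical rate constants for three kinetic classes (illustration only).
classes = {
    "highly repetitive": 100.0,
    "moderately repetitive": 1.0,
    "single copy": 0.001,
}

for name, k in classes.items():
    print(f"{name}: Cot1/2 = {cot_half(k):g}, "
          f"single-stranded at Cot = 1: {fraction_single_stranded(1.0, k):.3f}")
```

In this model a larger Cot1/2 means slower re-association, which is the behaviour expected of unique (high-complexity) sequence.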

A time line of genomics research

Status of plant genome sequences as of January 2006:

The human genome


Two versions of the human genome sequence were published in February 2001.
DNA sequences that encode proteins make up only 5% of the genome.
~50% of the sequence is transposable elements.
Clusters of gene-rich regions are separated by gene deserts.
Chromosome 19 has the highest gene density; chromosomes 13 and Y show the lowest.

The human genome


Total gene number estimated at 30,000-40,000 (now maybe 25,000), with an average gene size of 27 kb.
Hundreds of genes share homology with those of bacteria.
The number of introns varies greatly (from 0 for histone genes to 234 for titin).

The human genome


Genes are larger and contain more and larger introns than those in invertebrates (the dystrophin gene is 2.5 Mb).
Genes are not evenly spaced on chromosomes.
The most common genes include those involved in nucleic acid metabolism (7.5%), receptors (5%), protein kinases (2.8%) and cytoskeletal structural proteins (2.8%).

The human genome - predicted gene function

Any 2 human genomes are roughly 99.9% identical

Chr - chromosome
n - number of samples examined
bp - number of base pairs sequenced
S - number of polymorphic sites
T - nucleotide divergence

On average ~ 0.1%
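
To make the ~0.1% figure concrete, the sketch below counts mismatches between two aligned sequences and then scales the per-site rate to a whole genome. The haploid genome size of roughly 3.2 billion bp is an assumption brought in for illustration (the slides do not state it), and the two short sequences are made up.

```python
def per_site_difference(seq_a, seq_b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


# Made-up aligned fragments: 1 difference in 20 sites.
a = "ACGTACGTACGTACGTACGT"
b = "ACGTACGTACTTACGTACGT"
print(per_site_difference(a, b))  # 0.05

# Assumption: haploid human genome of roughly 3.2e9 bp.
GENOME_BP = 3_200_000_000
expected_differences = 0.001 * GENOME_BP
print(f"{expected_differences:,.0f} differing sites at 0.1% divergence")
```

Under that assumed genome size, 0.1% still amounts to a few million differing sites between any two genomes.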

Przeworski, M., et al. (2000) Trends Genet 16, 296-302.

Yet phenotypic differences abound!

Genome organization in plants


Size of genome varies widely (100 Mb to 5,500 Mb).
Many tandem gene duplications and larger duplications; some interchromosomal duplications are also observed.
Large-genome plants also have genes clustered, with long stretches of intergenic DNA.
In maize, the intergenic sequences are composed mainly of transposons.

Genome Organization
gene identification
Genes can be difficult to identify or predict from genome sequence alone.
The human genome appears to contain fewer genes than originally predicted, but an estimated 35,000 genes produce an estimated 150,000 proteins.

Genome Organization
gene identification
No one-to-one correspondence between:
Genome (all genes of an organism)
Transcriptome (all transcripts of an organism)
Proteome (all proteins of an organism)

Variable estimates of human gene content

Gene identification - the simple view

Gene identification - the challenges

from Klug & Cummings 1997

Gene identification - the challenges
Non-coding sequences
Promoters and enhancers of gene expression can be distant from the coding region itself.
Genes can have alternative promoters.
Genes can have alternative terminators.

Gene identification - the challenges

Introns and exons
Most eukaryotic genes have introns.
Introns are often much longer than exons.
There are often many introns, so the mRNA is much shorter than the genomic DNA.
Intron size can vary between the same gene of different species.
Splice junctions are difficult to predict.
Alternative splicing

Gene identification - the challenges

Introns and exons
Eukaryotes only.
Splicing is the removal of internal parts of the newly transcribed RNA.
It takes place in the cell nucleus.

Introns and exons

Introns - numerous and longer than exons

Variable intron size - same gene, different organism

Introns - alternative splicing


Different splice patterns from the same sequence, therefore different products from the same gene.
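
A toy sketch of how alternative splicing gives several products from one gene: constitutive exons are always kept, while each cassette exon may be included or skipped. The exon sequences and the choice of which exons are optional are invented for illustration.

```python
from itertools import product


def splice_isoforms(exons, optional):
    """Enumerate mature transcripts from ordered exons.

    exons:    list of exon sequences in genomic order (introns already removed)
    optional: set of 0-based indices of cassette exons that may be skipped
    """
    # Each exon is either always kept, or offered as a keep/skip choice.
    choices = [(True, False) if i in optional else (True,)
               for i in range(len(exons))]
    isoforms = []
    for keep in product(*choices):
        isoforms.append("".join(e for e, k in zip(exons, keep) if k))
    return isoforms


# Hypothetical gene with four exons; exons 1 and 2 (0-based) are cassette exons.
exons = ["ATGGCC", "GATTAC", "CCGGAA", "TTTTAA"]
for transcript in splice_isoforms(exons, optional={1, 2}):
    print(transcript)
```

Two optional exons already give four distinct transcripts, so the number of possible products can grow quickly with the number of alternatively spliced exons.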

One gene, many proteins - alternative splicing & 3' cleavage

Exon shuffling
Different genes having similar exons

Why genome, transcriptome and proteome don't correlate in size

More sophisticated regulation of expression
Proteome vastly larger than genome:
Alternative splicing, promoters, terminators (59% of genes, with an average of 3 different products)
RNA editing
Post-translational modifications

Moonlighting
Same protein, different function depending on cellular location

Gene Identification
Open reading frames
Sequence conservation: database searches, synteny
Sequence features: CpG islands
Evidence for transcription: ESTs, microarrays
Gene inactivation: transformation, TEs, RNAi

Gene identification - Open reading frames


5'atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa

frame 1
atg ccc aag ctg aat agc gta gag ggg ttt
 M   P   K   L   N   S   V   E   G   F
tca tca ttt gag gac gat gta taa
 S   S   F   E   D   D   V   *

frame 2
tgc cca agc tga ata gcg tag agg ggt ttt cat
 C   P   S   *   I   A   *   R   G   F   H
cat ttg agg acg atg tat
 H   L   R   T   M   Y
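
The frame-by-frame translation above can be reproduced with a short sketch. It assumes the standard genetic code and scans only the forward strand; the function names `translate` and `open_reading_frames` are illustrative, not from any particular library.

```python
# Standard genetic code, built in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, offset=0):
    """Translate one forward reading frame; '*' marks a stop codon."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def open_reading_frames(seq, offset=0):
    """Peptides that start at M and run to a stop codon in one frame."""
    protein = translate(seq, offset)
    orfs = []
    for segment in protein.split("*")[:-1]:  # the last piece has no stop codon
        start = segment.find("M")
        if start != -1:
            orfs.append(segment[start:])
    return orfs


seq = "atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa"
print(translate(seq, 0))            # MPKLNSVEGFSSFEDDV*
print(translate(seq, 1))            # CPS*IA*RGFHHLRTMY
print(open_reading_frames(seq, 0))  # ['MPKLNSVEGFSSFEDDV']
```

Only frame 1 gives an uninterrupted run from a start codon to a stop codon, which is the signal an ORF search looks for. A fuller search would also cover frame 3 and the three frames of the reverse complement.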

Gene identification - Database searches


[Figure: alignment of related protein sequences returned by a database search; the residue-level alignment is not recoverable from this copy]

Gene identification - Synteny


Mouse-human synteny and sequence conservation
A) Blocks of synteny between mouse chromosome 11 and parts of 5 different human chromosomes
B) Enlarged block with perfect correspondence in order, orientation and spacing of 23 putative genes, and 245 conserved sequence blocks of > 100 bp with > 70% identity, many in noncoding regions
Caution! Even regions of high synteny may not show perfect gene-for-gene correspondence.
from Gibson & Muse (2002) A Primer of Genome Science, Sinauer Inc.

Gene identification - CpG islands

Defined as regions of DNA at least 200 bp in length that have a G+C content above 50% and a ratio of observed to expected CpGs close to or above 0.6.
Used to help predict gene sequences, especially promoter regions.
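
The three criteria above can be checked directly on a candidate sequence. This sketch assumes the usual observed/expected formula, (number of CpG x length) / (number of C x number of G), and treats the whole input as one window rather than sliding along a genome. The function names and test sequences are illustrative.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on this strand
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the length, G+C and observed/expected thresholds to one window."""
    gc, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc > min_gc and ratio >= min_ratio


# Artificial test sequences (not real genomic DNA).
print(looks_like_cpg_island("CG" * 100))  # True: GC-rich and CpG-rich
print(looks_like_cpg_island("CAG" * 70))  # False: GC-rich but no CpG at all
print(looks_like_cpg_island("CG" * 50))   # False: shorter than 200 bp
```

The second example shows why G+C content alone is not enough: a sequence can be GC-rich and still depleted in CpG dinucleotides.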

Gene identification - evidence of transcription

Sequencing libraries of cDNA clones yields expressed sequence tags (ESTs), which are not necessarily full-length.

Genome Organization
duplicated genes
Gene families: paralogs, orthologs (homologs)

Pseudogenes

Duplicated genes
Paralogs = evolved one from another through gene duplication
Encode closely related proteins
Formed by duplication of an ancestral gene followed by mutation

Five functional genes and two pseudogenes

Pseudogenes
Nonfunctional copies of genes
Formed by duplication of an ancestral gene, or by reverse transcription and integration of the cDNA
Not expressed due to mutations that produce a premature stop codon (nonsense) or a frameshift, or mutations that prevent mRNA transcription or processing

Duplicated genes
Can be clustered as in -globin cluster, or dispersed in genome as seen for entire globin family in humans

Duplicated genes
Paralogs vs orthologs (or homologs)
Different members of the globin gene family are paralogs, having evolved one from another through gene duplication. Paralogs are separated by a gene duplication event.
Each specific family member (e.g. globin human) is an ortholog (homolog) of the same family member in another species. Both evolved from an ancestral globin gene. Orthologs (homologs) are separated by a speciation event.
It is not always easy to distinguish true orthologs from paralogs when comparing large multigene families between species, especially in polyploid organisms!

Genome Organization - transcripts that do not encode proteins (ncRNA)

< 5% of the higher eukaryotic genome is protein coding
~97-98% of the transcriptional output of the human genome is ncRNA
Introns
Transfer RNAs (tRNA)
~500 tRNA genes in the human genome

Ribosomal RNAs
Tandem arrays on several chromosomes
150-200 copies of the 28S-5.8S-18S cluster
200-300 copies of the 5S cluster

Genome Organization - ncRNA


~97-98% of the transcriptional output of the human genome is ncRNA
Small nucleolar RNAs (snoRNAs)
  Single genes
  Modify rRNAs
Small nuclear RNAs (snRNAs)
  Spliceosomes
Small regulatory RNAs
  Micro RNAs (miRNA)
  Short interfering RNAs (siRNA)
  Participate in transcriptional and nontranscriptional gene silencing, and regulation of translation
  Many come from intergenic regions recently recognized as transcribed

Genome Organization - ncRNA


97-98% of the transcriptional output of the human genome is ncRNA
Longer regulatory RNAs
ncRNAs derived from introns of protein-coding genes, and from introns and exons of non-protein-coding genes, constitute the majority of the genomic programming in higher organisms.
Explains why very different organisms show little difference in protein-coding sequence.

Genome Organization
repetitive DNA
~50% of human genome
Moderately repeated DNA
Tandemly repeated rRNA, tRNA and histone genes (gene products needed in high amounts)
Large duplicated gene families
Mobile DNA
Segmental duplications

Repetitive DNA - Segmental duplications


Found especially around centromeres and telomeres
Often come from nonhomologous chromosomes
Many can come from the same source
Tend to be large (10 to 50 kb)
Unique to humans?

Repetitive DNA - Segmental duplications

Repetitive DNA - Transposon derived repeats

Most of the moderately repeated DNA sequences found throughout higher eukaryotic genomes (45% of human genome)
Some encode enzymes that catalyze movement
Long interspersed elements (LINE) - retrotransposons
Short interspersed elements (SINE) - retrotransposons
LTR (long terminal repeat) retrotransposons
DNA transposons

Repetitive DNA - Transposon derived repeats

Repetitive DNA - Transposon derived repeats
Different regions of the genome differ in density of repeats
Most LINEs accumulate in AT-rich regions
Alu elements accumulate in GC-rich regions

Genome Organization
repetitive DNA
Simple-sequence repeats
3% of genome
Highly repeated short sequences found in centromeres and telomeres
Variable numbers of tandem repeats (VNTR) dispersed throughout the genome

Repetitive DNA - Highly repetitive satellite DNA

Repetitive DNA - VNTRs

Dispersed throughout the genome
1-13 base repeat unit: microsatellite, SSR
  includes trinucleotide repeats in protein coding genes
14-500 repeats: minisatellites

Used as mapping and fingerprinting markers
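
Because the number of repeat copies differs between individuals, a marker scan mostly comes down to locating tandem repeats and counting copies. The sketch below finds perfect tandem repeats with a unit of 1-13 bases using a regular expression. The minimum copy number and the test sequence are arbitrary choices for illustration, and real repeats are often imperfect, which this simple pattern does not handle.

```python
import re


def find_tandem_repeats(seq, max_unit=13, min_copies=4):
    """Find perfect tandem repeats with a repeat unit of 1..max_unit bases.

    Returns a list of (start position, repeat unit, number of copies).
    """
    # Lazy unit match so the shortest repeating unit is reported;
    # the backreference then demands at least min_copies - 1 further copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits


# Made-up sequence with a trinucleotide and a dinucleotide repeat.
seq = "TTGA" + "CAG" * 5 + "TT" + "AC" * 6 + "GG"
print(find_tandem_repeats(seq))  # [(4, 'CAG', 5), (21, 'AC', 6)]
```

Comparing the copy counts at the same locus across samples is what turns such repeats into mapping or fingerprinting markers.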

Overview of human genome composition
