Genome

Genome Organization overview
Eukaryotic genomes are complex, and DNA amounts and organization vary widely between species.
C-value paradox: the amount of DNA in the haploid cell of an organism is not related to its evolutionary complexity or number of genes.
C-value Paradox
Drosophila has a 20x smaller genome than human and 2x fewer genes.
Newt and lungfish genomes are ~5x and ~50x larger than the human genome.
Number of genes does increase in higher organisms
Re-association Kinetics
Cot = initial DNA concentration (C0) x time (t)
Complexity = sum length of all single copy (unique) sequence in a genome
There are different classes of eukaryotic DNA based on sequence complexity revealed by re-association kinetics
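The Cot curves behind these classes follow ideal second-order kinetics, a standard textbook relation that the slides do not spell out: the fraction of DNA still single-stranded is C/C0 = 1 / (1 + k x C0 x t), so half of the DNA has re-associated at Cot1/2 = 1/k. A minimal sketch, assuming made-up rate constants (the k values below are illustrative, not measured):

```python
def fraction_single_stranded(cot, k):
    """Ideal second-order re-association: C/C0 = 1 / (1 + k * Cot)."""
    return 1.0 / (1.0 + k * cot)


def cot_half(k):
    """Cot value at which half of the DNA has re-associated."""
    return 1.0 / k


# Hypothetical rate constants, chosen only for illustration.
# Highly repeated sequences re-associate fastest, so they get the largest k.
classes = {
    "highly repetitive": 1e3,
    "moderately repetitive": 1.0,
    "single copy": 1e-3,
}

for name, k in classes.items():
    remaining = fraction_single_stranded(1.0, k)
    print(f"{name}: Cot1/2 = {cot_half(k):g}, "
          f"single-stranded fraction at Cot = 1: {remaining:.3f}")
```

Because k falls as sequence complexity rises, the single-copy class has the largest Cot1/2 and appears furthest to the right on a Cot curve.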
A time line of genomics research
Status of plant genome sequences as of January 2006:
The human genome

Two versions of the human genome sequence were published in February 2001.
DNA sequences that encode proteins make up only 5% of the genome.
~50% of the sequence is transposable elements.
Clusters of gene-rich regions are separated by gene deserts.
CH 19 has the highest gene density; CH 13 & Y show the lowest gene density.
The human genome

Gene total estimated at 30,000-40,000 (now maybe 25,000), with an average gene size of 27 kb.
Hundreds of genes share homology with those of bacteria.
The number of introns varies greatly (from 0 for histone genes to 234 for titin).
The human genome

Genes are larger and contain more and larger introns compared to those in invertebrates (the dystrophin gene is 2.5 Mb).
Genes are not evenly spaced on CHs.
The most common genes include those involved in nucleic acid metabolism (7.5%), receptors (5%), protein kinases (2.8%) and cytoskeletal structural proteins (2.8%).
The human genome - predicted gene function
Any 2 human genomes are roughly 99.9% identical
Chr - chromosome
n - number of samples examined
bp - number of base pairs sequenced
S - number of polymorphic sites
T - nucleotide divergence
On average ~ 0.1%
Przeworski, M., et al. (2000) Trends Genet 16, 296-302.
Yet phenotypic differences abound!
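To put the ~0.1% figure in perspective, a back-of-the-envelope sketch. The haploid genome size of about 3.2 billion bp is an assumption brought in here, not a number from the slides:

```python
HAPLOID_GENOME_BP = 3_200_000_000  # assumed approximate human haploid genome size
DIVERGENCE = 0.001                 # ~0.1% average difference between two genomes


def expected_differences(genome_bp, divergence):
    """Expected number of differing sites between two genomes."""
    return genome_bp * divergence


diffs = expected_differences(HAPLOID_GENOME_BP, DIVERGENCE)
print(f"~{diffs:,.0f} differing sites, about 1 every {1 / DIVERGENCE:,.0f} bp")
```

Under this assumption, two genomes that are 99.9% identical still differ at roughly three million positions.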
Genome organization in plants

Size of genome varies widely (100 Mb-5,500 Mb).
Many tandem gene duplications and larger duplications; some interchromosomal duplications also observed.
Large-genome plants also have genes clustered, with long stretches of intergenic DNA.
In maize, the intergenic sequences are composed mainly of transposons.
Genome Organization - gene identification
Genes can be difficult to identify/predict based on genome sequence.
The human genome appears to contain fewer genes than originally predicted, but an estimated 35,000 genes produce an estimated 150,000 proteins.
Genome Organization - gene identification
No one-to-one correspondence between:
Genome (all genes of an organism)
Transcriptome (all transcripts of an organism)
Proteome (all proteins of an organism)
Variable estimates of human gene content
Gene identification - the simple view
Gene identification - the challenges
from Klug & Cummings 1997
Gene identification - the challenges
Non-coding sequences:
Promoters and enhancers of gene expression can be distant from the coding region itself.
Genes can have alternative promoters.
Genes can have alternative terminators.

Introns and exons:
Most eukaryotic genes have introns.
Introns are often much longer than exons.
Often many introns, so the mRNA is much shorter than the genomic DNA.
Intron size can vary between the same gene in different species.
Splice junctions are difficult to predict.
Alternative splicing.

Introns and exons
Splicing (eukaryotes only): removal of internal parts of the newly transcribed RNA.
Takes place in the cell nucleus.
Introns and exons
Introns numerous & longer than exons
Variable intron size - same gene, different organism
Introns - alternative splicing

Different splice patterns from the same sequence, therefore different products from the same gene.
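A toy sketch of how one gene can yield several products, assuming the simplest possible model: constitutive exons are always kept and each cassette exon is independently included or skipped. The exon names and sequences are invented for illustration; real splicing is regulated and far less freely combinatorial.

```python
from itertools import product

# Hypothetical gene: (exon name, sequence, is_constitutive)
exons = [
    ("E1", "ATGGCC", True),
    ("E2", "AAGCTG", False),  # cassette exon: may be skipped
    ("E3", "GATTCA", True),
    ("E4", "GGGTTT", False),  # cassette exon: may be skipped
    ("E5", "GTATAA", True),
]


def splice_isoforms(exons):
    """Return {isoform name: mRNA sequence} for every cassette-exon choice."""
    cassette = [i for i, (_, _, const) in enumerate(exons) if not const]
    isoforms = {}
    for choice in product([True, False], repeat=len(cassette)):
        skipped = {i for i, keep in zip(cassette, choice) if not keep}
        kept = [e for i, e in enumerate(exons) if i not in skipped]
        name = "-".join(n for n, _, _ in kept)
        isoforms[name] = "".join(seq for _, seq, _ in kept)
    return isoforms


for name, mrna in splice_isoforms(exons).items():
    print(name, mrna)
```

With two cassette exons this model gives four mRNAs from a single gene; each additional optional exon doubles the count.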
One gene, many proteins - alternative splicing & 3' cleavage
Exon shuffling
Different genes having similar exons
Why genome, transcriptome and proteome don't correlate in size

More sophisticated regulation of expression.
Proteome vastly larger than genome:
Alternative splicing, promoters, terminators (59% of genes, with an average of 3 different products)
RNA editing
Post-translational modifications
Moonlighting
Same protein, different function depending on cellular location
Gene Identification
Open reading frames
Sequence conservation: database searches, synteny
Sequence features: CpG islands
Evidence for transcription: ESTs, microarrays
Gene inactivation: transformation, TEs, RNAi
Gene identification - Open reading frames

5'atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa

frame 1
atg ccc aag ctg aat agc gta gag ggg ttt tca tca ttt gag gac gat gta taa
 M   P   K   L   N   S   V   E   G   F   S   S   F   E   D   D   V   *

frame 2
tgc cca agc tga ata gcg tag agg ggt ttt cat cat ttg agg acg atg tat
 C   P   S   *   I   A   *   R   G   F   H   H   L   R   T   M   Y
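The frame-by-frame reading of this sequence can be reproduced in a few lines. A minimal sketch, assuming the standard genetic code and forward frames only (a real ORF finder would also scan the reverse strand and apply a minimum length); the helper name `translate` is chosen here for illustration:

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=1):
    """Translate one forward reading frame (1, 2 or 3); '*' marks a stop codon."""
    seq = dna.upper().replace(" ", "")
    start = frame - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


seq = "atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa"
for frame in (1, 2, 3):
    print(f"frame {frame}: {translate(seq, frame)}")
```

Frame 1 reads through without interruption from the start codon to the final stop, while frame 2 hits stop codons early, which is the signal an open-reading-frame search relies on.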
Gene identification - Database searches

[Figure: multiple sequence alignment of homologous protein sequences retrieved by database search; the sequence labels and residue columns are not recoverable from the extracted text]
Gene identification - Synteny

Mouse-human synteny and sequence conservation
A) Blocks of synteny between mouse chromosome 11 and parts of 5 different human chromosomes.
B) Enlarged block with perfect correspondence in order, orientation and spacing of 23 putative genes, and 245 conserved sequence blocks of > 100 bp with > 70% identity, many in noncoding regions.
Caution! Even regions of high synteny may not show perfect gene-for-gene correspondence.
from Gibson & Muse (2002) A Primer of Genome Science, Sinauer Inc.
Gene identification - CpG islands

Defined as regions of DNA of at least 200 bp in length that have a G+C content above 50% and a ratio of observed vs. expected CpGs close to or above 0.6.
Used to help predict gene sequences, especially promoter regions.
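The definition translates directly into a calculation. A minimal sketch, assuming the commonly used observed/expected formula (CpG count x N) / (C count x G count) for a window of N bases; the function names and the hard >= 0.6 cut-off (the slide says "close to or above") are choices made here for illustration:

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA sequence."""
    s = seq.upper()
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: (C * G) / N
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three thresholds from the definition to one window."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp >= min_ratio


print(looks_like_cpg_island("CG" * 100))  # artificial CpG-dense window
print(looks_like_cpg_island("AT" * 100))  # artificial AT-only window
```

In practice the test is applied to sliding windows along a genomic sequence, and passing windows are merged into candidate islands.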
Gene identification - evidence of transcription

Sequencing libraries of cDNA clones yields expressed sequence tags (ESTs), which are not necessarily full-length.
Genome Organization - duplicated genes
Gene families: paralogs, orthologs (homologs)
Pseudogenes
Duplicated genes
Paralogs = evolved one from another through gene duplication.
Encode closely related proteins.
Formed by duplication of an ancestral gene followed by mutation.
Five functional genes and two pseudogenes
Pseudogenes
Nonfunctional copies of genes.
Formed by duplication of an ancestral gene, or by reverse transcription and integration of the cDNA.
Not expressed due to mutations that produce a premature stop codon (nonsense) or a frameshift, or mutations that prevent mRNA transcription or processing.
Duplicated genes
Can be clustered, as in a globin gene cluster, or dispersed in the genome, as seen for the entire globin family in humans.
Duplicated genes
Paralogs vs orthologs (or homologs)
Different members of the globin gene family are paralogs, having evolved one from another through gene duplication. Paralogs are separated by a gene duplication event.
Each specific family member (e.g. a given human globin gene) is an ortholog (homolog) of the same family member in another species. Both evolved from an ancestral globin gene. Orthologs (homologs) are separated by a speciation event.
It is not always easy to distinguish true orthologs from paralogs when comparing large multigene families between species, especially in polyploid organisms!
Genome Organization - transcripts that do not encode proteins (ncRNA)

< 5% of the higher eukaryotic genome is protein coding.
~97-98% of the transcriptional output of the human genome is ncRNA (including introns).
Transfer RNAs (tRNA)
~500 tRNA genes in the human genome
Ribosomal RNAs
Tandem arrays on several chromosomes
150-200 copies of the 28S-5.8S-18S cluster
200-300 copies of the 5S cluster
Genome Organization - ncRNA

~97-98% of the transcriptional output of the human genome is ncRNA
Small nucleolar RNAs (snoRNAs): single genes; modify rRNAs
Small nuclear RNAs (snRNAs): spliceosomes
Small regulatory RNAs: micro RNAs (miRNA), short interfering RNAs (siRNA)
Participate in transcriptional and nontranscriptional gene silencing, and regulation of translation
Many come from intergenic regions recently recognized as transcribed
Genome Organization - ncRNA

~97-98% of the transcriptional output of the human genome is ncRNA
Longer regulatory RNAs
ncRNAs derived from introns of protein-coding genes and from introns and exons of non-protein-coding genes may constitute the majority of the genomic programming in higher organisms.
This may explain why very different organisms show little difference in protein-coding sequence.
Genome Organization - repetitive DNA
~50% of the human genome
Moderately repeated DNA:
Tandemly repeated rRNA, tRNA and histone genes (gene products needed in high amounts)
Large duplicated gene families
Mobile DNA
Segmental duplications
Repetitive DNA - Segmental duplications

Found especially around centromeres and telomeres
Often come from nonhomologous chromosomes
Many can come from the same source
Tend to be large (10 to 50 kb)
Unique to humans?
Repetitive DNA - Segmental duplications
Repetitive DNA - Transposon-derived repeats

Most of the moderately repeated DNA sequences found throughout higher eukaryotic genomes (45% of human genome)
Some encode enzymes that catalyze movement
Long interspersed elements (LINE): retrotransposons
Short interspersed elements (SINE): retrotransposons
LTR (long terminal repeat) retrotransposons
DNA transposons
Repetitive DNA - Transposon-derived repeats
Repetitive DNA - Transposon-derived repeats
Different regions of the genome differ in density of repeats.
Most LINEs accumulate in AT-rich regions.
Alu elements accumulate in GC-rich regions.
Genome Organization - repetitive DNA
Simple-sequence repeats (3% of genome):
Highly repeated short sequences found in centromeres and telomeres
Variable numbers of tandem repeats (VNTR) dispersed throughout the genome
Repetitive DNA - Highly repetitive satellite DNA
Repetitive DNA - VNTRs

Dispersed throughout the genome
1-13 base repeat unit: microsatellite, SSR; includes trinucleotide repeats in protein-coding genes
14-500 repeats: minisatellites
Used as mapping and fingerprinting markers
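Short tandem repeats of this kind are easy to locate with a backreference pattern. A minimal sketch, assuming a repeat unit of 1 to 13 bases and an arbitrary minimum of 4 tandem copies (that threshold and the function name are illustrative choices, not part of the slides):

```python
import re


def find_microsatellites(seq, max_unit=13, min_copies=4):
    """Return (start, unit, copies) for each tandem repeat found in seq."""
    # Lazy unit match so the shortest repeating unit is reported,
    # followed by at least (min_copies - 1) further copies of that unit.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


example = "TTGA" + "CAG" * 5 + "TT"  # invented sequence with a CAG repeat
print(find_microsatellites(example))
```

The copy number reported at a given locus is what varies between individuals, which is why such repeats work as mapping and fingerprinting markers.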
Overview of human genome composition
