
The Human Genome Project

Babak Nami
Department of Medical Genetics, Selçuk University

Human Genome

The human genome is the genome of Homo sapiens, which is stored on 23 chromosome pairs. 22 of these are autosomal chromosome pairs, while the remaining pair is sex-determining. The haploid human genome occupies a total of just over 3 billion DNA base pairs. It contains ca. 23,000 protein-coding genes, far fewer than had been expected before its sequencing. In fact, only about 1.5% of the genome codes for proteins, while the rest consists of non-coding RNA genes, regulatory sequences, introns, and noncoding DNA (once known as "junk DNA").

Human Genome

Information content of the haploid human genome by chromosome: Haploid means we only count one of each chromosome pair. For this reason, the total information content for a woman (XX) is less than for a man (XY), where both the X and the Y are counted.

How much data make up the human genome?


3 Bookcases with 40 Books per bookcase x 5000 pages per book x 5000 bases per page = 3,000,000,000 bases!
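The bookcase arithmetic above can be checked in a few lines. This is only a sketch; the 2-bits-per-base storage estimate is an added assumption (four possible bases), not a figure from the slides.

```python
# Bookcase arithmetic from the slide
bookcases = 3
books_per_bookcase = 40
pages_per_book = 5000
bases_per_page = 5000

total_bases = bookcases * books_per_bookcase * pages_per_book * bases_per_page

# Assumption (not in the slides): each base (A, C, G or T) needs 2 bits
BITS_PER_BASE = 2
total_bytes = total_bases * BITS_PER_BASE // 8

print(f"{total_bases:,} bases")
print(f"about {total_bytes / 1_000_000:.0f} MB at {BITS_PER_BASE} bits per base")
```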

Human Genome Project

The Human Genome Project (HGP) is an international scientific research project whose primary goals are to determine the sequence of chemical base pairs that make up DNA and to identify and map the approximately 20,000-30,000 genes of the human genome from both a physical and a functional standpoint.

History

1985. Proposed.
1988. Initiated and funded by NIH and US Dept. of Energy ($3 billion set aside).
1990. Work begins.
1998. Celera announces a 3-year plan to complete the project early.
2001. Published in Science and Nature in February.
2002. The quest for genome sequencing was being pursued simultaneously in over 20 laboratories in six countries.
2003. The whole genome sequenced.

Initiative Office of HGP

History

Human Genome Project Goals and Completion Dates


1994: Genetic map, 1 cM resolution (3,000 markers)
1998: Physical map (3,000 markers)
2003: DNA sequence (99% of gene-containing part of sequence)
2003: 3 million mapped SNPs
2003: 15,000 full-length cDNAs

NIH put the human genome sequence on the web July 7, 2000

Cyber geeks searched for hidden messages, and GATTACA

UCSC put the human genome sequence on CD in October 2000, with varying results

The Completion of the Human Genome Sequence

June 2000: White House announcement that the majority of the human genome (80%) had been sequenced (working draft).
July 2000: Working draft made available on the web at genome.ucsc.edu.
February 2001: Publication of 90 percent of the sequence in the journal Nature.
July 2003: Completion of 99.99% of the genome as finished sequence.

The first printout of the human genome to be presented as a series of books, displayed at the Wellcome Collection, London

Goals of the Project


Identify all the approximately 30,000 genes in human DNA.
Determine the sequences of the 3 billion chemical base pairs that make up human DNA.
Store this information in databases.
Improve tools for data analysis.
Transfer related technologies to the private sector.
Address the ethical, legal, and social issues (ELSI) that may arise from the project.

In Medicine
Improvements in diagnostic and therapeutic applications.
Implementation of preventative measures.
Increases in gene therapy applications.

In Biotechnology
Production of useful protein products for use in medicine, agriculture, bioremediation and pharmaceutical industries.
Antibiotics
Protein replacement (factor VIII, TPA, streptokinase, insulin, interferon)
BT insecticide toxin (from Bacillus thuringiensis)
Herbicide resistance (glyphosate resistance)
Bioengineered foods
Pharm animals

In Bioinformatics
The newest, fastest growing specialty in the life sciences, integrating biotechnology and computer science. It is involved in DNA sequence assembly and analysis, using computer-based techniques to determine gene function, regulation and control. Unknown gene sequences can be compared to databases of known genes, so that similarities can lead to the determination of an unknown gene's function.
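As a rough illustration of comparing an unknown sequence against a database of known genes, the sketch below scores similarity by shared k-mers (short substrings). It is a toy stand-in and not the method used by real alignment tools; the gene names and sequences are hypothetical.

```python
def kmers(sequence, k=4):
    """Return the set of all length-k substrings of a DNA sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def jaccard(a, b):
    """Shared k-mers divided by all distinct k-mers found in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(query, database, k=4):
    """Return (name, score) of the database entry most similar to the query."""
    query_kmers = kmers(query, k)
    scores = {name: jaccard(query_kmers, kmers(seq, k))
              for name, seq in database.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]

# Hypothetical toy database: names and sequences are made up for illustration
known_genes = {
    "gene_A": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
    "gene_B": "ATGTTTCCCAAAGGGTTTCCCAAAGGGTTTAAATGA",
}
unknown = "GCCATTGTAATGGGCCGCTGAAAGG"  # a fragment taken from gene_A

print(best_match(unknown, known_genes))
```

A function assigned this way is only a hypothesis: sequence similarity suggests, but does not prove, shared function.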

Proteomics
Investigates patterns and levels of gene expression in diseased cells that can be analyzed to build databases of expression profiles.

DNA Chip Technology

In Pharmacogenomics
Investigates DNA mutations associated with disease susceptibility and drug sensitivities.
Monoclonal antibodies
Prodrug gene therapy for cancers

In Developmental Biology
Regulation of embryonic development.
Regulation of the aging process.
Regulation of cell division and apoptosis.
Regulation of metabolism.

Evolutionary and Comparative Biologists


Because DNA accumulates mutations at a roughly constant rate, comparisons of DNA between different organisms can be used to reconstruct evolutionary histories.
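A minimal sketch of that idea, assuming a strict molecular clock: if two aligned sequences differ at a fraction d of sites and each lineage changes at rate r per site per year, the time since their common ancestor is roughly T = d / (2r). The sequences and the rate below are hypothetical, and the raw difference count ignores repeated changes at the same site.

```python
def hamming_fraction(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)

def divergence_time(distance, rate_per_site_per_year):
    """Years since the common ancestor, T = d / (2r).

    The factor 2 reflects that both lineages accumulate changes independently.
    """
    return distance / (2 * rate_per_site_per_year)

# Hypothetical aligned sequences and a hypothetical substitution rate
species_1 = "ACGTACGTACGTACGTACGT"
species_2 = "ACGTACGAACGTACCTACGT"

d = hamming_fraction(species_1, species_2)   # 2 differences in 20 sites
t = divergence_time(d, rate_per_site_per_year=1e-9)
print(f"distance = {d:.2f}, estimated divergence = {t:.2e} years")
```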

Human Genome Sequence Variation

Develop technologies for rapid, large-scale identification and scoring of single-nucleotide polymorphisms and other DNA sequence variants.
Identify common variants in the coding regions of the majority of identified genes during this 5-year period.
Create a SNP map of at least 100,000 markers.
Develop the intellectual foundations for studies of sequence variation.
Create public resources of DNA samples and cell lines.

Model organisms
Bacteria (E. coli, H. influenzae, several others)
Yeast (Saccharomyces cerevisiae)
Plant (Arabidopsis thaliana)
Roundworm (Caenorhabditis elegans)
Fruit fly (Drosophila melanogaster)
Mouse (Mus musculus)

How does the human genome stack up?


Organism                             Genome Size (Bases)   Estimated Genes
Human (Homo sapiens)                 3 billion             30,000
Laboratory mouse (M. musculus)       2.6 billion           30,000
Mustard weed (A. thaliana)           100 million           25,000
Roundworm (C. elegans)               97 million            19,000
Fruit fly (D. melanogaster)          137 million           13,000
Yeast (S. cerevisiae)                12.1 million          6,000
Bacterium (E. coli)                  4.6 million           3,200
Human immunodeficiency virus (HIV)   9700                  9
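One way to read the comparison above is as gene density (genes per million bases). The sketch below uses only the genome sizes and gene estimates listed on the slide; the density calculation itself is an added illustration.

```python
# (organism, genome size in bases, estimated genes), as listed on the slide
genomes = [
    ("Human", 3_000_000_000, 30_000),
    ("Laboratory mouse", 2_600_000_000, 30_000),
    ("Mustard weed", 100_000_000, 25_000),
    ("Roundworm", 97_000_000, 19_000),
    ("Fruit fly", 137_000_000, 13_000),
    ("Yeast", 12_100_000, 6_000),
    ("E. coli", 4_600_000, 3_200),
    ("HIV", 9_700, 9),
]

def genes_per_megabase(size_in_bases, gene_count):
    """Estimated genes per million bases of genome."""
    return gene_count / (size_in_bases / 1_000_000)

density = {name: genes_per_megabase(size, genes) for name, size, genes in genomes}

# Print from the sparsest genome to the densest
for name, value in sorted(density.items(), key=lambda item: item[1]):
    print(f"{name:18s} {value:8.1f} genes per Mb")
```

On these figures the human genome has the lowest gene density in the list, which fits the later point that most of it does not code for proteins.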


Practical Goals

A primary goal of the Human Genome Project is to make a series of descriptive diagrams (maps) of each human chromosome at increasingly finer resolutions. After mapping is completed, the next step is to determine the base sequence of each of the ordered DNA fragments. The ultimate goal of genome research is to find all the genes in the DNA sequence and to develop tools for using this information in the study of human biology and medicine.

Practical Goals

http://www.genome.gov/Pages/News/PaceofDiseaseGeneDiscovery.pdf

How they did it

DNA from 5 humans: 2 males, 3 females
Cut up DNA with restriction enzymes
Ligated into BACs & YACs, then grew them up
Sequenced the BACs
Let a supercomputer put the pieces together

Sequencing Strategy

Once a contig map of the genome was obtained, it was necessary to sequence each individual clone. Most of the actual human genome sequencing was done on BAC clones, which are less prone to rearrangement than YAC clones. BACs are about 100-200 kbp long. Large clones are generally sequenced by shotgun sequencing: the large cloned DNA is randomly broken up into a series of small fragments (less than 1 kb). These fragments are cloned and sequenced. A computer program then assembles them based on overlaps between the sequences of each clone. To ensure that every bit has been covered, you need to sequence random clones until you have covered each spot 5-10 times on average.
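The overlap-based assembly step can be sketched with a toy greedy assembler. This is a simplified illustration, not the software used by the project: it assumes error-free reads from one strand and a clone with no repeats, and the clone sequence is hypothetical. The coverage estimate uses the standard Poisson approximation (expected uncovered fraction of about e^-c at c-fold coverage), which is an added assumption rather than something stated in the slides.

```python
import math

def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0

def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read
    reads = [r for r in reads
             if not any(r != other and r in other for other in reads)]
    while len(reads) > 1:
        best = (0, None, None)
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    length = overlap(left, right, min_length)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            break  # nothing left to merge: remaining reads stay separate contigs
        merged = reads[i] + reads[j][length:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

def expected_uncovered_fraction(coverage):
    """Poisson approximation: chance that a given base is hit by no read."""
    return math.exp(-coverage)

# Hypothetical 20-base "clone" with no repeated 3-mers, cut into overlapping reads
clone = "ATGCGTACCAGTTGACTCAA"
reads = [clone[i:i + 8] for i in range(0, len(clone) - 7, 3)]

contigs = greedy_assemble(list(reversed(reads)))
print(contigs)
for c in (5, 10):
    print(f"{c}x coverage: about {expected_uncovered_fraction(c):.5f} of bases uncovered")
```

Real genomes contain repeats and sequencing errors, which is why a greedy merge like this is not sufficient in practice and why mapped BAC clones were used to anchor the assembly.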

Figure labels: DNA; cut segments inserted into BACs; lots of overlap; known sequence.

Sequencing: BAC-based method

Each clone 150,000-200,000 bp, cloned in bacteria
20,000 BAC clones (BAC library)
BAC clones mapped
Subclones of 2,000 bp
Sequenced 10 times in 500-800 bp segments
Subclone sequences re-assembled

Sequencing Technologies
The two basic sequencing approaches, Maxam-Gilbert and Sanger, differ primarily in the way the nested DNA fragments are produced. Maxam-Gilbert sequencing (also called the chemical degradation method) uses chemicals to cleave DNA at specific bases, resulting in fragments of different lengths. A refinement to the Maxam-Gilbert method known as multiplex sequencing enables investigators to analyze about 40 clones on a single DNA sequencing gel. Sanger sequencing (also called the chain termination or dideoxy method) involves using an enzymatic procedure to synthesize DNA chains of varying length in four different reactions, stopping the DNA replication at positions occupied by one of the four bases, and then determining the resulting fragment lengths.
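The read-out logic of the chain termination method can be sketched as follows: each of the four reactions yields chains ending at one particular base, and ordering all chains by length recovers the sequence. This is an idealized illustration of the newly synthesized strand only (no chemistry, no gel artifacts); the function names are hypothetical.

```python
def chain_termination_fragments(synthesized_strand):
    """For each base, the lengths of the chains that terminate at that base.

    One list per reaction (A, C, G, T), as in the four-reaction setup.
    """
    fragments = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized_strand.upper(), start=1):
        fragments[base].append(length)
    return fragments

def read_gel(fragments):
    """Recover the sequence by ordering all chains from shortest to longest."""
    lanes = sorted((length, base)
                   for base, lengths in fragments.items()
                   for length in lengths)
    return "".join(base for _, base in lanes)

fragments = chain_termination_fragments("GATTACA")
print(fragments)
print(read_gel(fragments))
```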

Advanced Techniques
SOLiD Sequencing
Helicos High speed Gene Sequencing
Laser Sequencing

How the Code was Decoded

DoubleTwist Inc., an application service provider (ASP) devoted to empowering life scientists, completed the first annotation of the human genome. The DoubleTwist human genome database was created using Sun Enterprise 420R and 10K supercomputers, with a total of more than 350 processors. It brought to a close an extensive analysis of the available HGP data that revealed genes and other valuable information. The task was accomplished using Sun Enterprise supercomputers, including Starfire servers.

Genome Map

A genome map describes the order of genes or other markers and the spacing between them on each chromosome. Human genome maps are constructed on several different scales or levels of resolution.

Genetic Map

Genetic linkage maps of each chromosome are made by determining how frequently two markers are passed together from parent to child. Because genetic material is sometimes exchanged during the production of sperm and egg cells, groups of traits (or markers) originally together on one chromosome may not be inherited together.
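As a small worked example of how linkage is quantified: the recombination fraction is the share of offspring in which two markers were separated, and for small values 1% recombination corresponds to about 1 centimorgan (cM). The offspring counts below are hypothetical.

```python
def recombination_fraction(recombinant_offspring, total_offspring):
    """Share of offspring in which the two markers were not inherited together."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return recombinant_offspring / total_offspring

def map_distance_cm(fraction):
    """Approximate map distance: 1% recombination is about 1 cM.

    Only reasonable for small fractions, where double crossovers are rare.
    """
    return 100 * fraction

# Hypothetical counts: 12 recombinants among 200 scored offspring
theta = recombination_fraction(12, 200)
print(f"recombination fraction = {theta:.2f}, about {map_distance_cm(theta):.0f} cM")
```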

Physical Maps

Different types of physical maps vary in their degree of resolution. The lowest-resolution physical map is the chromosomal (sometimes called cytogenetic) map, which is based on the distinctive banding patterns observed by light microscopy of stained chromosomes. A cDNA map shows the locations of expressed DNA regions (exons) on the chromosomal map. The more detailed cosmid contig map depicts the order of overlapping DNA fragments spanning the genome. A macrorestriction map describes the order and distance between enzyme cutting (cleavage) sites. The highest-resolution physical map is the complete elucidation of the DNA base-pair sequence of each chromosome in the human genome. Physical maps are described in greater detail below.

. . . to a multi-resolution view . . .

. . . at the gene cluster level . . .

. . . the single gene level . . .

. . . and at the single base level

caggcggactcagtggatctggccagctgtgacttgacaag
caggcggactcagtggatctagccagctgtgacttgacaag
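The two sequences above differ at a single base. A minimal sketch of how such a difference can be located, assuming the sequences are already aligned and of equal length (the function name is hypothetical):

```python
def single_base_differences(seq_a, seq_b):
    """List (position, base_a, base_b) for every mismatch; positions are 1-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(position, a, b)
            for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
            if a != b]

first = "caggcggactcagtggatctggccagctgtgacttgacaag"
second = "caggcggactcagtggatctagccagctgtgacttgacaag"

print(single_base_differences(first, second))
```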

The linkage map

The map was built by linkage studies in 60 large families with grandparents and large numbers of children, collected by the University of Utah and the Centre d'Étude du Polymorphisme Humain (CEPH), Paris.
Families were typed with over 5000 polymorphic DNA sequences: 60% were microsatellite repeats (mostly dinucleotide (CA) repeats, also some tri- and tetranucleotides). Only about 400 of them were actual genes.
Construction of the linkage map is a very big problem; sophisticated software was used to work out the "best fit" map of all the markers, with advanced statistical methods and algorithms.

STSs and ESTs

Sequence tagged sites (STSs) are specific loci in the genome, for which enough DNA sequence is available to make PCR primers to amplify the locus (usually as a fragment of a few hundred bp). These include microsatellites (e.g. CA repeats) that can be used for linkage studies. The information required to use an STS is just the sequences of the PCR primers; therefore it is very easy to make databases of STSs that can be used by anyone. No actual bits of DNA need change hands. This is crucial in allowing genome projects to proceed as international collaborations, with many laboratories participating in a co-ordinated way.

Expressed sequence tags (ESTs) act as specific tags for each human gene, since they are derived by sequencing cDNA clones which came from mRNA and therefore represent the actual transcribed sequences (as opposed to STSs, which can be derived from anywhere in the genome and are mostly non-coding). They allow rapid access to the actual genes, ignoring introns and junk DNA.

ESTs can be 3' or 5' depending on which end of the cDNA was sequenced. Because of the methods used to make cDNA libraries, parts of the 5' end of the gene are often lost during cloning, whereas the 3' end is more reliable. Therefore, the same gene may give different 5' ESTs and it will be difficult to deduce whether they have come from the same gene. This is shown on the diagram by the white boxes representing cDNA clones being different lengths. Another complication is due to

X-ray hybrid mapping


X-ray hybrids are made by irradiating a human cell line with 3000 rad of X-rays, fusing the cells to hamster cells, and isolating hybrid cell lines in culture.
A panel of 100-200 hybrids with 5-10 different fragments of human DNA in each gives about 1000 fragments in total, i.e. the human genome has been divided into 1000 bits.
The closer together 2 markers are in the genome, the more likely it is that they will be present in the same hybrids (since they are less likely to be separated by an X-ray induced break).
By doing a PCR assay for each marker on all the hybrids, a map can be made. The units are called cR (centiray, where 1 cR is a 1% chance that the markers will be separated by X-ray breakage).

For each pair of markers in turn the "co-retention frequency" is the number of hybrids in which both markers are present, divided by the number of hybrids in which one or other (or both) markers are present. On the figure, there are 5 hybrids containing both markers B and C, and 6 containing B and/or C. Therefore the co-retention frequency is 5/6 or 0.83. Likewise it is 6/7 for markers E and F, and 2/10 for markers C and E. This shows that B and C are close together, E and F are close
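The co-retention calculation can be sketched as below. The hybrid panel is hypothetical (the original figure is not reproduced here); it is simply constructed so that markers B and C give the 5/6 value quoted in the text.

```python
def co_retention_frequency(hybrids, marker_1, marker_2):
    """Hybrids containing both markers divided by hybrids containing either one."""
    both = sum(marker_1 in h and marker_2 in h for h in hybrids)
    either = sum(marker_1 in h or marker_2 in h for h in hybrids)
    return both / either if either else 0.0

# Hypothetical panel: each set lists the markers detected by PCR in one hybrid
panel = [
    {"B", "C"}, {"B", "C"}, {"B", "C"}, {"B", "C"}, {"B", "C"},
    {"B"},   # a break separated B from C in this hybrid
    set(),   # neither marker retained
    set(),
]

frequency = co_retention_frequency(panel, "B", "C")
print(f"co-retention frequency for B and C = {frequency:.2f}")
```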

Clone contigs
A clone contig is a series of cloned DNA segments that overlap each other, assembled in the correct order along the genome. The clones are made using vectors:

cosmids (capacity 45 kb)
BACs or YACs (Bacterial or Yeast Artificial Chromosomes), which can clone 100s of kb of DNA and are more suitable for dealing with large stretches of mammalian DNA.

Making a clone contig by fingerprinting

What does the draft human genome sequence tell us?


By the Numbers
The human genome contains 3 billion chemical nucleotide bases (A, C, T, and G). The average gene consists of 3000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases. The total number of genes is estimated at around 30,000--much lower than previous estimates of 80,000 to 140,000. Almost all (99.9%) nucleotide bases are exactly the same in all people. The functions are unknown for over 50% of discovered genes.

What does the draft human genome sequence tell us?


How It's Arranged
The human genome's gene-dense "urban centers" are predominantly composed of the DNA building blocks G and C. In contrast, the gene-poor "deserts" are rich in the DNA building blocks A and T. GC- and AT-rich regions usually can be seen through a microscope as light and dark bands on chromosomes. Genes appear to be concentrated in random areas along the genome, with vast expanses of noncoding DNA between. Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming a barrier between the genes and the "junk DNA." These CpG islands are believed to help regulate gene activity. Chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).
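The GC-rich versus AT-rich contrast described above can be measured with a sliding window. This is a minimal sketch on a hypothetical sequence; the window and step sizes are arbitrary choices for illustration.

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)

def windowed_gc(sequence, window, step):
    """(start, GC fraction) for each window of the given size along the sequence."""
    return [
        (start, gc_content(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]

# Hypothetical sequence: a GC-rich stretch followed by an AT-rich stretch
example = "GCGCGGCCGCGC" + "ATATTAAATTAT"
profile = windowed_gc(example, window=12, step=12)
print(profile)
```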

What does the draft human genome sequence tell us?


The Wheat from the Chaff
Less than 2% of the genome codes for proteins. Repeated sequences that do not code for proteins ("junk DNA") make up at least 50% of the human genome. Repetitive sequences are thought to have no direct functions, but they shed light on chromosome structure and dynamics. The human genome has a much greater portion (50%) of repeat sequences than the mustard weed (11%), the worm (7%), and the fly (3%).

What does the draft human genome sequence tell us?


How the Human Compares with Other Organisms
Unlike the human's seemingly random distribution of gene-rich areas, many other organisms' genomes are more uniform, with genes evenly spaced throughout. Humans have on average three times as many kinds of proteins as the fly or worm because of mRNA transcript "alternative splicing" and chemical modifications to the proteins. This process can yield different protein products from the same gene. Humans share most of the same protein families with worms, flies, and plants; but the number of gene family members has expanded in humans, especially in proteins involved in development and immunity. Although humans appear to have stopped accumulating repeated DNA over 50 million years ago, there seems to be no such decline in rodents. This may account for some of the fundamental differences between hominids and rodents, although gene estimates are similar in these species. Scientists have proposed many theories to explain evolutionary contrasts between humans and other organisms, including those of life span, litter sizes, inbreeding, and genetic drift.

What does the draft human genome sequence tell us?


Variations and Mutations
Scientists have identified about 3 million locations where single-base DNA differences (SNPs) occur in humans. This information promises to revolutionize the processes of finding chromosomal locations for disease-associated sequences and tracing human history. The ratio of germline (sperm or egg cell) mutations is 2:1 in males vs females. Researchers point to several reasons for the higher mutation rate in the male germline, including the greater number of cell divisions required for sperm formation than for eggs.
