Russ Altman (in Bioinformatics), broad definition:
Bioinformatics is the study of how information technologies are used to solve problems in biology.
Russ Altman (in Bioinformatics), narrow definition:
Bioinformatics is the creation and management of biological databases in support of genomic sequences.
The BITS-journal:
Bioinformatics is a combination of Computer Science, Information Technology and Genetics to determine and analyse genetic information.
Bioinformatics programme at the University of Michigan (slightly modified):
Bioinformatics merges recent advances in molecular biology and genetics with advanced statistics and computer science technology. The goal is increased understanding of the complex web of interactions linking the individual components of a living cell to the integrated behavior of the entire organism.
Statisticians:
Bioinformatics is a collection of statistical methods for dealing with large biological data sets.
Norsk Regnesentral
Norwegian Computing Center
06/03/2001 Mette Langaas
What do we need to know in Molecular Genetics and Biochemistry?
• Cell
• Chromosomes
• DNA
• Gene
• Genome
The Cell
Copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt
Human Chromosomes
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/project/info.html
Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt
Example of DNA (tertiary structure)
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt
Genome - Chromosome - Gene
Genome: All the genetic material in the chromosomes of a particular organism; its size is generally given as its total number of base pairs. [Human genome: 3 × 10^9 bp; more than 99% of the human DNA sequences are the same across the population]
Chromosome: The self-replicating genetic structure of cells containing the cellular DNA that bears in its nucleotide sequence the linear array of genes. Eukaryotic genomes consist of a number of chromosomes whose DNA is associated with different kinds of proteins. [Human chromosome lengths range from 50 million to 263 million bp]
Gene: The fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule). [Human genes: 30 000?, average length 3000 bp]
What MORE do we need to know in Molecular Genetics and Biochemistry?
• What does the gene do?
The gene encodes a specific functional product (i.e., a protein or RNA molecule).
• Protein synthesis
• mRNA
• Amino acid
Proteins are built from amino acids
Protein: A large molecule composed of one or more chains of amino acids in a specific order; the order is determined by the base sequence of nucleotides in the gene coding for the protein. Proteins are required for the structure, function, and regulation of the body's cells, tissues, and organs, and each protein has unique functions. Examples are hormones, enzymes, and antibodies.
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
Protein synthesis: transcription and translation
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
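The statement that the order of amino acids is determined by the base sequence of the gene can be illustrated with a minimal Python sketch. It is not part of the original slides. The toy gene sequence and the function names are hypothetical, and the codon table is deliberately partial (only the codons used here are listed); splicing and other steps of real protein synthesis are ignored.

```python
# Partial codon table (mRNA codon -> amino acid); only the codons used
# in the toy example are listed. "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "GCC": "Ala",
    "UGG": "Trp",
    "AAA": "Lys",
    "UUU": "Phe",
    "UAA": "*",
}


def transcribe(coding_strand: str) -> str:
    """Transcription: the mRNA equals the DNA coding strand with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Translation: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]  # KeyError for codons outside the partial table
        if residue == "*":
            break
        protein.append(residue)
    return protein


gene = "ATGGCCTGGAAATTTTAA"  # hypothetical toy sequence, not a real gene
mrna = transcribe(gene)
peptide = translate(mrna)
print(mrna)               # AUGGCCUGGAAAUUUUAA
print("-".join(peptide))  # Met-Ala-Trp-Lys-Phe
```

Changing a single base in `gene` changes the corresponding codon, and with it possibly the amino acid at that position.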
The Human Genome Project
• Begun formally in 1990, planned to be completed in 2003.
• U.S. Human Genome Project is coordinated by the U.S. Department of Energy and the National Institutes of Health.
• Project goals are
– to identify all the approximately 50,000(?) genes in human DNA,
– determine the sequences of the 3 billion chemical bases that make up human DNA,
– store this information in databases,
– develop faster, more efficient sequencing technologies,
– develop tools for data analysis, and
– address the ethical, legal, and social issues that may arise from the project.
Results by now:
• Draft of entire genome (June 2000)
• 9711 mapped genes (February 4, 2001)
• New estimate: 30 000 genes (February, 2001)
Research questions in Bioinformatics
Data Management:
– databases, searchable, compare.
Biological sequence alignment:
– compare two DNA sequences (HMM).
Pharmacogenetics:
– how genetic differences influence the variability in patients' responses to drugs.
Proteomics:
– which proteins are present in a cell and which proteins interact with each other.
Structural genomics:
– determine the (3D) structure of the proteins encoded by a genome.
Comparative genomics:
– the function of human genes and other DNA regions is often revealed by studying their parallels in nonhumans (mice and men...)
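To make the sequence alignment question concrete, here is a small Python sketch of comparing two DNA sequences. The slides mention HMMs; this sketch swaps in the simpler Needleman-Wunsch dynamic programme for global alignment. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch score of the best global alignment of a and b."""
    # prev[j] is the best score for aligning the prefix a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap in b
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7: seven matches
print(global_alignment_score("ACGT", "AGT"))         # 2: three matches, one gap
```

Only the score is returned; recovering the alignment itself would need a traceback through the full table, which is omitted to keep the sketch short.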
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
http://www.ornl.gov/hgmis/publicat/glossary.html
cDNA microarray experiment
Figure: cDNA clones (probes), 0.1 nl/spot; excitation by laser 1 and laser 2; scanning; analysis.
Copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
DNA microarray applications
• Human disease diagnostics and treatment
– determination of predisposition and risk factors with respect to certain diseases
– prediction of risk factors involved using certain treatment schemes
– monitor disease stage and treatment progress
• Agricultural diagnostics and development
– identify plant pathogens to allow suitable plant protection to be improved
– efficacy and economy in plant biotechnology
• Analysis of food and genetically modified organisms (GMO)
– determine the integrity of food
– detect alterations and contaminations
– quantify GMOs
• Drug discovery and drug development
Statistical methods for analysing gene expression data
TASK                                                          METHOD
design the experiment                                         experimental design
normalize within array and between array                      analysis of variance
find similar groups of samples and/or genes                   clustering
discriminate between two or more groups of samples            discrimination and classification
classify a new sample to one of many groups                   discrimination and classification
  (or compute group probability)
find genes that are differentially expressed                  multiple testing, feature extraction
  (all samples or subsets)
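The pairing of "find genes that are differentially expressed" with multiple testing can be illustrated with a short Python sketch. It is not from the original slides. It uses the Bonferroni correction as one simple example of a multiple-testing adjustment; the gene names and p-values are made up for illustration.

```python
def bonferroni(p_values: dict[str, float]) -> dict[str, float]:
    """Bonferroni-adjusted p-values: min(1, m * p) for m simultaneous tests."""
    m = len(p_values)
    return {gene: min(1.0, m * p) for gene, p in p_values.items()}


# Hypothetical per-gene p-values from some test of differential expression.
raw = {"gene_a": 0.0001, "gene_b": 0.004, "gene_c": 0.03, "gene_d": 0.2}
alpha = 0.05

adjusted = bonferroni(raw)
unadjusted_hits = [g for g, p in raw.items() if p <= alpha]
significant = [g for g, p in adjusted.items() if p <= alpha]

print(unadjusted_hits)  # ['gene_a', 'gene_b', 'gene_c']
print(significant)      # ['gene_a', 'gene_b']
```

With thousands of genes on an array, testing each gene at level 0.05 without adjustment would be expected to flag many genes by chance alone, which is why the adjustment matters here. Bonferroni is conservative; other procedures exist.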
Experimental design and ANOVA (cont'd.)
Reference design:
• same reference variety on each array (variety not of interest)
• the most popular design.
• VG is completely confounded with DG.
• No degrees of freedom left for error estimation.
• Use when not enough tissue to dye twice.
Clustering
Aim: partition genes or samples into groups so that the groups are homogeneous and well-separated.
Data: {X_gi} for g=1,...,#genes and i=1,...,#arrays.
Results:
– Find groups of genes that are co-regulated
– Find subgroups (previously unknown) of diseases
Clustering: methods for analysing gene expression data
One-way clustering:
– Hierarchical clustering
– Self-organizing maps (SOM) [Kohonen]
– K-means
– SVD-based (principal component) clustering
Two-way clustering:
– Block clustering
– Gene Shaving [Hastie et al. (2000)]
– Plaid Models [Lazzeroni & Owen (2000)]
Molecular Portraits of Breast Cancer, Perou et al., Nature, 406, 6797, 2000.
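As an illustration of one-way clustering, the following Python sketch runs K-means (Lloyd's algorithm) on a tiny expression matrix, with genes as rows and arrays as columns. It is not from the original slides. The data values, the function names and the choice of squared Euclidean distance are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(profiles, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centers) for a list of profiles."""
    rng = random.Random(seed)
    # Initialise the centers with k distinct profiles chosen at random.
    centers = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each profile goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical log-ratios X[g][i]: 6 genes (rows) measured on 4 arrays (columns).
X = [
    [2.0, 2.1, 1.9, 2.0],
    [2.5, 2.4, 2.6, 2.5],
    [3.0, 3.1, 2.9, 3.0],
    [-2.0, -2.1, -1.9, -2.0],
    [-2.5, -2.4, -2.6, -2.5],
    [-3.0, -2.9, -3.1, -3.0],
]
labels, centers = kmeans(X, k=2)
print(labels)  # the first three genes share one label, the last three the other
```

Which group is labelled 0 and which 1 depends on the random initialisation. To cluster samples instead of genes, the same function can be applied to the transposed matrix.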
Bioinformatics in Norway: some academic actors
• UiO:
– Department of Biochemistry
• Det norske Radiumhospital (DNR):
– The Microarray Project at DNR, led by Ola Myklebost
– Department of Tumor Biology
– Department of Immunology
– Department of Genetics
• The Norwegian Veterinary College
• UiB:
– Department of Biochemistry and Molecular Biology
– Department of Oncology
• NTNU:
– Department of Physiology and Biomedical Engineering (Astrid Lægreid)
• Agricultural University of Norway
Bioinformatics in Norway: consortium on microarray technology
• Who: NTNU, UiB and DNR (UiO)
• Aim:
– Establish front line competence in microarray bioinformatics at all participating institutions.
– Create national data warehouse for microarray based functional genomic analysis.
• Support: The Norwegian Cancer Society and NFR
More information at http://www.med.uio.no/dnr/microarray/english.html
Bioinformatics in Norway: national research initiative
FUGE (Funksjonell genomforskning, functional genomics research)
• What: National plan by NFR and the Norwegian universities.
• Aim: Bring Norway up-to-date on functional genome research.
How can statisticians contribute?
• Close cooperation between researchers from genetics - biochemistry - medicine - biology and statisticians is very important!
• Communicate the need for statistical thinking in analysis of gene expression data
– Concept of noise, replication, reproducible analyses.
Bioinformatics – an
interesting area of research
for statisticians (in Norway)?
YES!