
Bioinformatics – an interesting area of research for statisticians (in Norway)?

Mette Langaas
Norsk Regnesentral (Norwegian Computing Center)
06/03/2001

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/

Outline of talk

• What is bioinformatics?
• What do we need to know in biochemistry?
• The Human Genome Project.
• Research questions in bioinformatics.
• Gene expression and data from DNA microarrays.
• Statistical methods for analysing gene expression data.
• Bioinformatics in Norway.
• How can statisticians contribute?

What is Bioinformatics?

Russ Altman (in Bioinformatics), broad definition:
Bioinformatics is the study of how information technologies are used to solve problems in biology.

Russ Altman (in Bioinformatics), narrow definition:
Bioinformatics is the creation and management of biological databases in support of genomic sequences.

The BITS-journal:
Bioinformatics is a combination of Computer Science, Information Technology and Genetics to determine and analyse genetic information.

What is Bioinformatics?

Bioinformatics programme at the University of Michigan (slightly modified):
Bioinformatics merges recent advances in molecular biology and genetics with advanced statistics and computer science technology. The goal is increased understanding of the complex web of interactions linking the individual components of a living cell to the integrated behavior of the entire organism.

Statisticians:
Bioinformatics is a collection of statistical methods for dealing with large biological data sets.

What do we need to know in Molecular Genetics and Biochemistry?

• Cell
• Chromosomes
• DNA
• Gene
• Genome

The Cell

Copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

Human Chromosomes

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/project/info.html
Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

Example of DNA (tertiary structure)

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

DNA (deoxyribonucleic acid) is a double-stranded helix, consisting of a
• sugar-phosphate backbone and
• nitrogenous bases:
  • Adenine
  • Cytosine
  • Guanine
  • Thymine

Base pair (bp): two bases paired by hydrogen bonds between the bases.
Adenine pairs with Thymine.
Guanine pairs with Cytosine.

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt
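The pairing rule on the DNA slide (Adenine with Thymine, Guanine with Cytosine) means that one strand determines the other. A minimal Python sketch of that rule follows. It is an illustration only, and the names `PAIRS`, `complement`, `reverse_complement` and `is_base_paired` are hypothetical, not taken from the talk.

```python
# Base pairing as stated on the slide: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]


def is_base_paired(strand_a: str, strand_b: str) -> bool:
    """True if two equal-length strands pair base for base."""
    return len(strand_a) == len(strand_b) and complement(strand_a) == strand_b.upper()
```

For example, `complement("ATGC")` gives `"TACG"`, so a sequence of n base pairs is fully described by n letters.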

Genome - Chromosome - Gene

Genome: All the genetic material in the chromosomes of a particular organism; its size is generally given as its total number of base pairs. [Human genome: 3 × 10^9 bp; more than 99% of the human DNA sequences are the same across the population.]

Chromosome: The self-replicating genetic structure of cells containing the cellular DNA that bears in its nucleotide sequence the linear array of genes. Eukaryotic genomes consist of a number of chromosomes whose DNA is associated with different kinds of proteins. [Human chromosome lengths range from 50 million to 263 million bp.]

Gene: The fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule). [Human genes: 30 000?, average length 3000 bp.]

Glossary found at http://www.ornl.gov/hgmis/publicat/glossary.html

What MORE do we need to know in Molecular Genetics and Biochemistry?

• What does the gene do?
  The gene encodes a specific functional product (i.e., a protein or RNA molecule).
• Protein synthesis
• mRNA
• Amino acid

Proteins are built from amino acids

Protein: A large molecule composed of one or more chains of amino acids in a specific order; the order is determined by the base sequence of nucleotides in the gene coding for the protein. Proteins are required for the structure, function, and regulation of the body's cells, tissues, and organs, and each protein has unique functions. Examples are hormones, enzymes, and antibodies.

Amino acid: Any of a class of 20 molecules that are combined to form proteins in living things. The sequence of amino acids in a protein and hence protein function are determined by the genetic code. [The 20 amino acids are: alanine, arginine, asparagine, aspartic acid, cysteine, glutamic acid, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine.]

Glossary found at http://www.ornl.gov/hgmis/publicat/glossary.html

Protein synthesis: transcription and translation

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/project/info.html
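The two steps named on the protein synthesis slide, transcription and translation, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the input is taken to be the coding strand (so the mRNA is the same sequence with T replaced by U), and `CODON_EXCERPT` holds only a small excerpt of the standard genetic code, not the full table of 64 codons. All names are hypothetical.

```python
# Excerpt of the standard genetic code (assumption: only these codons are
# needed for the example; a real table has 64 entries).
CODON_EXCERPT = {
    "AUG": "Met",
    "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "AAA": "Lys", "AAG": "Lys",
    "UGG": "Trp",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_strand: str) -> str:
    """DNA coding strand -> mRNA: same sequence with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list:
    """Read the mRNA three bases at a time until a stop codon is reached."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_EXCERPT.get(mrna[start:start + 3], "???")
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


if __name__ == "__main__":
    print(translate(transcribe("ATGTTTGGCAAATGGTAA")))
```

Codons missing from the excerpt are reported as `"???"` rather than guessed.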

Figures (two slides) taken from The Human Genome Project at http://www.ornl.gov/hgmis/
The Human Genome Project

• Begun formally in 1990, planned to be completed in 2003.
• The U.S. Human Genome Project is coordinated by the U.S. Department of Energy and the National Institutes of Health.
• Project goals are
  – to identify all the approximately 50,000(?) genes in human DNA,
  – determine the sequences of the 3 billion chemical bases that make up human DNA,
  – store this information in databases,
  – develop faster, more efficient sequencing technologies,
  – develop tools for data analysis, and
  – address the ethical, legal, and social issues that may arise from the project.

Results by now:
• Draft of entire genome (June 2000)
• 9711 mapped genes (February 4, 2001)
• New estimate: 30 000 genes (February, 2001)

Research questions in Bioinformatics

Data Management:
– databases, searchable, compare.
Biological sequence alignment:
– compare two DNA sequences (HMM).
Pharmacogenetics:
– how genetic differences influence the variability in patients’ responses to drugs.
Proteomics:
– which proteins are present in a cell and which proteins interact with each other.
Structural genomics:
– determine the (3D) structure of the proteins encoded by a genome.
Comparative genomics:
– the function of human genes and other DNA regions is often revealed by studying their parallels in nonhumans (mice and men...).
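The slide names hidden Markov models (HMM) for comparing two DNA sequences. As a simpler stand-in, and explicitly not the HMM approach, the sketch below scores a global alignment with classical dynamic programming (Needleman-Wunsch). The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the function name are assumptions chosen for illustration.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of sequences a and b (Needleman-Wunsch)."""
    # prev[j] holds the best score for aligning the first i-1 letters of a
    # with the first j letters of b; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,            # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]
```

With these scores, `alignment_score("ACGT", "AGT")` is 2: three matches and one gap.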

Comparing human and mouse chromosomes

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/

Research questions in Bioinformatics (cont’d.)

Transcriptomics:
– use mRNA transcripts to determine which genes are turned on/off in a particular cell or tissue type, and how disease changes this expression.
Functional genomics:
– experimental approaches and resources to assess gene function
– development of software tools to handle and interpret data, e.g. from DNA microarrays.

Functional genomics: gene expression and data from DNA microarrays

• Gene expression.
• cDNA microarray experiment.
• Data from one cDNA microarray experiment.
• Data from many cDNA microarray experiments (reference design).
• Applications.

Gene expression

The process by which a gene's coded information is converted into the structures present and operating in the cell. Expressed genes include those that are transcribed into mRNA and then translated into protein and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs).

http://www.ornl.gov/hgmis/publicat/glossary.html

cDNA microarray experiment

[Figure: workflow of a cDNA microarray experiment. Labels: cDNA clones (probes); PCR product amplification; purification; printing (0.1 nl/spot); microarray; mRNA target; hybridise target to microarray; excitation (laser 1, laser 2); emission; scanning; overlay images and normalise; analysis.]
Copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/

The cDNA microarray experiment

1. Constructing the microarray (probe):
   • From a collection of purified DNAs. A drop of each type of DNA in solution is placed on a specially prepared glass microscope slide by an arrayer machine.
2. Choosing and preparing the targets:
   • Select targets: the aim is to compare gene expression in different cell populations: tissue specific, disease specific, environmental, cell cycle etc.
   • mRNA extraction: capture mRNA, amplification.
   • Reverse transcription to cDNA (more stable).
   • Fluorescent labelling of cDNA targets: to identify their presence. Red and green dyes (Cy5 and Cy3) are the most common.

The cDNA microarray experiment (cont’d.)

3. Hybridization and scanning:
   • The cDNA target will hybridize to spots on the array.
   • Using a laser (different wavelengths) the fluorescent target will emit light. The intensity will reflect the abundance of mRNA in the original target tissue. Using a scanner, two images (red and green) are acquired.
4. Image analysis of the microarray:
   • Identify the spots (gridding, segmentation) and assign an intensity measurement.
   • Relate the intensity in each spot to the background intensity (local or overall) and filter out weak spots (signal-to-noise ratio low, label as missing).

Data from one cDNA experiment

Raw images: for each microarray experiment we have two images with measurements of fluorescent intensities. Spots of bad quality are flagged.

From image to intensities: using image analysis techniques the spot and background pixels are determined. An intensity measurement is assigned as the difference between spot and (local) background. Missing values are defined as spots where the signal is not much larger than the background noise.
$G_g$ = green intensity for gene $g$
$R_g$ = red intensity for gene $g$

Relative log-intensities: there is variation in the amount of DNA from spot to spot, so intensities are only meaningful in a relative sense. For gene $g$ on a given array the relative log-intensity (usually base 2) is $X_g^* = \log_2(R_g / G_g)$.

The data vector: $\{X_g^*\}$ for $g = 1, \ldots, \#\text{genes}$.

Data from many cDNA experiments

Reference design: use the same reference sample (green) for each experiment (often cultivated cells). The different tissue samples are dyed red.

sample 1    sample 2    sample 3    ...  sample n
reference   reference   reference   ...  reference

From image to intensities for each experiment:
$G_{gi}$ = green intensity for gene $g$ at array $i$
$R_{gi}$ = red intensity for gene $g$ at array $i$

Relative log-intensities from each experiment: $X_{gi}^* = \log_2(R_{gi} / G_{gi})$

Median polishing: iteratively subtract the column and row medians.

The data matrix: $\{X_{gi}\}$ for $g = 1, \ldots, \#\text{genes}$ and $i = 1, \ldots, \#\text{arrays}$.
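The path from spot intensities to the data matrix can be sketched in plain Python: background-corrected red and green intensities give the relative log-intensity, and median polishing removes row and column medians. This is a minimal sketch, not the procedure used in the talk. The cut-off `min_signal` for declaring a value missing, the fixed number of sweeps `n_iter`, and the function names are assumptions, and `median_polish` here expects a matrix without missing values.

```python
from math import log2
from statistics import median


def log_ratio(red_fg, red_bg, green_fg, green_bg, min_signal=0.0):
    """Relative log-intensity log2(R/G) after background subtraction.

    Returns None (missing) when either corrected signal is not above
    min_signal, an assumed stand-in for "not much larger than background".
    """
    red = red_fg - red_bg
    green = green_fg - green_bg
    if red <= min_signal or green <= min_signal:
        return None
    return log2(red / green)


def median_polish(matrix, n_iter=10):
    """Iteratively subtract row and column medians; return the residuals."""
    resid = [list(row) for row in matrix]
    for _ in range(n_iter):
        for row in resid:                      # sweep out row (gene) medians
            m = median(row)
            for j in range(len(row)):
                row[j] -= m
        for j in range(len(resid[0])):         # sweep out column (array) medians
            m = median([row[j] for row in resid])
            for row in resid:
                row[j] -= m
    return resid
```

A spot with red 500 over background 100 and green 200 over background 100 gives log2(400/100) = 2, that is, a four-fold higher red signal.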

DNA microarray applications

• Human disease diagnostics and treatment
  – determination of predisposition and risk factors wrt. certain diseases
  – prediction of risk factors involved using certain treatment schemes
  – monitor disease stage and treatment progress
• Agricultural diagnostics and development
  – identify plant pathogens to allow suitable plant protection to be improved
  – efficacy and economy in plant biotechnology
• Analysis of food and genetically modified organisms (GMO)
  – determine the integrity of food
  – detect alterations and contaminations
  – quantify GMOs
• Drug discovery and drug development

Statistical methods for analysing gene expression data

TASK                                                   METHOD
design the experiment                                  experimental design
normalize within-array and between-array               analysis of variance
find similar groups of samples and/or genes            clustering
discriminate between two or more groups of samples;    discrimination and classification
classify a new sample to one of many groups
(or compute group probability)
find genes that are differentially expressed           multiple testing,
(all samples or subsets)                               feature extraction

Experimental design and ANOVA

[Kerr & Churchill, The Jackson Lab, August 2000]

Sources of variation for the fluorescent intensities
• Variety:
  • timepoints of a biological process
  • different types of tissue
  • different treatments
  • different types of disease
• Genes (hybridization efficiency)
• Dyes (two dyes; is one dye consistently brighter than the other?)
• Arrays (probing conditions)
• Dye*Variety (differences when dyeing the samples)
• Array*Gene (amount of cDNA on probes for the same gene on different arrays varies = "spot" effects)
• Dye*Gene (are there differences in the dyes that are gene specific?)
• Variety*Gene (EFFECT of INTEREST)

Experimental design and ANOVA (cont’d.)

ANOVA model:

$y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (VG)_{kg} + (AG)_{ig} + \varepsilon_{ijkg}$

where $y_{ijkg}$ is a transformation of $R_{gi}$ or $G_{gi}$ so that effects are additive, and $\varepsilon_{ijkg}$ has distribution $F(0, \sigma^2)$ or $F_g(0, \sigma_g^2)$.

Replication of genes on every array: $(AG)_{igs}$
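A short worked step may help connect the ANOVA model to the relative log-intensities used elsewhere in the talk. Assuming the transformation is the logarithm, take the two measurements of gene $g$ on the same array $i$: dye 1 on variety $k$ and dye 2 on variety $k'$. The indices $k'$ and the specific dye labels 1 and 2 are notation introduced here for illustration. Subtracting the two model equations term by term gives:

```latex
\begin{align*}
y_{i1kg} - y_{i2k'g}
  &= (D_1 - D_2) + (V_k - V_{k'})
     + \bigl[(VG)_{kg} - (VG)_{k'g}\bigr]
     + (\varepsilon_{i1kg} - \varepsilon_{i2k'g})
\end{align*}
```

The terms $\mu$, $A_i$, $G_g$ and $(AG)_{ig}$ cancel, so the within-spot difference is free of array, gene and spot effects. What remains is an overall dye and variety offset, shared by all genes, plus the Variety*Gene contrast of interest.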

Experimental design and ANOVA (cont’d.)

Reference design:
• same reference variety on each array (variety not of interest)
• the most popular design.
• VG is completely confounded with DG.
• No degrees of freedom left for error estimation.
• Use when not enough tissue to dye twice.

Loop design:
• collects twice as much data on the varieties of interest.
• balanced wrt. dye, but each sample must be dyed twice.
• (#genes-1) degrees of freedom left for error estimation.
• more difficult to understand for biologists.

Clustering

Aim: partition genes or samples into groups so that the groups are homogeneous and well-separated.

Data: $\{X_{gi}\}$ for $g = 1, \ldots, \#\text{genes}$ and $i = 1, \ldots, \#\text{arrays}$.

Results:
– Find groups of genes that are co-regulated
– Find subgroups (previously unknown) of diseases

Issues:
– feature extraction (choosing a subset of the genes)
– one-way or two-way clustering
– overlapping vs. non-overlapping clusters
– membership in more than one cluster
– assessing the reliability of clustering results

Clustering: methods for analysing gene expression data

One-way clustering:
– Hierarchical clustering
– Self-organizing maps (SOM) [Kohonen]
– K-means
– SVD-based (principal component) clustering

Two-way clustering:
– Block clustering
– Gene Shaving [Hastie et al. (2000)]
– Plaid Models [Lazzeroni & Owen (2000)]

Molecular Portraits of Breast Cancer, Perou et al., Nature, 406, 6797, 2000.
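Of the one-way clustering methods listed, K-means is the easiest to write down. The sketch below is a bare-bones version in plain Python, offered as an illustration rather than a recommended implementation. It assumes each point is one expression profile (a row or column of the data matrix), uses Euclidean distance, and starts from the first `k` points as centres; practical implementations usually use random restarts. The function name and parameters are hypothetical.

```python
from math import dist
from statistics import fmean


def kmeans(points, k, n_iter=100):
    """Plain K-means: returns (labels, centres) for a list of numeric vectors."""
    centres = [list(p) for p in points[:k]]   # assumption: first k points as start
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [min(range(k), key=lambda c: dist(p, centres[c])) for p in points]
        if new_labels == labels:              # no change: converged
            break
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = [fmean(col) for col in zip(*members)]
    return labels, centres
```

Clustering samples and clustering genes use the same function; only the orientation of the data matrix changes.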

Discrimination and Classification

Aim:
– Discriminate between two or more classes (e.g. normal vs. different disease classes).
– Predict the class (or probability of belonging to each class) of a new sample.
Data:
– $\{X_{gi}\}$ for $g = 1, \ldots, \#\text{genes}$ and $i = 1, \ldots, \#\text{arrays}$.
– Class membership (e.g. normal, different disease classes).
Results:
– See which genes are important in discriminating between classes.
– Predictive tool.
Issues:
– Large p (#genes) small n (#arrays, #samples).
– Feature extraction or other forms of shrinkage.

Discrimination and classification: methods for analysing gene expression data

Used:
– K-nearest neighbour [Fix and Hodges (1951)]
– Support Vector Machines
– CART [Breiman et al. (1984)]
– Different versions of classifying to the class with the largest probability $p(c) \cdot p(x \mid c)$, where $p(x \mid c)$ is Gaussian with some structure on the covariance matrix (often diagonal).
– Voted Classification (bagging, boosting)
– Bayesian regression [West et al. (2000)]
Alternative methods:
– Methods for "large p small n" regression (PLS, PCR, ridge regression, continuum regression, etc.)
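The rule "classify to the class with the largest $p(c) \cdot p(x \mid c)$" with a diagonal Gaussian covariance can be sketched as follows. This is one simple variant among those the slide alludes to, under stated assumptions: class priors are estimated by class frequencies, each gene gets its own variance per class, and a small floor (`min_var`) guards against zero variance. The function names and the floor are hypothetical choices.

```python
from math import log, pi
from statistics import fmean, variance


def fit_diagonal_gaussian(samples, classes, min_var=1e-6):
    """Estimate prior, per-gene mean and per-gene variance for each class."""
    model = {}
    for c in set(classes):
        rows = [x for x, y in zip(samples, classes) if y == c]
        means = [fmean(col) for col in zip(*rows)]
        variances = [max(variance(col), min_var) for col in zip(*rows)]
        model[c] = (len(rows) / len(samples), means, variances)
    return model


def classify(model, x):
    """Return the class maximising log p(c) + log p(x | c)."""
    def log_posterior(c):
        prior, means, variances = model[c]
        log_lik = sum(-0.5 * log(2 * pi * v) - (xi - m) ** 2 / (2 * v)
                      for xi, m, v in zip(x, means, variances))
        return log(prior) + log_lik
    return max(model, key=log_posterior)
```

With many more genes than samples, a full covariance matrix cannot be estimated reliably, which is one reason a diagonal structure is attractive in the "large p small n" setting.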

Feature extraction and multiple testing

Aim:
– Identify differentially expressed genes.
Data:
– $\{X_{gi}\}$ for $g = 1, \ldots, \#\text{genes}$ and $i = 1, \ldots, \#\text{arrays}$.
– Possible class membership.
Results:
– Identify genes that can be important in discriminating between different classes.
Methods:
– Within the ANOVA framework, testing the (VG) interaction.
– Compute a t-statistic for each gene (difference e.g. for control and treatment group) and adjust the p-values (Bonferroni, permutation methods).
Other:
– Many ad hoc rules; differentially expressed genes are genes where more than 3 values are outside some interval (no class).

Bioinformatics in Norway

No statisticians involved in Norway!

Bioinformatics groups at Norwegian universities:
• UiO: Bioinformatics group, Department of Informatics, O. C. Lingjærde and K. Liestøl
  – functional genomics using microarrays, clustering.
• UiB: Bioinformatics Research Group, Department of Informatics, led by I. Jonassen.
  – analysis of biological sequences and structure
  – J-Express clustering method for gene expression data
• NTNU: Knowledge Systems Group, Department of Computer and Information Science, led by J. Komorowski.
  – classification from gene expression data and a priori information
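Returning to the feature extraction and multiple testing slide: the per-gene t-statistic with adjusted p-values can be sketched in plain Python. This is an illustration under stated assumptions, not the speaker's procedure. It uses a Welch-type t-statistic, an exact two-sided permutation p-value obtained by enumerating every split of the pooled values (feasible only for small groups), and a Bonferroni adjustment. All function names are hypothetical.

```python
from itertools import combinations
from math import copysign, inf, sqrt
from statistics import fmean, variance


def t_statistic(a, b):
    """Two-sample t-statistic with separate variances (Welch form)."""
    diff = fmean(a) - fmean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:                      # degenerate split with no variation
        return 0.0 if diff == 0.0 else copysign(inf, diff)
    return diff / se


def permutation_p_value(control, treatment):
    """Exact two-sided permutation p-value for one gene."""
    observed = abs(t_statistic(control, treatment))
    pooled = list(control) + list(treatment)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(control)):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(t_statistic(a, b)) >= observed - 1e-12:
            hits += 1
    return hits / total


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With four arrays per group there are only 70 possible splits, so the smallest attainable two-sided p-value is 2/70. After a Bonferroni adjustment over thousands of genes no gene could reach significance, which illustrates why sample size matters here.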

Bioinformatics in Norway: some academic actors

• UiO:
  – Department of Biochemistry
• Det norske Radiumhospital (DNR):
  – The Microarray Project at DNR, led by Ola Myklebost
  – Department of Tumor Biology
  – Department of Immunology
  – Department of Genetics
• The Norwegian Veterinary College
• UiB:
  – Department of Biochemistry and Molecular Biology
  – Department of Oncology
• NTNU:
  – Department of Physiology and Biomedical Engineering (Astrid Lægreid)
• Agricultural University of Norway

Bioinformatics in Norway: consortium on microarray technology

• Who: NTNU, UiB and DNR (UiO)
• Aim:
  • Establish front line competence in microarray bioinformatics at all participating institutions.
  • Create national data warehouse for microarray based functional genomic analysis.
• Support: The Norwegian Cancer Society and NFR

More information at http://www.med.uio.no/dnr/microarray/english.html

Bioinformatics in Norway: other actors

• The Norwegian Biotechnology Advisory Board (Bioteknologinemnda), led by Sissel Rogne, publication "Genialt". http://www.bion.no
• EMBnet Norway, Norwegian node of network for commercial and academic bioinformatic centers. http://www.no.embnet.org
• Biotechnology Center (Bioteknologisenteret)
• SINTEF UniMed, MR Center, Bioinformatics group
• MATFORSK (fingerprinting bacteria, GMO)
• Genomar (salmon and tilapia) http://www.genomar.com
• Glaxo SmithKline (free offices to gene-researchers?)
• Nycomed Pharma

Bioinformatics in Norway: NFR project

Salmon Genome Project (SGP)
• Aim: Expand our knowledge of the biology of salmon and introduce modern genetic techniques in breeding and management of Atlantic salmon. The project is said to combine research in molecular genetics with bioinformatics, with focus on genome organization and gene function.
• Who: The Norwegian Veterinary College, the University of Oslo, the University of Bergen, SINTEF Unimed and the Institute of Marine Research.
• Cost: 350 MNOK

Bioinformatics in Norway: national research initiative

FUGE (Funksjonell genomforskning, functional genomics research)
• What: National plan by NFR and the Norwegian universities.
• Aim: Bring Norway up-to-date on functional genome research.
• Areas: biological, medical, marine research.
• Cost: 300 MNOK each year for 5-10 years (dependent on approval from Stortinget).

More information at http://www.forskningsradet.no/fag/andre/fuge/

How can statisticians contribute?

• Close cooperation between researchers from genetics, biochemistry, medicine and biology and the statisticians is very important!
• Communicate the need for statistical thinking in analysis of gene expression data
  – Concept of noise, replication, reproducible analyses.
• Statistical challenges identified today:
  – Model the entire experimental phase to arrive at optimal experimental designs dependent on practical limitations.
  – Suggestions for within-array and between-array normalization.
  – Handle missing values.
  – Large p small n.

Bioinformatics – an interesting area of research for statisticians (in Norway)?

✓ Important biological/medical problems
✓ New area with exciting technologies
✓ Large amounts of data
✓ Statistical experience is scarce!
✓ Many statistical challenges!

YES!
