
Cairo University

Faculty of Computers and Information


Information Systems Department
Bioinformatics Survey

Microarray time series


classification

Prepared by:

Mohamed Mahmoud Mahmoud


Mostafa Lamlom Ahmed

Hani Hussein Hassan Ali

Supervised by:

Dr. Hoda Mokhtar

Survey
2011
List of Figures

Figure 1.1
Figure 1.2
Figure 2.1
Figure 3.1
Figure 3.2



Table of Contents

Chapter 1: Introduction
1.1 What is DNA microarray technology?
1.2 What is DNA microarray technology used for?
1.3 How does DNA microarray technology work?
1.4 Microarray types and usage

Chapter 2: Microarray Methods
2.1 Independent Component Analysis (ICA)
2.2 ICA models of gene expression data
2.3 Gene selection
2.4 Classifiers

Chapter 3: Data Mining for DNA Microarray
3.1 Gene Selection
3.2 Pattern Classification
3.2.1 MLP
3.2.2 KNN
3.2.3 SVM
3.2.4 Ensemble

Chapter 4: Classification Algorithms
4.1 Wavelet Approach

Chapter 5: Conclusion

Chapter 6: Future Work

Chapter 7: References



Chapter 1: Introduction


In [1], microarray technology has supplied a large volume of data, turning many problems in biology into computational problems. As a result, techniques for extracting useful information from these data have been developed. In particular, microarray technology has been applied to the prediction and diagnosis of cancer, so it is expected to help us predict and diagnose cancer accurately. To classify cancer precisely, we have to select the genes related to cancer, because the expression values extracted from a microarray contain much noise. The authors explore seven feature selection methods and four classifiers, and propose ensemble classifiers, on three benchmark datasets in order to systematically evaluate the performance of the feature selection methods and machine learning classifiers. The three benchmark datasets are a leukemia cancer dataset, a colon cancer dataset, and a lymphoma cancer dataset. The classifiers are combined by majority voting, weighted voting, and a Bayesian approach to improve classification performance. Experimental results show that an ensemble of several base classifiers produces the best recognition rate on the benchmark datasets.

In [5], microarray technologies facilitate the generation of vast amounts of bio-signal or genomic signal data. The major challenge in processing these signals is extracting the global characteristics of the data, due to their huge dimension and the complex relationships among the various genes.

1.1 What is DNA microarray technology?

In [1], although all of the cells in the human body contain identical genetic material, the same
genes are not active in every cell. Studying which genes are active and which are inactive in
different cell types helps scientists to understand both how these cells function normally and how
they are affected when various genes do not perform properly. In the past, scientists have only
been able to conduct these genetic analyses on a few genes at once. With the development of DNA
microarray technology, however, scientists can now examine how active thousands of genes are at
any given time.

In [2], DNA microarray technology has attracted tremendous interest in both the scientific community and industry. Generally, microarray expression experiments allow the recording of expression levels of thousands of genes simultaneously. These experiments primarily consist of monitoring each gene multiple times under various conditions.

In [3], microarray data have recently attracted much attention in the bioinformatics research area. Microarray data are recordings of the expression levels of thousands of genes measured under various experimental settings. Microarray data typically contain several thousand genes (features) but only tens of experiments (samples), which is referred to as a small-sample-size problem.

In [6], DNA microarray technology allows simultaneous monitoring and measuring of thousands of gene expression activation levels in a single experiment. This technology is currently used in medical diagnosis and gene analysis. Many microarray research projects focus on clustering analysis and classification accuracy. In clustering analysis, the purpose is to analyze gene groups that show correlated patterns in the gene expression data and to provide insight into gene interactions and function. Research on classification accuracy aims at building an efficient model for predicting the class membership of data, producing correct labels on training data, and predicting the labels of unknown data correctly.

In [8], the analysis of microarray data requires two steps: feature selection and classification. Given the variety of feature selection methods and classifiers, it is difficult to find optimal ensembles of feature-classifier pairs. This paper proposes a novel method based on an evolutionary algorithm (EA) to form sophisticated ensembles of features and classifiers that can achieve high classification performance. In spite of the exponential number of possible ensembles of individual feature-classifier pairs, an EA can produce the best ensemble in a reasonable amount of time. The chromosome is encoded with real values that decide the weight of each feature-classifier pair in an ensemble. Experimental results with two well-known microarray datasets, in terms of time and classification rate, indicate that the proposed method produces ensembles that are superior to individual classifiers, as well as to ensembles optimized by random and greedy strategies.
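
The following is a minimal sketch, not the authors' implementation, of how an evolutionary algorithm could optimize real-valued weights for feature-classifier pairs as described above. The base-classifier predictions, the (mu + lambda) selection scheme, and the validation-accuracy fitness are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: predictions of each feature-classifier pair on a
# validation set (rows = pairs, columns = samples), plus the true labels.
# In practice these would come from classifiers trained on real microarray data.
n_pairs, n_samples = 12, 40
pair_predictions = rng.integers(0, 2, size=(n_pairs, n_samples))  # 0/1 class labels
y_true = rng.integers(0, 2, size=n_samples)

def ensemble_accuracy(weights):
    """Fitness: accuracy of the weighted-vote ensemble encoded by `weights`."""
    score_for_1 = weights @ pair_predictions        # total weight voting for class 1
    score_for_0 = weights @ (1 - pair_predictions)  # total weight voting for class 0
    y_pred = (score_for_1 > score_for_0).astype(int)
    return np.mean(y_pred == y_true)

# Simple evolutionary loop over real-valued chromosomes (one weight per pair).
pop_size, n_generations, sigma = 20, 50, 0.1
population = rng.random((pop_size, n_pairs))
for _ in range(n_generations):
    fitness = np.array([ensemble_accuracy(w) for w in population])
    parents = population[np.argsort(fitness)[-pop_size // 2:]]        # keep the best half
    children = parents + rng.normal(0.0, sigma, size=parents.shape)   # Gaussian mutation
    population = np.clip(np.vstack([parents, children]), 0.0, 1.0)

best = population[np.argmax([ensemble_accuracy(w) for w in population])]
print("best ensemble accuracy:", ensemble_accuracy(best))
```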

1.2 What is DNA microarray technology used for?

In [1], microarray technology will help researchers learn more about many different diseases, including heart disease, mental illness, and infectious diseases, to name only a few. One intense area of microarray research at the National Institutes of Health (NIH) is the study of cancer. In the past, scientists classified different types of cancer based on the organs in which the tumors develop. With the help of microarray technology, however, they will be able to further classify these cancers based on the patterns of gene activity in the tumor cells. Researchers will then be able to design treatment strategies targeted directly to each specific type of cancer. Additionally, by examining the differences in gene activity between untreated and treated tumor cells - for example, those that are irradiated or oxygen-starved - scientists will understand exactly how different therapies affect tumors and will be able to develop more effective treatments.



Figure 1.1

Figure 1.2



1.3 How does DNA microarray technology work?

In [1], DNA microarrays are created by robotic machines that arrange minuscule amounts of hundreds or
thousands of gene sequences on a single microscope slide. Researchers have a database of over 3.5
million genetic sequences that they can use for this purpose. When a gene is activated, cellular
machinery begins to copy certain segments of that gene. The resulting product is known as messenger
RNA (mRNA), which is the body's template for creating proteins. The mRNA produced by the cell is
complementary, and therefore will bind to the original portion of the DNA strand from which it was
copied.

In [1], to determine which genes are turned on and which are turned off in a given cell, a researcher must first collect the messenger RNA molecules present in that cell. The researcher then labels each mRNA molecule by using a reverse transcriptase (RT) enzyme that generates a complementary cDNA from the mRNA. During this process, fluorescent nucleotides are attached to the cDNA. The tumor and normal samples are labeled with different fluorescent dyes. Next, the researcher places the labeled cDNAs onto a DNA microarray slide. The labeled cDNAs that represent mRNAs in the cell will then hybridize - or bind - to their synthetic complementary DNAs attached to the microarray slide, each leaving its fluorescent tag. The researcher must then use a special scanner to measure the fluorescent intensity of each spot on the microarray slide.

In [1], if a particular gene is very active, it produces many molecules of messenger RNA and thus more labeled cDNAs, which hybridize to the DNA on the microarray slide and generate a very bright fluorescent area. Genes that are somewhat less active produce fewer mRNAs and thus fewer labeled cDNAs, which results in dimmer fluorescent spots. If there is no fluorescence, none of the messenger molecules have hybridized to the DNA, indicating that the gene is inactive. Researchers frequently use this technique to examine the activity of various genes at different times. When tumor samples (red dye) and normal samples (green dye) are co-hybridized together, they compete for the synthetic complementary DNAs on the microarray slide. As a result, if a spot is red, the corresponding gene is more expressed in the tumor than in the normal tissue (up-regulated in cancer). If a spot is green, the gene is more expressed in the normal tissue (down-regulated in cancer). If a spot is yellow, the gene is equally expressed in normal and tumor tissue.
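
As a rough illustration of the red/green/yellow interpretation above, the sketch below classifies spots from their two channel intensities using a log2 ratio. The threshold of 1 (a two-fold change) and the example intensities are arbitrary assumptions, not values taken from [1].

```python
import numpy as np

def classify_spots(red_intensity, green_intensity, threshold=1.0):
    """Label each spot as up-/down-regulated in the tumor sample or unchanged.

    red_intensity   -- fluorescence of the tumor-labeled cDNA per spot
    green_intensity -- fluorescence of the normal-labeled cDNA per spot
    threshold       -- |log2 ratio| above which a gene is called differentially expressed
    """
    red = np.asarray(red_intensity, dtype=float)
    green = np.asarray(green_intensity, dtype=float)
    log_ratio = np.log2((red + 1.0) / (green + 1.0))   # +1 avoids division by zero
    labels = np.where(log_ratio > threshold, "up-regulated (red)",
             np.where(log_ratio < -threshold, "down-regulated (green)",
                      "unchanged (yellow)"))
    return log_ratio, labels

# Example: three spots measured on the slide.
ratios, calls = classify_spots([5200, 180, 950], [400, 2100, 1000])
for r, c in zip(ratios, calls):
    print(f"log2 ratio = {r:+.2f} -> {c}")
```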



1.4 Microarray types and usage

In [1], many types of array exist, and the broadest distinction is whether they are spatially arranged on a surface or on coded beads:

- The traditional solid-phase array is a collection of orderly microscopic "spots", called features, each with a specific probe attached to a solid surface such as glass, plastic, or a silicon biochip (commonly known as a genome chip, DNA chip, or gene array). Thousands of them can be placed in known locations on a single DNA microarray.

- The alternative bead array is a collection of microscopic polystyrene beads, each with a specific probe and a ratio of two or more dyes, which do not interfere with the fluorescent dyes used on the target sequence.

In [1], DNA microarrays can be used to detect DNA (as in comparative genomic hybridization), or
detect RNA (most commonly as cDNA after reverse transcription) that may or may not be translated
into proteins. The process of measuring gene expression via cDNA is called expression analysis or
expression profiling.



Chapter 2: Microarray Methods

2.1 Independent Component Analysis (ICA)

In [2], ICA is a useful extension of PCA that was developed in the context of blind separation of independent sources from their linear mixtures. Such blind separation techniques have been used, for example, in various applications of auditory signal separation, medical signal processing, and so on. Roughly speaking, rather than requiring that the coefficients of a linear expansion of the data vectors be uncorrelated, as in PCA, ICA requires these coefficients to be mutually independent (or as independent as possible). This implies that higher-order statistics are needed to determine the ICA expansion.

2.2 ICA models of gene expression data


In [2], in this approach, ICA is used to find a matrix W such that the rows of U are as statistically independent as possible. The independent eigenassays estimated by the rows of U are then used to represent the snapshots. The representation of each snapshot consists of its coordinates with respect to the eigenassays defined by the rows of U.

Figure 2.1

The gene expression data synthesis model. To find a set of independent basis snapshots (eigenassays), the snapshots in X are considered to be a linear combination of statistically independent basis snapshots (the eigenassays, i.e. the rows of S), where W is the unmixing matrix and A = W^-1 is the unknown mixing matrix; that is, X = AS and U = WX estimates S. The independent eigenassays are estimated as the output U of the learned ICA.
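
A minimal sketch of this decomposition using scikit-learn's FastICA is given below. The random matrix stands in for the snapshot matrix X (snapshots x genes), and the number of components is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Stand-in snapshot matrix X: 20 snapshots (arrays) x 500 genes.
X = rng.normal(size=(20, 500))

# Find independent eigenassays: X is modeled as A @ S, and U = W @ X estimates S.
ica = FastICA(n_components=5, random_state=0, max_iter=1000)
S_T = ica.fit_transform(X.T)        # shape (500 genes, 5 components)
eigenassays = S_T.T                 # rows of U: 5 independent eigenassays across all genes
A = ica.mixing_                     # shape (20 snapshots, 5 components): estimated mixing matrix

# Each snapshot is represented by its coordinates (one row of A) w.r.t. the eigenassays.
X_approx = A @ eigenassays + ica.mean_[:, None]
print("relative reconstruction error:", np.linalg.norm(X - X_approx) / np.linalg.norm(X))
```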



2.3 Gene selection

In [2], this method finds the individual gene expression profiles that help to discriminate between two classes by calculating, for each gene expression profile gj, a score based on the mean μ1j (respectively μ2j) and the standard deviation σ1j (respectively σ2j) of each class of samples. In this study, we ranked the genes by their scores and retained the top 500, 1000, and 2000 genes of the two data sets for ICA, respectively.
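
The exact score is not reproduced in this survey; the sketch below assumes a signal-to-noise-style score, |μ1j - μ2j| / (σ1j + σ2j), as one plausible choice, and ranks the genes by it. The random data are a stand-in.

```python
import numpy as np

def rank_genes(X, y, n_top=500):
    """Rank genes by a signal-to-noise style score between two classes.

    X -- expression matrix, shape (n_samples, n_genes)
    y -- binary class labels (0/1), shape (n_samples,)
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    mu1, mu2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    s1, s2 = X[y == 0].std(axis=0), X[y == 1].std(axis=0)
    scores = np.abs(mu1 - mu2) / (s1 + s2 + 1e-12)   # small epsilon avoids division by zero
    ranked = np.argsort(scores)[::-1]                # gene indices, best first
    return ranked[:n_top], scores

# Example with a random stand-in dataset: 40 samples x 2000 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2000))
y = rng.integers(0, 2, size=40)
top_genes, scores = rank_genes(X, y, n_top=10)
print("top genes:", top_genes, "with scores", np.round(scores[top_genes], 3))
```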

In [3], the following three methods are used for experimental comparison:

2.3.1 Gene ranking using correlation

This method selects the best genes based on the discriminating power of the individual genes.

2.3.2 Recursive backward feature elimination by SVM

2.3.3 Forward feature selection by FDA


FDA is a statistical dimension reduction method. It finds a linear transformation that optimizes class separability in the reduced-dimensional space. Class separability is measured using the between-class and within-class scatters. A linear transformation that maximizes the between-class scatter and minimizes the within-class scatter projects the original data onto a one-dimensional space.
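
A minimal numpy sketch of this idea for two classes is shown below: it computes the within-class scatter and projects the data onto the Fisher direction w proportional to Sw^-1 (μ1 - μ0). The toy data and the regularization term are assumptions of this sketch.

```python
import numpy as np

def fisher_direction(X, y):
    """Return the one-dimensional Fisher discriminant direction for binary labels y."""
    X = np.asarray(X, dtype=float)
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two class scatter matrices.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    # Direction that maximizes between-class relative to within-class scatter.
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(20, 5)), rng.normal(1.5, 1, size=(20, 5))])
y = np.array([0] * 20 + [1] * 20)
w = fisher_direction(X, y)
projected = X @ w                 # the original data projected onto one dimension
print("class means on the projection:", projected[y == 0].mean(), projected[y == 1].mean())
```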

2.4 Classifiers
In [2], after processing the gene expression data using t-statistics and ICA, the final step is to classify the data set. Many methods have been used for such classification tasks, such as radial basis function neural networks (RBFNN), radial basis probabilistic neural networks, logistic discrimination (LD), quadratic discriminant analysis (QDA), etc. Because the dimension of DNA microarray gene expression data remains high even after processing by ICA, and only a few samples are generally available, we use support vector machines (SVM), which have proved to be very useful, to classify the gene expression data.

The SVM is a relatively new type of machine learning model, originally introduced by Vapnik and co-workers and subsequently extended by a number of other researchers. This model, which shows remarkably robust performance on sparse and noisy data, is becoming a system of choice in a number of applications from text categorization to protein function prediction.

When used for classification, an SVM separates a given set of binary-labeled training data with a hyperplane that is maximally distant from them (the maximal margin hyperplane). For cases in which no linear separation is possible, it can work in combination with the technique of "kernels", which automatically realizes a nonlinear mapping to a feature space. Generally, the hyperplane found by the SVM in feature space corresponds to a nonlinear decision boundary in the original space.
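
Below is a small sketch of using an SVM with a nonlinear (RBF) kernel on ICA-reduced expression data. The synthetic data, the kernel choice, and the parameters are assumptions of this sketch, not values from [2].

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Stand-in for ICA-processed expression data: 60 samples x 10 components, 2 classes.
X = np.vstack([rng.normal(0, 1, size=(30, 10)), rng.normal(1, 1, size=(30, 10))])
y = np.array([0] * 30 + [1] * 30)

# The RBF kernel realizes a nonlinear mapping; the maximal-margin hyperplane in that
# feature space corresponds to a nonlinear decision boundary in the original space.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(model, X, y, cv=5)
print("cross-validated accuracy:", scores.mean().round(3))
```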



Chapter 3: Data Mining for DNA Microarray

In [1], data mining for DNA microarrays means selecting discriminative genes related to classification from the gene expression data and training a classifier that classifies new data. After acquiring the gene expression data calculated from the DNA microarray, the prediction system has two stages: feature selection and pattern classification.

In [1], feature selection can be thought of as gene selection: obtaining, by statistical and information-theoretic methods, the list of genes that might be informative for the prediction. Since it is highly unlikely that all the genes carry information related to cancer, and using all the genes results in too large a dimensionality, it is necessary to explore efficient ways to obtain the best features. We have extracted informative genes using seven methods, and the cancer predictor classifies the category using only these genes.

In [1], given the gene list, a classifier decides at the prediction stage which category the gene pattern belongs to. We have adopted the four most widely used classification methods and an ensemble classifier.

3.1 Gene Selection

In [1], among the thousands of genes whose expression levels are measured, not all are needed for classification. Microarray data consist of a large number of genes in small samples. For classification, we need to select genes highly related to particular classes; these are called informative genes.

In [1], using statistical correlation analysis, we can see the linear relationship between two variables and its direction. The correlation coefficient r varies from -1 to +1, so that data distributed near a line biased in the (+) direction have positive coefficients, and data near a line biased in the (-) direction have negative coefficients.
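
The sketch below illustrates this kind of correlation-based gene ranking by computing the Pearson coefficient between each gene and the class label; the stand-in data and the number of retained genes are assumptions.

```python
import numpy as np

def correlation_ranking(X, y, n_top=50):
    """Rank genes by |Pearson correlation| between expression and the class label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()) + 1e-12)
    order = np.argsort(np.abs(r))[::-1]   # strongest positive or negative correlation first
    return order[:n_top], r

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1000))           # stand-in data: 30 samples x 1000 genes
y = rng.integers(0, 2, size=30)
top, r = correlation_ranking(X, y, n_top=5)
print("top genes:", top, "correlations:", np.round(r[top], 3))
```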



3.2 Pattern Classification

In [1], many algorithms designed for solving classification problems in machine learning have been applied to recent research on the prediction and classification of cancer with gene expression data. The general process of classification in machine learning is to train classifiers to accurately recognize patterns from given training samples and to classify test samples with the trained classifier. Representative classification algorithms such as the multi-layer perceptron, k-nearest neighbour, support vector machine, and structure-adaptive self-organizing maps are applied to the classification. In [3], the optimization criterion for feature selection can be independent of the classifier that works on the selected features, or it can be combined with a classifier in both the selection process and the classification stage.

In [4], we compare the performance of different discrimination methods for tumour classification problems based on different feature selection methods. We consider four feature extraction methods and five classification methods, from which 20 classification models can be derived. Each classification model is a combination of one feature extraction method and one classification method. The feature extraction methods are the t-statistic, the non-parametric Wilcoxon statistic, the ad hoc signal-to-noise statistic, and principal component analysis (PCA); the classification methods are Fisher linear discriminant analysis (FLDA), the support vector machine (SVM), the k-nearest-neighbour classifier (kNN), diagonal linear discriminant analysis (DLDA), and diagonal quadratic discriminant analysis (DQDA). The selection of the parameters used in these feature extraction and classification methods is supervised. These discrimination methods are then applied to three well-known publicly available microarray datasets: acute leukaemia data, prostate cancer data, and lung cancer data.
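
As a hedged illustration of this kind of systematic benchmark, the sketch below crosses a few scikit-learn stand-ins for the feature extraction methods (an F-score filter in place of the t-statistic, and PCA) with several classifiers and cross-validates each combination; the specific methods, parameters, and data of [4] are not reproduced here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))                 # stand-in: 60 tumour samples x 500 genes
y = rng.integers(0, 2, size=60)

extractors = {
    "F-score filter": SelectKBest(f_classif, k=20),   # rough stand-in for a t-statistic filter
    "PCA": PCA(n_components=10),
}
classifiers = {
    "FLDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(kernel="linear"),
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "DLDA-like (GaussianNB)": GaussianNB(),    # diagonal-covariance stand-in
}

# Every feature-extraction / classifier pair defines one classification model.
for ex_name, extractor in extractors.items():
    for clf_name, clf in classifiers.items():
        model = make_pipeline(extractor, clf)
        acc = cross_val_score(model, X, y, cv=5).mean()
        print(f"{ex_name:>16} + {clf_name:<22}: {acc:.3f}")
```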

Classification methods

3.2.1 Multilayer Perceptron (MLP)

In [1], a feed-forward multilayer perceptron (MLP) is an error back-propagation neural network that is applied in many fields due to its powerful and stable learning algorithm. The neural network learns the training examples by adjusting the synaptic weights of the neurons according to the error that occurs at the output layer. The power of the back-propagation algorithm lies in two main aspects: it is local in updating the synaptic weights and biases, and it is efficient in computing all the partial derivatives of the cost function with respect to these free parameters.
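
A compact sketch of such a back-propagation-trained network using scikit-learn's MLPClassifier follows; the architecture, iteration count, and data are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 20)), rng.normal(1, 1, size=(30, 20))])
y = np.array([0] * 30 + [1] * 30)

# A feed-forward MLP trained with back-propagation: the synaptic weights are
# adjusted from the error observed at the output layer.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print("training accuracy:", mlp.score(X, y))
```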

3.2.2 K-Nearest Neighbor (KNN)

In [1], k-nearest neighbor (KNN) is one of the most common memory-based induction methods. Given an input vector, KNN extracts the k closest vectors in the reference set based on similarity measures, and decides the label of the input vector using the labels of the k nearest neighbors. In [4], the k-nearest-neighbour rule proposed by Fix and Hodges, which uses a distance function for pairs of tumour mRNA samples, classifies test set observations based on the training set. For a given sample in the test set, kNN first finds the k closest tumour samples in the training set and then predicts its class by voting, i.e. by choosing the class that is most common among the k neighbours.
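
A short sketch of the k-nearest-neighbour rule described above; k = 3, the Euclidean distance, and the stand-in data are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, size=(25, 10)), rng.normal(2, 1, size=(25, 10))])
y_train = np.array([0] * 25 + [1] * 25)
X_test = rng.normal(1, 1, size=(5, 10))

# For each test sample, find the k closest training samples (Euclidean distance)
# and predict the class by majority vote among those neighbours.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
print("predicted classes:", knn.predict(X_test))
```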

3.2.3 Support Vector Machine (SVM)

In [1], a support vector machine (SVM) estimates the function that classifies the data into two classes. The SVM builds a hyperplane as the decision surface in such a way as to maximize the margin of separation between positive and negative examples. It achieves this through the structural risk minimization principle, which states that the error rate of a learning machine on test data is bounded by the sum of the training error rate and a term that depends on the Vapnik-Chervonenkis (VC) dimension. In [4], the SVM finds the best hyperplane separating the two classes in the training set. The best hyperplane is generally found by maximizing the sum of the distances from the hyperplane to the closest positive and negative correctly classified observations, while penalizing the number of misclassifications.

In [6], a support vector machine (SVM) classification algorithm is used to evaluate the selected features and to establish their influence on classification accuracy. The results indicate that, in terms of the number of genes that need to be selected and the classification accuracy, the proposed method is superior to other methods in the literature.

3.2.4 Ensemble Classifier

In [1], classification can be defined as the process of approximating the I/O mapping from the given observations to the optimal solution. Generally, classification tasks consist of two parts: feature selection and classification. Feature selection is a transformation of the observations intended to provide the best pathway toward the optimal solution. Therefore, considering multiple features encourages obtaining various candidate solutions, so that we can estimate a solution closer to the optimum than any single local optimum.

In [1], when multiple features are available, it is important to know which features should be used. Theoretically, the more features that are considered, the more effectively the classifier may solve the problem. However, features that overlap in feature space may introduce redundant or irrelevant information and result in counter-effects such as overfitting. Therefore, it is more important to explore and use independent features to train classifiers than to simply increase the number of features. The correlation between feature sets can be inferred from the distribution of the features.



Chapter 4: Classification Algorithms
In [1], there are many classification algorithms in the machine learning literature, but none of them is perfect. Moreover, it is always difficult to decide which one to use and how to set its parameters. Depending on the environment in which the classifier is embedded, some algorithms work well and others do not. This is because the classifier searches a different solution space depending on the algorithm, features, and parameters used. These sets of classifiers produce their own outputs; therefore an ensemble classifier can explore a wider solution space.

4.1 Wavelet Approach

In [5], a wavelet transform is a lossless linear transformation of a signal or data into coefficients on a basis of wavelet functions. The coefficients yielded by the wavelet transform contain information about the characteristics of the data at different scales: fine scales capture local details, and coarse scales capture the global features of a signal. The discrete wavelet transform (DWT) of a signal x is performed by passing it through low-pass filters (scaling functions) and high-pass filters.

Figure 3.1

The wavelet power spectrum is a graphical representation that takes the cumulative information-variation measure at each decomposition level as its data points. The global information variation of gene expression in a given sample can thus be consolidated by plotting the wavelet power spectrum. This feature of the power spectrum may be useful in identifying the characteristics of microarray data. In addition to providing a visualization of the features of microarray data, the global strength of gene expression in a given sample can be consolidated by plotting the wavelet power spectrum, which may reveal many other hidden structures in the data.
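
A small sketch of computing a DWT and a level-wise power spectrum for one expression profile using the PyWavelets package follows. The Daubechies-4 wavelet, the number of levels, and the energy-per-level definition of the spectrum are assumptions of this sketch, not the exact choices made in [5].

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
# Stand-in expression profile of one sample across 1024 genes.
signal = rng.normal(size=1024)

# Multi-level DWT: repeated low-pass (scaling) and high-pass (wavelet) filtering.
coeffs = pywt.wavedec(signal, wavelet="db4", level=5)
# coeffs[0] holds the coarsest approximation, coeffs[1:] the detail coefficients
# from the coarsest to the finest scale.

# One simple "power spectrum": the energy of the coefficients at each level.
energies = [float(np.sum(c ** 2)) for c in coeffs]
labels = ["approx(5)"] + [f"detail({i})" for i in range(5, 0, -1)]
for name, e in zip(labels, energies):
    print(f"{name:>10}: {e:.1f}")
```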

In [7], accurate classification of microarray data plays a vital role in cancer prediction and diagnosis. Previous studies have demonstrated the usefulness of the naïve Bayes classifier in solving various classification problems. In microarray data analysis, however, the conditional independence assumption embedded in the classifier and the characteristics of microarray data, e.g. the extremely high dimensionality, may severely affect the classification performance of the naïve Bayes classifier. This paper presents a sequential feature extraction approach for naïve Bayes classification of microarray data. The proposed approach consists of feature selection by stepwise regression and feature transformation by class-conditional independent component analysis. Experimental results on five microarray datasets demonstrate the effectiveness of the proposed approach in improving the performance of the naïve Bayes classifier in microarray data analysis.
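
Below is a minimal sketch in the spirit of that pipeline: a univariate filter (standing in for stepwise regression) followed by a Gaussian naïve Bayes classifier. Class-conditional ICA is omitted, and the data and parameters are assumptions of this sketch rather than the method of [7].

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Stand-in data: 50 samples x 1000 genes, two classes with a small mean shift.
X = np.vstack([rng.normal(0, 1, size=(25, 1000)), rng.normal(0.5, 1, size=(25, 1000))])
y = np.array([0] * 25 + [1] * 25)

# Reduce the dimensionality before naive Bayes so its independence assumption
# is less severely strained by thousands of correlated genes.
model = make_pipeline(SelectKBest(f_classif, k=20), GaussianNB())
print("cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean().round(3))
```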

Figure 3.2



In [1], we have applied this idea to a classification framework as shown. Given k features and n classifiers, there are k x n feature-classifier combinations, and there are C(kn, m) (i.e. "kn choose m") possible ensemble classifiers when m feature-classifier combinations are selected for the ensemble. Classifiers are trained using the selected features, and a combining module is then used to combine their outputs. After the classifiers are trained independently with their features to produce their own outputs, the final answer is decided by the combining module, where majority voting, weighted voting, or Bayesian combination can be adopted; a minimal sketch of the first two rules is given after the list below.

- Majority voting: a simple ensemble method that selects the class most favored by the base classifiers. Majority voting has the advantage that it requires neither prior knowledge nor any additional complex computation to reach a decision.

- Weighted voting: a poor classifier can affect the result of the ensemble under majority voting because all classifiers receive the same weight. Weighted voting reduces the effect of a poor classifier by assigning each classifier a weight based on its performance.

- Bayesian combination: while majority voting combines classifiers through their results alone, Bayesian combination lets the error probability of each classifier affect the final result. The method combines the classifiers with different weights by using prior knowledge about each classifier.
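
The following is a small sketch of the majority-voting and weighted-voting rules described above, applied to illustrative base-classifier outputs; the predictions, the weights, and the tie-breaking rule are assumptions.

```python
import numpy as np

# Predicted class (0 or 1) of each base classifier for each test sample:
# rows = classifiers, columns = samples.
predictions = np.array([
    [1, 0, 1, 1],   # classifier A
    [1, 1, 0, 1],   # classifier B
    [0, 0, 1, 1],   # classifier C
])
# Per-classifier weights, e.g. proportional to validation accuracy (weighted voting).
weights = np.array([0.9, 0.6, 0.75])

def majority_vote(preds):
    """Pick the class chosen by most base classifiers (ties broken toward class 1)."""
    return (preds.sum(axis=0) * 2 >= preds.shape[0]).astype(int)

def weighted_vote(preds, w):
    """Compare the total weight voting for class 1 against the total weight for class 0."""
    votes_for_1 = w @ preds
    votes_for_0 = w @ (1 - preds)
    return (votes_for_1 >= votes_for_0).astype(int)

print("majority vote :", majority_vote(predictions))
print("weighted vote :", weighted_vote(predictions, weights))
```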

Chapter 5: Conclusion

In this survey, we introduced DNA microarray technology and how it is used. We presented microarray types and methods, and we discussed the benefits of data mining in microarray time series classification using gene selection and pattern classifiers such as k-nearest neighbor and support vector machines.

Chapter 6: Future Work

In the future, we plan to use microarray technology in the treatment of diseases such as hemophilia, brain tumors, and lung cancer.

Chapter 7: References

[1] Sung-Bae Cho and Hong-Hee Won, "Data mining for gene expression profiles from DNA microarray", International Journal of Software Engineering, pp. 593-608, 2003.

[2] Yan Chen, Xiu-Xia Li, Yi-Xue Li, Yun-Ping Zhu and Chun-Hou Zheng, "Tumor classification based on independent component analysis", International Journal of Pattern Recognition and Artificial Intelligence, pp. 297-310, November 2006.

[3] C. H. Park, M. Jeon, P. Pardalos and H. Park, "Quality assessment of gene selection in microarray data", Optimization Methods and Software, pp. 145-154, February 2007.

[4] Jing Zhang, Tianzi Jiang, Bing Liu, Xingpeng Jiang and Huizhi Zhao, "Systematic benchmarking of microarray data feature extraction and classification", International Journal of Computer Mathematics, pp. 803-811, May 2008.

[5] S. Prabakaran, R. Sahu and S. Verma, "A wavelet approach for classification of microarray data", International Journal of Wavelets, Multiresolution and Information Processing, pp. 375-389, May 2008.

[6] Cheng-San Yang, Li-Yeh Chuang, Chao-Hsuan Ke and Cheng-Hong Yang, "A hybrid feature selection method for microarray classification", IAENG International Journal of Computer Science, pp. 285-290, 2008.

[7] Liwei Fan and Peng Zhou, "A sequential feature extraction approach for naïve Bayes classification of microarray data", pp. 9919-9923, August 2009.

[8] Sung-Bae Cho and Kyung-Joong Kim, "An evolutionary algorithm approach to optimal ensemble classifiers for DNA microarray data analysis", IEEE Transactions on Evolutionary Computation, pp. 377-388, June 2008.

