
An Iterative Data Mining Approach for Mining Overlapping Coexpression
Patterns in Noisy Gene Expression Data
Domain : Data Mining
Name : Najana Sahithi Mayukha
Reg.No: 220003075
Class : IV B.TECH CSE-B
Abstract
• Many clustering problems are defined as partitioning problems in the sense
that the similar records are grouped into nonoverlapping partitions.
• However, the clustering of gene expression data to discover coexpressed
genes may not always be meaningful if this problem is reduced into a
partitioning problem.
• This poses a challenge to many clustering algorithms as they are not
originally developed to discover overlapping clusters in noisy gene
expression data.
• We propose an iterative data mining approach that consists of two
phases. The proposed approach has been tested with both artificial and
real datasets.
Introduction
• To effectively discover overlapping clusters in noisy gene
expression data, we propose an iterative data mining approach
that consists of two phases.
• Coexpressed genes are genes that have similar transcriptional
responses to external stress. Effective clustering of gene
expression data is therefore very important, as it makes the
identification of coexpressed genes and the discovery of hidden
patterns in them much easier.
• We provide an overview of clustering algorithms that have been
commonly used for discovering nonoverlapping partitions in
gene expression data.
Methodology
• The Hierarchical Agglomerative Clustering algorithm
• The k-means algorithm
• The Self-Organizing Map (SOM) algorithm
• The Information-Theoretic Co-Clustering algorithm (ITC)
• The Bi-Clustering algorithm (BC)
The Hierarchical Agglomerative Clustering algorithm
• It performs a series of successive fusions of records into partitions. The
fusion process is guided by a measure of similarity between partitions so
that partitions that are similar to each other are merged.
• This fusion process is repeated until all partitions are merged into a
single partition.
• The results of the fusion process are normally presented in the form of a
2-D hierarchical structure, called a dendrogram.
• The records falling along each branch in a dendrogram form a
partition. Users are then required to decide how many partitions they
would like to have by cutting through the branches at a chosen level
of the dendrogram.
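The fusion process described above can be sketched as follows. This is a minimal illustration, not the implementation used in the work: Euclidean distance and average linkage are assumptions (the slides do not name the similarity measure), and the names `agglomerate` and `average_linkage` are hypothetical. Stopping once a chosen number of partitions remains plays the role of cutting the dendrogram at that level.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(records, p, q):
    """Mean pairwise distance between two partitions given as index lists."""
    total = sum(euclidean(records[i], records[j]) for i in p for j in q)
    return total / (len(p) * len(q))

def agglomerate(records, num_partitions):
    """Repeatedly fuse the two most similar partitions until num_partitions remain."""
    partitions = [[i] for i in range(len(records))]  # start with one record each
    while len(partitions) > num_partitions:
        best = None
        for a in range(len(partitions)):
            for b in range(a + 1, len(partitions)):
                d = average_linkage(records, partitions[a], partitions[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        partitions[a] = partitions[a] + partitions[b]  # fuse b into a
        del partitions[b]
    return partitions

# Toy expression profiles (made-up values for illustration only).
profiles = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.2, 4.9], [9.0, 0.0]]
result = agglomerate(profiles, 3)
```

Calling `agglomerate(profiles, 1)` would carry the fusion through to a single partition, as in the full dendrogram.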
The k-means algorithm
• The k-means algorithm requires its users to specify the number of
partitions k ahead of time.
• Given the number of partitions, the k-means algorithm randomly
selects k records as initial partition centroids. Each record is then
assigned to the partition with the nearest centroid, and the centroid of
each partition is recalculated as the mean of all records belonging to
the partition.
• This process of assigning records to the nearest partitions and
recalculating the positions of the centroids is then performed
iteratively until the positions of the centroids remain unchanged.
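The loop of assigning records and recalculating centroids might look like the sketch below. It is an illustrative version under stated assumptions: squared Euclidean distance is assumed as the notion of "nearest", the `seed` and `max_iter` parameters are additions for reproducibility and safety, and an empty partition simply keeps its previous centroid.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance (assumed notion of 'nearest')."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(record, centroids):
    """Index of the centroid closest to the record."""
    return min(range(len(centroids)), key=lambda c: sq_dist(record, centroids[c]))

def mean(rows):
    """Component-wise mean of a non-empty list of records."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def k_means(records, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Randomly select k records as the initial partition centroids.
    centroids = [list(r) for r in rng.sample(records, k)]
    for _ in range(max_iter):
        labels = [nearest(r, centroids) for r in records]
        updated = []
        for c in range(k):
            members = [r for r, lab in zip(records, labels) if lab == c]
            # Assumption: an empty partition keeps its previous centroid.
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:  # centroids unchanged: converged
            break
        centroids = updated
    return [nearest(r, centroids) for r in records], centroids

# Toy expression profiles (made-up values for illustration only).
profiles = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = k_means(profiles, k=2)
```

Because the initial centroids are chosen at random, different seeds can lead to different final partitions.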
The Self-Organizing Map (SOM) algorithm
• It is one of the best-known artificial neural network algorithms that
can be used for data grouping.
• Each neuron of the map is associated with a reference vector, and the
reference vectors together form a codebook. The neurons of the map are
connected to adjacent neurons by a neighborhood relation, which
dictates the topology of the map.
• The basic SOM algorithm requires the topology and the number of
neurons to remain fixed from the beginning. The number of neurons
determines the granularity of the mapping, which has an effect on the
accuracy and generalization of the SOM.
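To make the codebook and neighborhood relation concrete, here is a small sketch of a basic SOM. It uses a one-dimensional chain of neurons with a Gaussian neighborhood. The topology, learning-rate schedule, radius schedule, and the names `train_som` and `best_matching_unit` are all assumptions chosen for brevity, not details taken from the slides.

```python
import math
import random

def best_matching_unit(codebook, record):
    """Index of the neuron whose reference vector is closest to the record."""
    def sq_dist(ref):
        return sum((r - x) ** 2 for r, x in zip(ref, record))
    return min(range(len(codebook)), key=lambda n: sq_dist(codebook[n]))

def train_som(records, num_neurons, epochs=50, seed=0):
    """Train a 1-D SOM; topology and neuron count stay fixed throughout."""
    rng = random.Random(seed)
    dim = len(records[0])
    # The codebook holds one reference vector per neuron.
    codebook = [[rng.random() for _ in range(dim)] for _ in range(num_neurons)]
    for epoch in range(epochs):
        progress = epoch / epochs
        rate = 0.5 * (1.0 - progress)                         # assumed schedule
        radius = max(num_neurons / 2.0 * (1.0 - progress), 0.5)
        for record in records:
            bmu = best_matching_unit(codebook, record)
            for n, ref in enumerate(codebook):
                # Neighbors of the winning neuron on the chain move as well.
                influence = math.exp(-((n - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    ref[d] += rate * influence * (record[d] - ref[d])
    return codebook

# Toy profiles scaled to [0, 1] (made-up values for illustration only).
profiles = [[0.1, 0.2], [0.15, 0.25], [0.8, 0.9], [0.85, 0.8], [0.5, 0.1]]
codebook = train_som(profiles, num_neurons=4)
groups = [best_matching_unit(codebook, p) for p in profiles]
```

Records mapped to the same neuron in `groups` form one group. Using more neurons gives a finer mapping at the cost of generalization.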
The Information-Theoretic Co-Clustering algorithm (ITC)
• ITC can be applied in any situation where a data matrix is given
whose elements represent the relation between its records and its
attributes.
• In the simplest case where only a binary matrix is used, a cocluster
corresponds to a biclique in the corresponding bipartite graph.
• For data clustering, ITC only works on data matrices under the
assumption that the dataset contains a number of mutually exclusive
clusters.
The Bi-Clustering algorithm (BC)
• BC has been proposed to cluster both records and attributes
simultaneously.
• It aims at identifying subsets of records and subsets of attributes by
performing simultaneous clustering of both of them instead of
clustering them separately.
• Specifically, it groups a subset of records and a subset of attributes
into a bicluster such that the records and attributes exhibit similar
behavior.
• A bicluster is formed if its mean squared residue is less than or equal
to a user-specified threshold.
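The threshold test on the mean squared residue can be illustrated as below. The sketch assumes the usual definition of the residue of an entry (its value minus its row mean, minus its column mean, plus the overall mean of the submatrix). The slides do not spell the formula out, so treat it as an assumption. The function names are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix picked out by rows and cols."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    row_means = [sum(r) / len(cols) for r in sub]
    col_means = [sum(c) / len(rows) for c in zip(*sub)]
    overall = sum(row_means) / len(rows)
    total = 0.0
    for r, row in enumerate(sub):
        for c, value in enumerate(row):
            # Residue: deviation of the entry from an additive row+column model.
            residue = value - row_means[r] - col_means[c] + overall
            total += residue ** 2
    return total / (len(rows) * len(cols))

def is_bicluster(matrix, rows, cols, threshold):
    """A bicluster is accepted if its residue does not exceed the threshold."""
    return mean_squared_residue(matrix, rows, cols) <= threshold

# Toy matrix: the first three columns follow a shared additive pattern,
# the fourth does not (made-up values for illustration only).
expression = [[1.0, 2.0, 3.0, 9.0],
              [2.0, 3.0, 4.0, 0.0],
              [4.0, 5.0, 6.0, 5.0]]
coherent = mean_squared_residue(expression, [0, 1, 2], [0, 1, 2])
noisy = mean_squared_residue(expression, [0, 1, 2], [0, 1, 2, 3])
```

Here `coherent` is essentially zero because the rows differ only by a constant shift, while including the fourth column raises the residue well above any small threshold.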
Disadvantages of the Existing System
• When dealing with gene expression data, it should be noted that these
algorithms may not always be able to perform effectively.
• The similarity measures they rely on do not give very accurate
measurements when dealing with noisy data.
• Clustering algorithms based on these measures may, therefore, miss
important global information.
The Proposed System
• We are given a set of gene expression data G consisting of the data
collected from N genes in M experiments carried out under different sets
of conditions.
• Let us represent the dataset as a set of N genes G = {g1, ..., gi, ..., gN},
with each gene gi, i = 1, ..., N, characterized by M attributes
E1, ..., Ej, ..., EM, whose values ei1, ..., eij, ..., eiM, where
eij ∈ domain(Ej), represent the expression value of the ith gene under
the jth experimental condition.
• Phase 1: Discovering Initial Partitioning of Genes.
• Phase 2: Redetermining the Partition Memberships of Genes by Mining
Overlapping Coexpression Patterns.
Phase 1:
• Any of the clustering algorithms mentioned for the existing system can
be used.
• Since these algorithms have been used with gene expression data with
some promising results, we, therefore, use these algorithms in turn to
discover the initial partitioning of genes in a way such that each gene
gi belongs to one of the partitions Pp , where p = 1, . . . , P and P is the
total number of initial partitions discovered.
Phase 2:
• In this phase, a pattern discovery technique, which consists of two
steps, is used to redetermine the partition membership of each gene
iteratively.
• Step 1: Discovering Interesting Association Patterns. Interesting
association patterns are discovered in each partition by detecting
associations between the levels of each attribute Ej of the genes that
belong to a particular partition and the partition label itself.
• Based on mutual information, the weight of evidence W can be
interpreted as a measure of the difference in the gain in information.
The weight of evidence is positive if ejk, a level of attribute Ej,
provides positive evidence supporting the assignment of a gene to Pp;
otherwise, it is negative.
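One way to picture the weight of evidence is as a log-ratio of how often a level occurs inside a partition versus outside it. The sketch below assumes that form, W = log(P(level | Pp) / P(level | not Pp)), with natural logarithms and a small pseudo-count to avoid division by zero. The slides do not give the exact formula, so these choices, and the name `weight_of_evidence`, are assumptions.

```python
import math

def weight_of_evidence(levels, labels, level, partition, pseudo=0.5):
    """Assumed form: log of P(level | partition) over P(level | other partitions).

    levels: discretized level of one attribute Ej for every gene
    labels: current partition label of every gene
    pseudo: smoothing pseudo-count (an assumption, not from the slides)
    """
    num_levels = len(set(levels))
    inside = [lv for lv, lab in zip(levels, labels) if lab == partition]
    outside = [lv for lv, lab in zip(levels, labels) if lab != partition]

    def prob(group):
        return (group.count(level) + pseudo) / (len(group) + pseudo * num_levels)

    return math.log(prob(inside) / prob(outside))

# Toy data: "H"/"L" are hypothetical high/low expression levels of one attribute.
levels = ["H", "H", "H", "L", "L", "L", "H", "L"]
labels = ["P1", "P1", "P1", "P2", "P2", "P2", "P2", "P1"]
w_high = weight_of_evidence(levels, labels, "H", "P1")
w_low = weight_of_evidence(levels, labels, "L", "P1")
```

In this toy data the level "H" is mostly seen in P1, so `w_high` is positive, while `w_low` is negative.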
• Step 2: Redetermining the Partition Memberships of Genes. Given a
collection of M′ selected attributes (M′ ≤ M) that can be used to
construct characteristic descriptions of Pp, the total weight of evidence
(TW) measures the support for assigning a gene to Pp.
• If the TW for assigning a gene to a partition is greater than zero or a
user-defined threshold, the gene can be considered a member of that
partition.
• Based on TW, each gene that has previously been assigned to a particular
partition is then reevaluated to determine whether it should be: 1) kept in
the same partition; or 2) moved to another partition; or 3) assigned to
more than one partition.
• Then, phase 2 is repeated until the partition memberships of genes
remain unchanged.
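A single reevaluation pass of step 2 could be sketched as follows. Two assumptions are made here: TW is taken to be the sum of the weights of evidence over the selected attributes, and a gene with no supporting partition keeps its current membership. The weights table is made up for illustration. In the full procedure the weights would be rediscovered from the updated partitions and the pass repeated until memberships stop changing.

```python
def total_weight(gene_levels, partition_weights):
    """Assumed TW: sum of weights of evidence over the gene's attribute levels."""
    return sum(partition_weights.get((j, level), 0.0)
               for j, level in enumerate(gene_levels))

def reevaluate(genes, memberships, weights, threshold=0.0):
    """One pass: each gene joins every partition whose TW exceeds the threshold."""
    updated = []
    for levels, current in zip(genes, memberships):
        supported = {p for p, w in weights.items()
                     if total_weight(levels, w) > threshold}
        # Assumption: with no supporting partition, keep the current membership.
        updated.append(supported if supported else set(current))
    return updated

# Hypothetical weights of evidence keyed by (attribute index, level).
weights = {
    "P1": {(0, "H"): 1.0, (0, "L"): -1.0, (1, "H"): 0.5, (1, "L"): -0.5},
    "P2": {(0, "H"): -1.0, (0, "L"): 1.0, (1, "H"): 1.5, (1, "L"): -0.8},
}
genes = [["H", "L"], ["L", "L"], ["H", "H"]]
memberships = [{"P1"}, {"P1"}, {"P1"}]
new_memberships = reevaluate(genes, memberships, weights)
```

With these toy weights the first gene stays in P1, the second moves to P2, and the third ends up in both partitions, which covers the three outcomes listed above.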
Experimental Results
• The proposed approach has been used for clustering tasks
involving artificial and real datasets, and its effectiveness has
been evaluated using common statistical measures.
• In addition, the biological significance of the discovered clusters
is also carefully analyzed.
Experimental Datasets
• For experimentation, an artificial dataset containing 450 genes was
used.
• In addition to the artificial dataset, we have also tested the proposed
approach using two sets of real gene expression data. The first dataset
(ED1) contains a set of 517 genes whose expression levels vary in
response to serum concentration in human fibroblasts. The second
dataset (ED2) contains a set of 2945 genes whose expression levels
were measured under different experimental conditions.
Evaluation Measures
• Two statistical performance measures that are commonly used for
gene expression data analysis, the Rand index and the silhouette
measure, are used.
• To evaluate the effectiveness of the proposed approach based on
biological significance, we look at the percentage of genes in each
function category discovered in the initial nonoverlapping partitions
after phase 1 to see if there is a corresponding increase in the
overlapping clusters discovered after phase 2.
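For reference, the two statistical measures can be computed as in the sketch below. It assumes nonoverlapping labels and Euclidean distance. How the measures were adapted to overlapping clusters is not stated in the slides, so this covers only the standard definitions.

```python
import math
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of record pairs on which two partitionings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

def silhouette(records, labels):
    """Mean silhouette width over all records (Euclidean distance assumed)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    scores = []
    for i, rec in enumerate(records):
        by_cluster = {}
        for j, other in enumerate(records):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(rec, other))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy one-dimensional profiles (made-up values for illustration only).
points = [[0.0], [1.0], [10.0], [11.0]]
good = [0, 0, 1, 1]
same_up_to_renaming = rand_index(good, [1, 1, 0, 0])
crossed = rand_index(good, [0, 1, 0, 1])
width = silhouette(points, good)
```

The Rand index ignores how partitions are named, so relabeling the two groups still gives full agreement. A silhouette width close to 1 indicates compact, well-separated partitions.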
Experimental Results
• Since the membership reevaluation measure used by the proposed
pattern discovery technique considers global information as reflected
by the cluster arrangement and distinguishes between relevant and
irrelevant expression data, it is robust to noisy data and can improve
the performance of other clustering algorithms as well.
• Other than the Rand index and the silhouette measure, we have also
evaluated the clustering results according to their biological functions.
We look at the percentages of genes in each function category after
phase 1 and after phase 2.
Conclusion
• In this paper, we propose an iterative data mining approach, which
consists of two phases, to discover overlapping clusters in noisy
gene expression data.
• In phase 1, a clustering algorithm is used to discover the initial,
nonoverlapping partitions in gene expression data.
• Then, the partition membership of each gene in the initial partition
is redetermined iteratively in phase 2 by the pattern discovery
technique so as to determine whether a gene should remain in the
same partition, be moved to another partition, or be assigned to
more than one partition.
• In addition, the approach is able to distinguish between interesting
and uninteresting expression data and identify the relevance of an
expression level in determining a particular grouping arrangement by
discovering interesting patterns between them.