- The document proposes a two-phase iterative approach to discover overlapping gene expression clusters in noisy data. Phase 1 uses clustering algorithms to generate initial non-overlapping partitions. Phase 2 redetermines gene membership in partitions by mining association patterns between gene expression levels and partition labels. The approach was tested on artificial and real gene expression datasets and showed improved performance over traditional clustering algorithms.
Patterns in Noisy Gene Expression
Domain: Data Mining
Name: Najana Sahithi Mayukha
Reg. No.: 220003075
Class: IV B.TECH CSE-B

Abstract
• Many clustering problems are defined as partitioning problems, in the sense that similar records are grouped into nonoverlapping partitions.
• However, clustering gene expression data to discover coexpressed genes may not always be meaningful if the problem is reduced to a partitioning problem.
• This poses a challenge to many clustering algorithms, as they were not originally developed to discover overlapping clusters in noisy gene expression data.
• We propose an iterative data mining approach that consists of two phases. The proposed approach has been tested with both artificial and real datasets.

Introduction
• To effectively discover overlapping clusters in noisy gene expression data, we propose an iterative data mining approach that consists of two phases.
• Coexpressed genes are genes that have a similar transcriptional response to external stress. Effective clustering of gene expression data is therefore very important, as it makes the identification of coexpressed genes and the discovery of hidden patterns in them much easier.
• We provide an overview of clustering algorithms that have been commonly used for discovering nonoverlapping partitions in gene expression data.

Methodology
• The Hierarchical Agglomerative Clustering algorithm
• The k-means algorithm
• The Self-Organizing Map (SOM) algorithm
• The Information-Theoretic Co-Clustering algorithm (ITC)
• The Bi-Clustering algorithm (BC)

The Hierarchical Agglomerative Clustering algorithm
• It performs a series of successive fusions of records into partitions. The fusion process is guided by a measure of similarity between partitions, so that partitions that are similar to each other are merged.
• This fusion process is repeated until all partitions are merged into a single partition.
• The results of the fusion process are normally presented in the form of a 2-D hierarchical structure called a dendrogram.
• The records falling along each branch of the dendrogram form a partition. Users then decide how many partitions they would like to have by cutting through the branches at a chosen level of the dendrogram.

The k-means algorithm
• The k-means algorithm requires its users to specify the number of partitions k ahead of time.
• Given the number of partitions, the k-means algorithm randomly selects k records as the initial partition centroids. Each record is assigned to the partition with the nearest centroid, and the centroid of each partition is then recalculated as the mean of all records belonging to the partition.
• This process of assigning records to the nearest partitions and recalculating the positions of the centroids is performed iteratively until the positions of the centroids remain unchanged.

The Self-Organizing Map (SOM) algorithm
• It is one of the best-known artificial neural network algorithms that can be used for data grouping.
• Each neuron of the map is associated with a reference vector, and the reference vectors together form a codebook. The neurons of the map are connected to adjacent neurons by a neighborhood relation, which dictates the topology of the map.
• The basic SOM algorithm requires the topology and the number of neurons to remain fixed from the beginning. The number of neurons determines the granularity of the mapping, which affects the accuracy and generalization of the SOM.
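
The k-means loop described above (random initial centroids, nearest-centroid assignment, mean update, repetition until the centroids stop moving) can be sketched in plain Python. This is only a minimal illustration: the function names, the use of squared Euclidean distance, the `seed` and `max_iter` parameters, and the handling of empty partitions are assumptions made here, not details taken from the slides.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(records):
    """Component-wise mean of a non-empty list of records."""
    n = len(records)
    return tuple(sum(column) / n for column in zip(*records))


def k_means(records, k, seed=0, max_iter=100):
    """Return (labels, centroids) for a basic k-means run.

    `seed` and `max_iter` are illustrative safeguards, not part of
    the description in the slides.
    """
    rng = random.Random(seed)
    # Randomly select k records as the initial partition centroids.
    centroids = [tuple(r) for r in rng.sample(records, k)]
    labels = []
    for _ in range(max_iter):
        # Assign every record to the partition with the nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(r, centroids[c]))
            for r in records
        ]
        # Recalculate each centroid as the mean of its member records.
        new_centroids = []
        for c in range(k):
            members = [r for r, lab in zip(records, labels) if lab == c]
            # Assumption: an empty partition keeps its previous centroid.
            new_centroids.append(mean_vector(members) if members else centroids[c])
        # Stop once the centroid positions remain unchanged.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `k_means(records, 2)` on a list of expression profiles (tuples of floats) returns one partition label per record, which is the kind of nonoverlapping partitioning that phase 1 of the proposed approach starts from.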

The Information-Theoretic Co-Clustering algorithm (ITC)
• ITC can be applied in any situation where a data matrix is given whose elements represent the relation between its records and its attributes.
• In the simplest case, where only a binary matrix is used, a cocluster corresponds to a biclique in the corresponding bipartite graph.
• For data clustering, ITC works only under the assumption that the dataset contains a number of mutually exclusive clusters.

The Bi-Clustering algorithm (BC)
• BC has been proposed to cluster both records and attributes simultaneously.
• It aims at identifying subsets of records and subsets of attributes by clustering both at the same time instead of clustering them separately.
• Specifically, it groups a subset of records and a subset of attributes into a bicluster such that the records and attributes exhibit similar behavior.
• A bicluster is formed if its mean squared residue is less than or equal to a user-specified threshold.

Disadvantages of the Existing System
• When dealing with gene expression data, these algorithms may not always perform effectively.
• The similarity measures they rely on do not give very accurate measurements when dealing with noisy data.
• Clustering algorithms based on these measures may, therefore, miss important global information.

The Proposed System
• We are given a set of gene expression data G collected from N genes in M experiments carried out under different sets of conditions.
• The dataset is represented as a set of N genes G = {g1, ..., gi, ..., gN}. Each gene gi, i = 1, ..., N, is characterized by M attributes E1, ..., Ej, ..., EM. Its values ei1, ..., eij, ..., eiM, where eij ∈ domain(Ej), give the expression value of the ith gene under the jth experimental condition.
• Phase 1: Discovering the Initial Partitioning of Genes.
• Phase 2: Redetermining the Partition Memberships of Genes by Mining Overlapping Coexpression Patterns.

Phase 1:
• Any of the clustering algorithms described for the existing system can be used.
• Since these algorithms have been used with gene expression data with some promising results, we use them in turn to discover the initial partitioning of genes, such that each gene gi belongs to one of the partitions Pp, where p = 1, ..., P and P is the total number of initial partitions discovered.

Phase 2:
• In this phase, a pattern discovery technique consisting of two steps is used to redetermine the partition membership of each gene iteratively.
• Step 1: Discovering Interesting Association Patterns. Interesting association patterns are discovered in each partition by detecting associations between the levels of each attribute Ej of the genes that belong to a particular partition and the partition label itself.
• Based on mutual information, the weight of evidence W can be interpreted as a measure of the difference in information gain. The weight of evidence is positive if ejk, the kth level of attribute Ej, provides positive evidence supporting the assignment of a gene to Pp; otherwise, it is negative.
• Step 2: Redetermining the Partition Memberships of Genes. Given a collection of M′ selected attributes (M′ ≤ M) that can be used to construct characteristic descriptions of Pp, the total weight of evidence (TW) measures how strongly a gene's expression levels support its assignment to Pp.
• If the TW for assigning a gene to a partition is greater than zero, or greater than a user-defined threshold, the gene is considered a member of that partition.
• Based on TW, each gene that has previously been assigned to a particular partition is reevaluated to determine whether it should be 1) kept in the same partition, 2) moved to another partition, or 3) assigned to more than one partition.
• Phase 2 is then repeated until the partition memberships of the genes remain unchanged.
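
One pass of phase 2 can be sketched as follows. The slides do not give the formula for W, so this sketch assumes a commonly used form, W = log(Pr(Pp | ejk) / Pr(Pp)) − log(Pr(not Pp | ejk) / Pr(not Pp)), and takes TW as the sum of W over a gene's observed levels. The function names, the smoothing constant `alpha`, the toy data, and the use of all M attributes instead of a selected subset are likewise illustrative assumptions, and the selection of "interesting" patterns in Step 1 is not modeled.

```python
import math


def weight_of_evidence(levels, labels, partition, j, level, alpha=0.5):
    """W of the event 'attribute j takes `level`' for membership in `partition`.

    Assumed form: log(P(Pp|e) / P(Pp)) - log(P(not Pp|e) / P(not Pp)).
    `alpha` is additive smoothing (an assumption) that keeps every
    probability strictly between 0 and 1.
    """
    n = len(levels)
    n_in = sum(1 for lab in labels if lab == partition)
    n_e = sum(1 for row in levels if row[j] == level)
    n_e_in = sum(
        1 for row, lab in zip(levels, labels)
        if row[j] == level and lab == partition
    )
    p_in = (n_in + alpha) / (n + 2 * alpha)
    p_in_given_e = (n_e_in + alpha) / (n_e + 2 * alpha)
    return (math.log(p_in_given_e / p_in)
            - math.log((1 - p_in_given_e) / (1 - p_in)))


def total_weight(levels, labels, partition, gene_row):
    """TW: sum of W over all attribute levels observed for one gene."""
    return sum(
        weight_of_evidence(levels, labels, partition, j, level)
        for j, level in enumerate(gene_row)
    )


def reassign(levels, labels, threshold=0.0):
    """Give each gene every partition whose TW exceeds the threshold."""
    partitions = sorted(set(labels))
    return [
        {p for p in partitions
         if total_weight(levels, labels, p, row) > threshold}
        for row in levels
    ]


# Hypothetical toy data: discretized levels ("H" = high, "L" = low) of two
# attributes for ten genes, with initial phase-1 labels A, B and C.
levels = [("H", "L")] * 3 + [("H", "H")] + [("L", "H")] * 3 + [("L", "L")] * 3
labels = ["A"] * 4 + ["B"] * 3 + ["C"] * 3
memberships = reassign(levels, labels)
```

In this toy example the gene with levels ("H", "H"), initially placed in A, also receives positive total evidence for B, so it ends up in both partitions, while the other genes keep a single membership. The full approach would repeat the pass until the memberships stop changing, which requires extending the counts to genes that belong to several partitions.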

Experimental Results
• The proposed approach has been used for clustering tasks involving artificial and real datasets, and its effectiveness has been evaluated using common statistical measures.
• In addition, the biological significance of the discovered clusters has been carefully analyzed.

Experimental Datasets
• For experimentation, an artificial dataset containing 450 genes was used.
• In addition to the artificial dataset, we have also tested the proposed approach using two sets of real gene expression data. The first dataset (ED1) contains a set of 517 genes whose expression levels vary in response to serum concentration in human fibroblasts. The second dataset (ED2) contains a set of 2945 genes whose expression levels were measured under different experimental conditions.

Evaluation Measures
• Two statistical performance measures that are commonly used for gene expression data analysis, the Rand index and the silhouette measure, are used.
• To evaluate the effectiveness of the proposed approach in terms of biological significance, we look at the percentage of genes in each function category discovered in the initial nonoverlapping partitions after phase 1 and check whether there is a corresponding increase in the overlapping clusters discovered after phase 2.

Experimental Results
• As noted for the existing system, traditional clustering algorithms may miss important global information.
• The membership reevaluation measure used by the proposed pattern discovery technique considers global information, as reflected by the cluster arrangement, and distinguishes between relevant and irrelevant expression data. It is therefore robust to noisy data and can improve the performance of other clustering algorithms as well.
• Besides the Rand index and the silhouette measure, we have also evaluated the clustering results according to their biological functions, by comparing the percentages of genes in each function category after phase 1 and after phase 2.
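
For reference, the Rand index mentioned above can be computed as the fraction of gene pairs on which two labelings agree, that is, pairs placed together in both or apart in both. The sketch below uses this standard definition for nonoverlapping labelings; the function name and the error handling are assumptions, and the slides do not say how the measure was applied to overlapping clusters.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated the same way by two labelings.

    A pair counts as an agreement when both labelings put the two items
    in the same group, or both put them in different groups.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two items are required")
    agreements = sum(
        1 for i, j in pairs
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
    )
    return agreements / len(pairs)
```

A value of 1.0 means the two labelings induce exactly the same grouping, regardless of how the groups are named, so a clustering result can be compared directly against known class labels such as those of an artificial dataset.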

Conclusion
• In this paper, we propose an iterative data mining approach, consisting of two phases, to discover overlapping clusters in noisy gene expression data.
• In phase 1, a clustering algorithm is used to discover the initial, nonoverlapping partitions in gene expression data.
• In phase 2, the partition membership of each gene in the initial partitioning is redetermined iteratively by the pattern discovery technique, to decide whether the gene should remain in the same partition, be moved to another partition, or be assigned to more than one partition.
• In addition, the approach is able to distinguish between interesting and uninteresting expression data and to identify the relevance of an expression level in determining a particular grouping arrangement by discovering interesting patterns between them.