HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 1
Abstract—Analysis of simultaneous clustering of gene expression with biological knowledge has now become an important
technique and standard practice for interpreting the data and its underlying biology. However, common clustering algorithms do not provide a comprehensive approach that looks into all three categories of annotation (biological process, molecular function, and cellular component), and they have not been tested with different functional annotation database formats. Furthermore, traditional clustering algorithms use random initialization, which causes inconsistent cluster generation, and they are unable to determine the number of clusters involved. In this paper, we present a novel computational framework called CluFA (Clustering Functional Annotation) for semi-supervised clustering of gene expression data. The framework consists of three stages: (i) preparation of Gene Ontology (GO) datasets, functional annotation databases, and testing datasets; (ii) fuzzy c-means clustering to find the optimal clusters; and (iii) analysis of the results through computational evaluation and biological validation. By combining the three GO term categories (biological process, molecular function, and cellular component) with functional annotation databases (Saccharomyces Genome Database (SGD), the Yeast Database at Munich Information Centre for Protein Sequences (MIPS), and Entrez), CluFA is able to determine the number of clusters and reduce random initialization. In addition, CluFA is more comprehensive in its capability to predict the functions of unknown genes. We tested the framework for semi-supervised clustering of yeast gene expression data based on multiple functional annotation databases. Experimental results show that 76 clusters were identified via the GO slim dataset. Applying the SGD, Entrez, and MIPS functional annotation databases to reduce random initialization improved performance on both computational evaluation and biological validation. Using all GO term categories achieved the lowest compactness and separation values. From this experiment, we therefore conclude that CluFA improves gene function prediction by combining GO and gene expression values in the fuzzy c-means clustering algorithm, as cross-referenced with the latest SGD annotation.
Index Terms—Fuzzy c-means, Gene expression, Gene ontology, Gene function prediction, Semi-supervised clustering
—————————— ——————————
1 INTRODUCTION
[5, 52], Self Organizing Map (SOM) [8, 37], biclustering [1, 15], and fuzzy c-means [60, 51, 4, 30]. The k-means algorithm partitions a number of objects into k clusters in which each object belongs to the cluster with the nearest mean. SOM creates a set of prototype vectors that represent the data and visualizes the prototypes. Biclustering identifies a subset of genes that behave similarly in a subset of conditions. The fuzzy c-means algorithm calculates the centroid of each cluster and gives each data point a degree of membership. Fuzzy c-means is considered superior to the other algorithms because it gives a probability of belonging to each cluster, thus producing more accurate clusters [14, 22].
The popular usage of the GO for clustering gene expression data has resulted in many experiments on various species, for example Arabidopsis thaliana, Candida albicans, Mus musculus, and S. cerevisiae. The baker's yeast, S. cerevisiae, was the first eukaryote whose genome had been completely sequenced. Several databases are specifically dedicated to functional analysis of the yeast genome, including three major ones: the Saccharomyces Genome Database (SGD) [10], the Yeast Database at Munich Information Centre for Protein Sequences (MIPS) [38], and Entrez [47]. Works related to these databases, SGD [8, 51, 21, 1], MIPS [12, 3, 5, 32, 45], and Entrez [9, 41, 27], provide useful and recent information to further develop yeast genome analysis, including periodically updated lists of proteins with known or predicted functions, phenotypes of mutants (if available), protein-protein interactions, and gene expression patterns.
Despite all the findings discussed above, several drawbacks remain throughout the literature. These drawbacks are:
• The number of clusters was not defined at the beginning of the experiment.
• Non-comprehensive GO term categories were used, resulting in limited understanding of the data.
• Only one particular functional annotation database was considered [54, 46].
• Random assignment of genes in some clustering algorithms [48, 33] causes inconsistent generation of clusters.
To overcome these drawbacks, we present a computational framework named CluFA (Clustering Functional Annotation) that uses fuzzy c-means as the clustering algorithm. The computational framework consists of four steps: (i) initialization of the cluster number, (ii) initialization of fuzzy membership, (iii) calculation of centroids, and (iv) fuzzy membership update. The novel contributions of CluFA are embedded in steps (i) and (ii). The GO slim was used to automatically define the number of clusters, thus overcoming the first above-mentioned drawback. Furthermore, all GO term categories were used to produce an efficient result and a comprehensive output that covers all terms in the GO. In step (ii), we used more than one functional annotation database, namely SGD, MIPS, and Entrez for S. cerevisiae, in order to reduce random initialization during the assignment of cluster membership.
2 METHODS
CluFA consists of three stages, as shown in Fig. 1. The first stage is the preparation of data, in which we used GO datasets, functional annotation databases, and testing datasets. These datasets are used in our semi-supervised clustering process. In the second stage, fuzzy c-means clustering is implemented. It is carried out to assign each gene membership values according to its biological cluster. The final stage is the analysis of the results, in which two evaluation criteria were considered: (i) computational evaluation, and (ii) biological validation. In the computational evaluation, we evaluated our results using a) compactness and separation, b) consistency, and c) accuracy. The compactness and separation evaluation measures the ratio of compact to separate clusters, while the consistency and accuracy evaluations assess the assignment of genes per annotation in each cluster. Meanwhile, in biological validation, the unknown gene function is predicted. In order to validate the predictions, the predicted genes were cross-checked against the SGD, Entrez, and MIPS annotations. Below, we describe each of these processes.
[Figure 1: three stages. Preparation of Data: GO datasets (GO slim, GO term), functional annotation databases (SGD, Entrez, MIPS), testing datasets (Eisen, Gasch). Clustering: number of clusters initialization, fuzzy membership initialization, centroid calculation, fuzzy membership update. Analysis of Result: computational evaluation, biological validation.]
Fig. 1. The computational framework of CluFA.
2.1 Preparation of Data
2.1.1 GO Datasets
Currently, the GO website (http://www.geneontology.org) lists nearly 28,108 terms, which form the controlled vocabulary used to describe gene and gene product attributes in any organism. These terms are classified into one of three ontologies: cellular component, biological process, or molecular function. The terms are structured as a Directed Acyclic Graph (DAG). In this study, we used GO slim and GO term datasets from an older version in order to predict new gene functions. We applied the GO slim dataset to form the initial clusters. The GO slim is a subset of GO terms in which some of the terms are placed at a higher level in the GO hierarchy. In GO, there are GO slims for different organisms or usages. We used the GO slim yeast dataset in OBO (Open Biomedical Ontologies) format, which was generated in September 2005. The GO slim yeast consists of 76 terms. Of these 76 terms, there are 32 terms in biological process, 21 terms in
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 4, APRIL 2011, ISSN 2151-9617
molecular function, and 23 terms in cellular component. To identify and assign each gene to its corresponding cluster, the GO term database in MySQL format was downloaded (updated in September 2005). This dataset contains a total of 19,458 terms and their relationships.
2.1.2 Functional Annotation Databases
The functional annotation databases used in this experiment were downloaded from three different sources, namely SGD, Entrez, and MIPS. The SGD file used was compiled in September 2005 and comprises 33,651 genes. SGD is a scientific database of the molecular biology and genetics of the yeast S. cerevisiae. The Entrez 'gene2go' file, which consists of 52,351 genes, was downloaded in June 2009. Entrez is an integrated search and retrieval system that provides global queries across databases. Lastly, the MIPS file was used to provide a numeric, hierarchical system that denotes the various classes of biological functions. From MIPS, two files were downloaded: 'funcat-2.1' and 'mips2go', which were compiled in March 2007 and consist of 15,924 genes. From these databases, we can also extract the GO annotation evidence codes to reduce the random initialization of fuzzy membership.
2.1.3 Testing Datasets
We tested CluFA on the yeast gene expression datasets of Eisen et al. [18] and Gasch et al. [24]. The Eisen dataset contains the expression profiles of 6,221 yeast genes with 80 samples taken during the diauxic shift, the mitotic cell division cycle, sporulation, and temperature and reducing shocks. Meanwhile, the Gasch dataset contains 6,152 genes with 173 samples that test gene expression behaviour during various stress conditions.
2.2 Clustering
2.2.1 Number of clusters initialization
In this step, a set of GO slim terms is used to initialize the number of clusters. Given GO slim as $GO_{all}$, we used all the 76 terms in $GO_{all}$, in which each term forms a separate cluster, $cl$. Once the clusters have been determined, we assigned the genes in the gene expression data to these clusters. The gene expression dataset is represented as $G$, and each gene in $G$ is represented as $g$, where $g \in G$. Let $t$ be a descendant of $cl$ in the GO hierarchy, where $cl \in GO_{all}$; each $g$ in $t$ is assigned to its corresponding $cl$(s), where there can be more than one $cl$ for each $g$. This is due to the structure of the GO, which is a DAG (Directed Acyclic Graph). For example, gene 'YLR194C' has four GO terms: GO:0009277 (cellular component; chitin- and beta-glucan-containing cell wall), GO:0031505 (biological process; chitin- and beta-glucan-containing cell wall organization and biogenesis), GO:0005199 (molecular function; structural constituent of cell wall), and GO:0046658 (cellular component; anchored to plasma membrane). Here GO:0046658, as a descendant of $cl$, has two parents: GO:0005886 (cellular component; plasma membrane) and GO:0016020 (cellular component; membrane). The rest of the GO terms have only one parent. For example, GO:0009277 has only one parent, GO:0005618 (cellular component; cell wall), GO:0031505 has one parent, GO:0007047 (biological process; cell wall organization and biogenesis), and GO:0005199 has only one parent, GO:0005198 (molecular function; structural molecule activity). This example of cluster initialization is illustrated in Fig. 2.
[Figure 2: gene ID YLR194C, its associated GO terms, and the related GO slim clusters. GO:0009277 → GO:0005618; GO:0031505 → GO:0007047; GO:0005199 → GO:0005198; GO:0046658 → GO:0005886 and GO:0016020.]
Fig. 2. An example of how gene YLR194C is assigned to its corresponding clusters.
2.2.2 Fuzzy membership initialization
In the initial determination of membership, once the gene $g$ has been assigned to its corresponding $cl$(s), the initial membership value $U_{ij}^{(0)}$ for $g$ is defined using the following formula:
$$U_{ij}^{(0)} \;\ldots\; rs_{ij} \;\ldots\; 1 \;\ldots\; r, \qquad i = 1, \ldots, g_n, \quad j = 1, \ldots, cl_l \qquad (1)$$
where $rs_{ij}$ is the reliability score according to the GO annotation evidence code from the GO annotation in SGD, Entrez, and MIPS which supports $g$ being in $cl$. The constant in formula (1) gives some variation in each iteration, and $r$ is a small constant that denotes a level of reliability when $g_i$ has no GO annotation evidence code. The reliability score is obtained by assigning each GO annotation evidence code a weight between 0 and 1 based on the hierarchy of reliability from the GO, with values closest to 1 being the most reliable. During the assignment of initial membership, the most reliable score is chosen when more than one GO annotation evidence code is assigned to a particular gene.
The GO annotation evidence codes and their reliability scores are as follows: Inferred from Electronic Annotation (IEA: 0.5), Non-traceable Author Statement (NAS: 0.6), Inferred from Reviewed Computational Analysis (RCA: 0.7), Inferred from Sequence or Structural Similarity (ISS: 0.7), Inferred by Curator (IC: 0.7), Inferred from Expression Pattern (IEP: 0.7), Inferred from Physical Interaction (IPI: 0.8), Inferred from Mutant Phenotype (IMP: 0.8), Inferred from Genetic Interaction (IGI: 0.8), Inferred from Direct Assay (IDA: 0.9), and Traceable Author Statement (TAS: 0.9). Next, the functional annotation databases are applied, in which we checked the GO annotation evidence codes through SGD, MIPS, and Entrez gene separately. From these three databases, we matched each gene with its respective $rs_{ij}$. An example of initial membership assignment for these datasets is shown in Table 1.
TABLE 1
AN EXAMPLE OF $U_{ij}^{(0)}$ INITIAL MEMBERSHIP ASSIGNMENT WHEN THE CONSTANT OF FORMULA (1) IS 0.2 AND r = 0.3 FOR YLR194C
TABLE 2
THE DISTRIBUTION NUMBER OF GENES WITH GO ANNOTATION EVIDENCE CODE AND WITHOUT GO ANNOTATION EVIDENCE CODE.
In the absence of a GO annotation evidence code, the following definition is used:
$$U_{ij}^{(0)} = r \qquad (2)$$
In formulas (1) and (2), both the constant of formula (1) and $r$ need to be assigned values within the range between 0 and 1. The constant is assigned in order to check the reliability of $U_{ij}^{(0)}$ in both gene expression and GO annotation. Meanwhile, $r$ is the reliability score when $g_i$ does not have a GO annotation evidence code. The assignment of $r$ is needed in order to
2.2.3 Centroid calculation
Next, we computed the centroid of each cluster, which denotes the most similar characteristics among the data points:
$$c_j = \frac{\sum_{i=1}^{n} (u_{ij})^m x_i}{\sum_{i=1}^{n} (u_{ij})^m} \qquad (3)$$
where $m$ is the fuzzy parameter and $n$ is the number of genes.
2.2.4 Fuzzy membership update
The initial membership, based on a pre-defined minimal value, was given to each cluster member whose GO annotation evidence code is unknown. Membership initialization of $u_{ij}$ of $g_i$ in $cl_j$ is done so that $U^{(0)} = [u_{ij}]$. Then, $U^{(k)} = [u_{ij}]$ is updated, where $u_{ij}$ denotes the probability of belongingness of pattern $x_i$ to $cl$. The formula to update the fuzzy membership [6] is given by:
$$u_{ij} = \frac{u_{ij}}{\sum_{k=1}^{n_c} u_{ik}} \qquad (4)$$
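The clustering steps of Section 2.2 can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy `PARENTS` and `GO_SLIM` tables (restricted to the YLR194C example), and the default fuzzy parameter `m=2.0` are assumptions made for illustration. The reliability scores are those listed in Section 2.2.2, and the fallback `r=0.3` is the value used in the Table 1 caption. The sketch selects the most reliable score per gene and cluster but does not apply the constant of formula (1).

```python
# Reliability score per GO evidence code, as listed in Section 2.2.2.
RELIABILITY = {
    "IEA": 0.5, "NAS": 0.6, "RCA": 0.7, "ISS": 0.7, "IC": 0.7, "IEP": 0.7,
    "IPI": 0.8, "IMP": 0.8, "IGI": 0.8, "IDA": 0.9, "TAS": 0.9,
}

# Toy subset of the GO DAG for the YLR194C example (hypothetical tables).
PARENTS = {
    "GO:0009277": ["GO:0005618"],
    "GO:0031505": ["GO:0007047"],
    "GO:0005199": ["GO:0005198"],
    "GO:0046658": ["GO:0005886", "GO:0016020"],
}
GO_SLIM = {"GO:0005618", "GO:0007047", "GO:0005198", "GO:0005886", "GO:0016020"}


def slim_clusters(term, parents=PARENTS, slim=GO_SLIM):
    """Step (i): walk up the DAG from a GO term and collect every GO slim
    term reached; one term may map to more than one cluster."""
    found, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
        stack.extend(parents.get(current, []))
    return found


def reliability_score(evidence_codes, r=0.3):
    """Step (ii): pick the most reliable score among the evidence codes;
    fall back to r when no known code is present (formula 2).
    Codes missing from RELIABILITY are ignored (an assumption)."""
    scores = [RELIABILITY[c] for c in evidence_codes if c in RELIABILITY]
    return max(scores) if scores else r


def centroids(u, x, m=2.0):
    """Step (iii), Equation 3: c_j = sum_i u_ij^m x_i / sum_i u_ij^m.
    Assumes every cluster has at least one non-zero membership."""
    result = []
    for j in range(len(u[0])):
        weights = [row[j] ** m for row in u]
        denom = sum(weights)
        result.append([
            sum(w * xi[d] for w, xi in zip(weights, x)) / denom
            for d in range(len(x[0]))
        ])
    return result


def normalize_memberships(u):
    """Step (iv), Equation 4: divide each u_ij by the sum over clusters.
    Assumes each row sum is positive, which holds when r > 0."""
    return [[v / sum(row) for v in row] for row in u]


if __name__ == "__main__":
    print(sorted(slim_clusters("GO:0046658")))
    print(reliability_score(["IEA", "IDA"]), reliability_score([]))
    u = normalize_memberships([[0.9, 0.3], [0.3, 0.5]])
    print(centroids(u, [[1.0, 2.0], [3.0, 4.0]]))
```

In this sketch a gene is a row of `u` and a cluster is a column, so normalization makes each gene's memberships sum to one before the centroids are recomputed.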
with the initial membership to handle a large amount of numeric data. The optimal cluster is determined by the compactness and separation (CS) function, taken as the minimum CS value after a pre-defined number of iterations. The formula is:
$$CS = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n_{cl}} u_{ij}^{2} \, \lVert c_j - x_i \rVert^{2}}{n \, \min_{i \neq j} \lVert c_j - c_i \rVert^{2}} \qquad (6)$$
Subsequently, if the current CS is less than the minimum CS ($CS^*$), then $CS^* = CS$, the optimal cluster $C^* = C^k$, and the optimal membership $U^* = U^k$. These steps are repeated until the algorithm reaches the end of the iterations, leaving no empty clusters. $C^*$ and $U^*$ are the output of the algorithm.
2.3 Analysis of results
The analysis of the results is divided into computational evaluation and biological validation. In the computational evaluation, the following measurements are considered: (i) compactness and separation, (ii) consistency, and (iii) accuracy (precision, recall, and F-measure). For the CS evaluation, the quality of the clusters is measured in terms of the distribution of the genes within a cluster (compactness) and the distribution of the clusters among each other (separation). Meanwhile, the significance of the biological information is evaluated using the consistency and accuracy measurements. The cluster output is therefore evaluated in terms of both cluster quality and biological significance.
Compactness and separation: This measure determines the ratio of the compactness within a cluster to the separation of the cluster from other clusters [56, 17, 55], as defined by Equation 6, in which the smallest value of CS denotes the minimum intra-cluster distance and the maximum inter-cluster distance.
Consistency: This measure checks the consistency of the annotation in the output cluster by the definition below:
$$CT = 1 - \frac{m}{n} \qquad (7)$$
where, for every $cl$, $m$ refers to the number of genes most frequently annotated in $cl$, while $n$ refers to the total number of genes in $cl$. By this definition, a smaller CT represents a more consistent cluster.
Precision: This measure is the ratio of the number of genes in the final cluster to the total number of genes in the final cluster added to the number of genes that are not in the initial cluster:
$$precision = \frac{rl}{ir + rl} \qquad (8)$$
where $rl$ refers to the number of genes in the final cluster, while $ir$ refers to the number of genes that are not in the initial cluster.
Recall: This measure is the ratio of the number of genes in the final cluster to the total number of genes in the final cluster added to the number of genes not in the final cluster:
$$recall = \frac{rl}{ar + rl} \qquad (9)$$
where $ar$ refers to the number of genes in the initial cluster
F-measure: This measure is defined as the harmonic mean of pairwise precision and recall, where the traditional information retrieval measures are adapted for evaluating the accuracy of the clustering algorithm:
$$F\text{-}measure = \frac{2 \times precision \times recall}{precision + recall} \qquad (10)$$
For biological validation, gene functions are predicted. To do this, the predicted genes are thoroughly cross-checked against the latest (2009) annotation database from SGD, so that the initial predictions for the unknown genes can be confirmed against the current GO annotation.
3 RESULTS AND DISCUSSION
3.1 Evaluation on the Impact of the GO Term Categories
We tested different categories of GO terms (biological process, molecular function, and cellular component) with the SGD functional annotation using the Eisen data to assess the comprehensiveness of the cluster output. Our main aim was to look into the effect of using different combinations of GO term categories. Eisen was used as the testing dataset, while the SGD functional annotation database was used to examine the impact of the different combinations. Fig. 3 shows that by using the combination of the biological process, molecular function, and cellular component GO term categories, the clustering produced the lowest CS and hence the minimum intra-cluster and maximum inter-cluster distances. This indicates that combining all GO term categories produced the best results. Based on this result, we assumed that the MIPS and Entrez functional annotation databases would also return the best results with the combination of all GO term categories and, at the same time, produce consistent clustering results. This is the reason why we used the combination of all GO term categories in the next evaluation.
3.2 Evaluation on the Compactness and Separation of the Cluster
We used the fuzzy validity criterion, CS, adopted from Xie and Beni [50], to measure our clustering results in terms of compactness (intra-cluster) and separation (inter-cluster). Table 3 and Table 4 show the values of CS for the clusters using different values of the constant of formula (1) and $r$. Both values are applied to determine the gene membership values in each cluster. The lowest value of CS represents the best combination of compactness and separation in the final result. As presented in both tables, the clustering results show the most compact cluster with furthest separation between the clusters when
The value in bold denotes the optimal CS for SGD (19.02), Entrez
(19.14), and MIPS (73.20).
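The evaluation measures of Section 2.3 (Equations 6 to 10) can be computed as in the following sketch. It uses plain Python lists; the function names and the toy data in the usage lines are hypothetical and are not taken from the paper. It assumes at least two clusters with distinct centroids, since the separation term divides by the smallest squared distance between centroids.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def compactness_separation(u, x, c):
    """Equation 6: weighted within-cluster scatter divided by n times the
    smallest squared distance between two different centroids."""
    n = len(x)
    compactness = sum(
        u[i][j] ** 2 * squared_distance(c[j], x[i])
        for i in range(n)
        for j in range(len(c))
    )
    separation = min(
        squared_distance(c[a], c[b])
        for a in range(len(c))
        for b in range(a + 1, len(c))
    )
    return compactness / (n * separation)


def consistency(m, n):
    """Equation 7: m genes carry the most frequent annotation out of n."""
    return 1 - m / n


def precision(rl, ir):
    """Equation 8."""
    return rl / (ir + rl)


def recall(rl, ar):
    """Equation 9."""
    return rl / (ar + rl)


def f_measure(prec, rec):
    """Equation 10: harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec)


if __name__ == "__main__":
    # Toy data: four points, two crisp clusters (illustrative values only).
    x = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
    c = [[0.0, 0.5], [4.0, 0.5]]
    u = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(compactness_separation(u, x, c))
    print(consistency(3, 4), f_measure(precision(8, 2), recall(8, 8)))
```

A lower value from `compactness_separation` corresponds to tighter, better separated clusters, which is how the optimal CS is selected in the text.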
TABLE 5
after each run are very small, which is evidence of consistency in the result. To further evaluate the result, we also conducted tests for precision, recall, and F-measure in comparison with the Fang and Tari algorithms, based on the application of biological knowledge (GO) and random initialization. Our method, which reduces the randomness of gene assignment, was compared with the random method used by Fang. We also compared it with the Tari method, which used only one GO term category and one functional annotation database. Both the Fang and Tari methods used the same benchmark dataset, which enables us to properly interpret the results. In Table 7, we present the performance of CluFA in precision, recall, and F-measure with the SGD, MIPS, and Entrez functional annotation databases for the Eisen and Gasch datasets. We used the optimal CS value from Table 3 and Table 4, and it produced better results for recall and F-measure compared to Fang and Tari. In comparing with the Tari algorithm, we used their optimal values for the Eisen dataset (a constant value of 0.3 and r = 0.2), while for the Fang algorithm, the default threshold value of 0.21 was used. The comparative results of precision, recall, and F-measure obtained for the different algorithms support the accuracy of CluFA. This leads us to conclude that our cluster initialization using the functional annotation databases with the fuzzy c-means clustering algorithm gives better biological significance compared to the other algorithms.
TABLE 7
COMPARISON ON THE PRECISION, RECALL, AND F-MEASURE OF THE OTHER CLUSTERING ALGORITHMS.
3.4 Evaluation on the Impact of the GO annotation
In this study, our method incorporated the GO annotations in the SGD, Entrez, and MIPS functional annotation databases in order to reduce random initialization during the membership assignment, thus producing better results. We therefore tested different percentages of sampled annotations in order to assess the impact of the gene assignment relative to the original annotations. This is done by calculating the average number of genes assigned to the optimal clusters, with the original annotations as our reference. We used a random number of type double in the range of 0 to 1 as a probability value for each gene in the expression list. If the random number for a gene is higher than the sampling value, we include the gene in the sampling list. For example, if we choose 0.25 as the sampling value for a 75% sample annotation, any gene with a random number higher than 0.25 is included in the sampling list. As illustrated in Fig. 5 and Fig. 6, we studied the number of genes assigned to the individual clusters when 25%, 50%, and 75% of the original annotations were used. The number of annotations for the different percentages of annotations is shown in Table 8 and Table 9.
TABLE 8
AVERAGE NUMBER OF ANNOTATIONS RANDOMLY SAMPLED IN 30 REPETITIONS FROM THE ORIGINAL 21,615 ANNOTATIONS FROM EISEN DATASET.
TABLE 9
AVERAGE NUMBER OF ANNOTATIONS RANDOMLY SAMPLED IN 30 REPETITIONS FROM THE ORIGINAL 21,615 ANNOTATIONS FROM GASCH DATASET
Fig. 5. The comparison results of the different annotation percentage from the functional annotation databases on Eisen dataset.
Sampling of annotations taken from the Eisen and Gasch datasets depicted a uniform pattern in the number of genes assigned and the percentage of GO annotations used. The results also show that consistency in the assignment of cluster members to the correct GO clusters, according to their original full annotation, can be achieved despite the various degrees of annotation used. Although yeast has bountiful annotations compared to other species, CluFA can also be used for gene expression analysis involving organisms whose annotations are less bountiful. Furthermore, this experiment covers multiple functional annotation databases, with a separate run for each database. This shows that our method can support different types of functional
database formats. which we used 76 terms to form 76 clusters. The ran‐
dom initialization has been reduced with the incorpo‐
ration of functional annotation databases (SGD, Entrez,
and MIPS). The GO annotation evidence code taken
from functional annotation databases was used to cal‐
culate the initial membership thus producing consis‐
tent results. Findings from the results proved that with
the incorporation of the functional annotation databas‐
es, the clusters are highly compacted with furthest se‐
paration between clusters. The GO terms in the clus‐
ters are also more consistent and accurate. The usage
of all GO term categories has been utilized to produce
Fig. 6. The comparison results of the different annotation percentage
from the functional annotation databases on Gasch dataset. comprehensive results which we used combination of
biological process, molecular function, and cellular
3.5 The Unknown Gene Functions Prediction
component in our semi‐supervised clustering process.
For the SGD functional annotation database collected in
It is proved that by applying all the GO term catego‐
2005, 4,799 genes are marked as not annotated in both
Eisen and Gasch datasets. To validate the predictions, ries, results are better in terms of compactness and se‐
some of the predicted results from CluFA were compared paration. Furthermore, our newly proposed computa‐
against the SGD, Entrez, and MIPS functional annotations tional framework can predict the gene functions by the
for the unknown genes. They provide literature from ma-
utilization of the GO terms in order to group the un‐
nually curated, high-throughput, and computational an-
notation to support the evidence assigned to the genes. known genes. There are a number of other issues that
One of the findings from the results shows that biological we are aware of, but are beyond the scope of this in‐
process in GO term category for gene ‘YIL064W’ has not vestigation. For example, the genes overlap among
been annotated and is assigned as No Biological Data clusters, which may induce confusion regarding the
Available (ND). However, CluFA managed to assign this
gene to three categories for both Eisen and Gasch datasets overall cluster assignments. This could be improved by
which are GO:0016192 (biological process; vesicle-mediated finding the most dominant cluster for a particular gene
transport), GO:0005737 (cellular component; cytoplasm), so that a gene will appear in only one cluster. Thus,
and GO:0016740 (molecular function; transferase activity). suggest the gene’s most dominant function. In the fu‐
According to Martin-Granados et al. [36], ‘YIL064W’ was
ture, we will extract the overlapping gene and assign it
assigned to GO:0016192 (biological process; vesicle-
mediated transport), GO:0005737 (cellular component; cy- to the most dominant group. Furthermore, our current
toplasm), and GO:0008757 (molecular function; S- computational framework can also be extended to oth‐
adenosylmethionine-dependent methyltransferase activity). It er biological knowledge such as pathway and protein‐
has been proved that CluFA can predict the unknown
protein interaction.
biological process term for gene ‘YIL064W’, confirmed by
the current SGD functional annotation. Table 10 presents
some of the results of the predicted terms. ACKNOWLEDGMENT
We would like to thank Rathiah Hashim from the Univer-
siti Tun Hussein Onn Malaysia for proofreading this
4 CONCLUSION
journal. The authors would like to thank Universiti Tun
The aim of this paper is to give an overview of the Hussein Onn Malaysia for supporting this research. This
weaknesses of existing clustering algorithms for gene project is funded by the Malaysian Ministry of Science,
expression, and to introduce a new computational Technology, and Innovation (MOSTI) under ScienceFund
grant no. 02-01-06-SF0068
framework named CluFA that utilizes prior know‐
ledge from the GO to overcome those weaknesses. REFERENCES
However, in order to show the strength of CluFA, [1] F. Angiulli, E. Cesario, C. Pizzuti, Random walk biclustering
three main drawbacks in the literature are discussed. for microarray data, Information Sciences, 178 (6) (2008) 1479‐
They are the number of clusters, random initialization, 1497.
and comprehensive GO terms and their solutions have [2] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler,
J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig,
been presented. The determination in the number of
M.A. Harris, D.P. Hill, L. Issel‐Tarver, A. Kasarskis, S. Lewis,
clusters has been solved using GO slim dataset in J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G.
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 4, APRIL 2011, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 9
TABLE 10 [5] M. Bereta, T. Burczyński, Immune K‐means and negative
selection algorithms for data analysis, Information Sciences,
PREDICTED ANNOTATION BY CLUFA VERSUS THE CURRENT AN- 179 (10) (2009) 1407‐1425.
NOTATION FROM SGD, ENTREZ, AND MIPS. [6] J.C. Bezdek, Pattern recognition with fuzzy objective function
algorithms. Kluwer Academic Publishers: Norwell, MA; 1981.
[7] J. Botet, L. Mateos, J.L. Revuelta, M.A. Santos, A chemoge‐
nomic screening of sulfanilamide‐hypersensitive saccharo‐
myces cerevisiae mutants uncovers ABZ2, the gene encoding
a fungal aminodeoxychorismate lyase, Eukaryot Cell, 6 (11)
(2007) 2102‐2111.
[8] M. Brameier, C. Wiuf, Co‐clustering and visualization of gene
expression data and gene ontology terms for S. cerevisiae us‐
ing self‐organizing maps, Journal of Biomedical Informatics,
40 (2) (2007) 160‐173.
[9] Y. Cheng, R.M. Miura, B. Tian, Prediction of mRNA polyade‐
nylation sites by support vector machine, Bioinformatics, 22
(19) (2006) 2320‐2325.
[10] J.M. Cherry, C. Adler, C. Ball, S.A. Chervitz, S.S. Dwight, E.T.
Hester, Y. Jia, G. Juvik, T. Roe, M. Schroeder, S. Weng, D.
Botstein, SGD: saccharomyces genome database, Nucleic Ac‐
ids Research, 26 (1) (1998) 73‐80.
[11] E. Choi, J.M. Dial, D.E. Jeong, M.C. Hall, Unique D box and
KEN box sequences limit ubiquitination of Acm1 and pro‐
mote pseudosubstrate inhibition of the anaphase‐promoting
complex, The Journal of Biological Chemistry, 283 (35) (2008)
23701‐23710.
[12] A. Clare, R.D. King, Predicting gene function in Saccharo‐
myces cerevisiae. Bioinformatics, 19 (Suppl 2) (2003) 42‐49.
[13] I.G. Costa, R. Krause, L. Optiz, A. Schliep, Semi‐supervised
learning for the identification of syn‐expressed genes from
fused microarray and in situ image data, BMC Bioinformat‐
ics, 8 (2007) S3.
[14] G. Cui, X. Cao, Y. Wang, L. Cao, B. Huang, C. Yang, Wavelet
packet decomposition‐based fuzzy clustering algorithm for
gene expression data, in: Proceedings of the Asia Pacific Con‐
ference on Circuits and Systems, 2006, pp.1027‐1030.
[15] P.A. DiMaggio Jr, S.R. McAllister, C.A. Floudas, X.J. Feng,
J.D. Rabinowitz,, H.A. Rabitz, Biclustering via optimal re‐
ordering of data matrices in systems biology: rigorous me‐
thods and comparative studies, BMC Bioinformatics, 9 (2008)
458.
[16] M.T. Dittrich, G.W. Klau, A. Rosenwald, T. Dandekar, T.
Muller, Identifying functional modules in protein‐protein in‐
teraction networks: An integrated exact approach, Bioinfor‐
matics, 24 (13) (2008) i223‐i231.
The value in brackets for each GO term category denotes the following:
(P) biological process, (F) molecular function, and (C) cellular component.
respectively. Currently, he is Director of the Laboratory of Computational
Intelligence and Biotechnology at Universiti Teknologi Malaysia. His
research interests include bioinformatics, artificial intelligence, software
agents, parallel computing, and web semantics. In March 2005, he was awarded
the Young Researcher award by the Malaysian Association of Research
Scientists (MARS). Two of his inventions, software products named 2D
Engineering Drawing Extractor and 2D Design Structure Recognizer, have won
5 awards at the 21st Invention and New Product Exposition held in
Pittsburgh, USA, including the Best Invention of the Pacific Rim, and a gold
medal award at the 34th International Exhibition of Inventions of New
Techniques and Products held in Geneva, Switzerland.

Rathiah Hashim received her B.Sc. degree in Computer Science from Wichita
State University, USA, and her M.Sc. degree in Computer Science from
Universiti Teknologi Malaysia. She obtained her Ph.D. degree in
visualization and psychology at Swansea University, UK. She is currently a
Senior Lecturer at the Faculty of Computer Science and Information
Technology, Universiti Tun Hussein Onn Malaysia (UTHM). Her research areas
include Video Visualization, Image Processing, Psychology (Visual
Perception), and HCI.