
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 4, APRIL 2011, ISSN 2151-9617

HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 1

Biological-based Semi-supervised Clustering Algorithm to Improve Gene Function Prediction
Shahreen Kasim, Safaai Deris, Razib M. Othman, and Rathiah Hashim

Abstract—Analysis of simultaneous clustering of gene expression with biological knowledge has now become an important
technique and standard practice to present a proper interpretation of the data and its underlying biology. However, common
clustering algorithms do not provide a comprehensive approach that looks into the three categories of annotations (biological
process, molecular function, and cellular component), and they were not tested with different functional annotation database formats.
Furthermore, traditional clustering algorithms use random initialization, which causes inconsistent cluster generation, and are
unable to determine the number of clusters involved. In this paper, we present a novel computational framework called CluFA
(Clustering Functional Annotation) for semi-supervised clustering of gene expression data. The framework consists of three
stages: (i) preparation of Gene Ontology (GO) datasets, functional annotation databases, and testing datasets; (ii) fuzzy
c-means clustering to find the optimal clusters; and (iii) analysis of computational evaluation and biological validation from the
results obtained. By combining the three GO term categories (biological process, molecular function, and cellular
component) with functional annotation databases (Saccharomyces Genome Database (SGD), the Yeast Database at Munich
Information Centre for Protein Sequences (MIPS), and Entrez), CluFA is able to determine the number of clusters and
reduce random initialization. In addition, CluFA is more comprehensive in its capability to predict the functions of unknown
genes. We tested our new computational framework for semi-supervised clustering of yeast gene expression data based on
multiple functional annotation databases. Experimental results show that 76 clusters were identified via the GO slim dataset.
By applying the SGD, Entrez, and MIPS functional annotation databases to reduce random initialization, performance on both
computational evaluation and biological validation was improved. By using the comprehensive GO term categories, the
lowest compactness and separation values were achieved. Therefore, from this experiment, we can conclude that CluFA
improved gene function prediction through the utilization of GO and gene expression values using the fuzzy c-means
clustering algorithm, cross-referenced with the latest SGD annotation.

Index Terms—Fuzzy c-means, Gene expression, Gene ontology, Gene function prediction, Semi-supervised clustering

• S. Kasim is with the Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, Malaysia.
• S. Deris is with the Universiti Teknologi Malaysia, 81310 UTM Skudai, Malaysia.
• R. M. Othman is with the Universiti Teknologi Malaysia, 81310 UTM Skudai, Malaysia.
• R. Hashim is with the Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, Malaysia.

1 INTRODUCTION

Gene function prediction has become a popular and challenging research area in Bioinformatics due to the rising number of complete genome sequences. This condition has triggered various experiments to predict unknown genes from known gene functions (for example Zhang et al. [59], Mostafavi et al. [39], and Xutao et al. [57]). These works are beneficial to our livelihood, especially in areas such as disease treatment and gene and drug discovery. The two most common techniques for predicting gene functions in gene expression data are classification and clustering. The former can be regarded as supervised learning, where samples or classes are needed to predict the class label for future observations. It is a very useful and accurate technique when a large number of high-quality samples are used to train the model. On the other hand, the latter does not require samples or classes in training the model. This technique can be considered more advantageous than classification, as genes are assigned randomly, without any prior knowledge, to group a set of genes into a number of mutually exclusive subsets. It can divide data into groups of genes that share patterns of co-expression across the large number of variables in gene expression data.

In recent developments, many researchers have used prior biological knowledge in their semi-supervised clustering techniques for function prediction [13, 61]. Among the popular sources of biological knowledge are pathways [62, 26, 29, 44], protein-protein interaction [34, 53, 16, 40, 20], and GO [2], [42, 27, 49, 21, 58, 51]. After a comprehensive literature review, GO was chosen for our framework because it is well structured and its rigorously controlled vocabulary makes it applicable in in-silico analysis searches. It provides more comprehensive knowledge by offering terms that are categorized into biological process, molecular function, and cellular component. These terms are the basic annotations of genes and gene products for all living organisms, used by researchers as single [35], double [41], or multiple [23, 25] GO term categories. However, the use of multiple GO term categories produces more accurate and exhaustive results.

GO has been used in several algorithms for clustering expression data, such as k-means

[5, 52], Self Organizing Map (SOM) [8, 37], biclustering [1, 15], and fuzzy c-means [60, 51, 4, 30]. The k-means algorithm partitions a number of objects into k clusters, in which each object belongs to the cluster with the nearest mean. SOM creates a set of prototype vectors which represent the data and visualizes the prototypes. Biclustering refers to finding a subset of genes that behave similarly in a subset of conditions. The fuzzy c-means algorithm calculates the centroid of each cluster and gives each data point a degree of membership. Fuzzy c-means is considered superior to the other algorithms in that it gives a probability of belonging to each cluster, producing more accurate clusters [14, 22].

The popular usage of GO for clustering gene expression data has resulted in many experiments using various species, for example Arabidopsis thaliana, Candida albicans, Mus musculus, and S. cerevisiae. The baker's yeast, S. cerevisiae, was the first eukaryote whose genome was completely sequenced. There are several databases specifically dedicated to functional analysis of the yeast genome, including three major databases: the Saccharomyces Genome Database (SGD) [10], the Yeast Database at Munich Information Centre for Protein Sequences (MIPS) [38], and Entrez [47]. Works related to these databases (SGD [8, 51, 21, 1], MIPS [12, 3, 5, 32, 45], and Entrez [9, 41, 27]) provide useful and recent information to further develop yeast genome analysis, including periodically updated lists of proteins with known or predicted functions, phenotypes of mutants (if available), protein-protein interactions, and gene expression patterns.

Despite all the findings discussed above, there are still several drawbacks detected throughout the literature. These drawbacks are:
• The number of clusters was not defined at the beginning of the experiment.
• Non-comprehensive GO term categories were used, resulting in limited understanding of the data.
• Only one particular functional annotation database was examined [54, 46].
• Random assignment of genes in some clustering algorithms [48, 33] causes inconsistent generation of clusters.

To overcome these drawbacks, we present a computational framework named CluFA (Clustering Functional Annotation) that uses fuzzy c-means as the clustering algorithm. The computational framework consists of four steps: (i) initialization of cluster number, (ii) initialization of fuzzy membership, (iii) calculation of centroid, and (iv) fuzzy membership update. The novel contributions of CluFA are embedded in steps (i) and (ii). In step (i), the GO slim was used to automatically define the number of clusters, thus overcoming the first above-mentioned drawback. Furthermore, all GO term categories were used to produce efficient results and comprehensive output covering all terms in the GO. In step (ii), we used more than one functional annotation database, namely SGD, MIPS, and Entrez for S. cerevisiae, in order to reduce random initialization during the assignment of cluster membership.

2 METHODS

CluFA consists of three stages, as shown in Fig. 1. The first stage is the preparation of data, in which we used GO datasets, functional annotation databases, and testing datasets. These datasets are used in our semi-supervised clustering process. In the second stage, fuzzy c-means clustering is implemented. It assigns each gene membership values according to its biological cluster. Lastly, the final stage is the analysis of the results, in which two evaluation criteria were considered: (i) computational evaluation, and (ii) biological validation. In the computational evaluation, we evaluated our results using a) compactness and separation, b) consistency, and c) accuracy. The first evaluation measures the ratio of cluster compactness to cluster separation, while the consistency and accuracy evaluations assess the assignment of genes per annotation in each cluster. Meanwhile, in biological validation, the unknown gene function is predicted. In order to validate the predictions, the predicted genes were cross-checked against the SGD, Entrez, and MIPS annotations. Below, we describe each of these processes.

[Figure 1 shows three stages. Preparation of Data: GO datasets (GO slim, GO term), functional annotation databases (SGD, Entrez, MIPS), and testing datasets (Eisen, Gasch). Clustering: number of cluster initialization, fuzzy membership initialization, centroid calculation, and fuzzy membership update. Analysis of Result: computational evaluation and biological validation.]
Fig. 1. The computational framework of CluFA.

2.1 Preparation of Data

2.1.1 GO Datasets
Currently, the GO website (http://www.geneontology.org) lists nearly 28,108 terms, which form the controlled vocabulary used to describe gene and gene product attributes in any organism. These terms are classified into one of three ontologies: cellular component, biological process, or molecular function. The terms are structured as a Directed Acyclic Graph (DAG). In this study, we used the GO slim and GO term datasets from an older version in order to predict new gene functions. We applied the GO slim dataset to form the initial clusters. The GO slim is a subset of GO terms in which some of the terms are placed at a higher level in the GO hierarchy. In GO, there are GO slims for different organisms or usages. We used the GO slim yeast dataset in OBO (Open Biomedical Ontologies) format, which was generated in September 2005. The GO slim yeast consists of 76 terms. Of these 76 terms, there are 32 terms in biological process, 21 terms in

molecular function, and 23 terms in cellular component. To identify and assign each gene to its corresponding cluster, the GO term database in MySQL format was downloaded (updated in September 2005). There are a total of 19,458 terms and their relationships in this dataset.

2.1.2 Functional Annotation Databases
The functional annotation databases used in this experiment were downloaded from three different sources, namely SGD, Entrez, and MIPS. The SGD file used was compiled in September 2005 and comprises 33,651 genes. It is a scientific database of the molecular biology and genetics of the yeast S. cerevisiae. The Entrez 'gene2go' file was downloaded in June 2009 and consists of 52,351 genes. Entrez is known as an integrated search and retrieval system that provides global queries across databases. Lastly, the MIPS file was used to provide a numeric, hierarchical system that denotes the various classes of biological functions. In MIPS, two files were downloaded: 'funcat-2.1' and 'mips2go', which were compiled in March 2007 and consist of 15,924 genes. From these databases, we can also extract the GO annotation evidence codes to reduce the random initialization of fuzzy membership.

2.1.3 Testing Datasets
We tested CluFA on the yeast gene expression datasets of Eisen et al. [18] and Gasch et al. [24]. The Eisen dataset contains the expression profiles of 6,221 yeast genes with 80 samples taken during the diauxic shift, the mitotic cell division cycle, sporulation, and temperature and reducing shocks. Meanwhile, the Gasch dataset contains 6,152 genes with 173 samples that test gene expression behaviour during various stress conditions.

2.2 Clustering

2.2.1 Number of clusters initialization
In this step, a set of GO slim terms is used to initialize the number of clusters. Given GO slim as $GO_{all}$, we used all 76 terms in $GO_{all}$, in which each term forms a separate cluster, $cl$. Once the clusters have been determined, we assigned the genes in the gene expression data to each of these clusters. The gene expression dataset is represented as $G$, and each gene in $G$ is represented as $g$, where $g \in G$. Let $t$ be a descendant of $cl$ in the GO hierarchy, where $cl \in GO_{all}$; each $g$ in $t$ is assigned to its corresponding $cl$(s), where there can be more than one $cl$ for each $g$. This is due to the structure of the GO, which is a DAG (Directed Acyclic Graph). For example, gene 'YLR194C' has four GO terms, which are GO:0009277 (cellular component; chitin- and beta-glucan-containing cell wall), GO:0031505 (biological process; chitin- and beta-glucan-containing cell wall organization and biogenesis), GO:0005199 (molecular function; structural constituent of cell wall), and GO:0046658 (cellular component; anchored to plasma membrane). Among these, GO:0046658 has two parents under the descendants of $cl$: GO:0005886 (cellular component; plasma membrane) and GO:0016020 (cellular component; membrane). The rest of the GO terms have only one parent. For example, GO:0009277 has only one parent, GO:0005618 (cellular component; cell wall), GO:0031505 has one parent, GO:0007047 (biological process; cell wall organization and biogenesis), and GO:0005199 has only one parent, GO:0005198 (molecular function; structural molecule activity). This example of cluster initialization is illustrated in Fig. 2.

Gene ID    Associated GO Term    Related GO Slim cluster
YLR194C    GO:0009277            GO:0005618
           GO:0031505            GO:0007047
           GO:0005199            GO:0005198
           GO:0046658            GO:0005886, GO:0016020

Fig. 2. An example of how gene YLR194C is assigned to its corresponding clusters.

2.2.2 Fuzzy membership initialization
In the initial determination of membership, once the gene $g$ has been assigned to its corresponding $cl$(s), the initial membership value $U_{ij}^{(0)}$ for $g$ is defined using the following formula:
$$U_{ij}^{(0)} = rs_{ij}(1 - \alpha) + \alpha r, \quad i = 1, \ldots, g_n, \quad j = 1, \ldots, cl_l, \qquad (1)$$
where $rs_{ij}$ is the reliability score according to the GO annotation evidence code from the GO annotation in SGD, Entrez, and MIPS which supports $g$ being in $cl$. The constant $\alpha$ gives some variation in each iteration, and $r$ is a small constant denoting the level of reliability when $g_i$ has no GO annotation evidence code. The reliability score is obtained by assigning each GO annotation evidence code a weight between 0 and 1 based on the hierarchy of reliability from the GO, with values closest to 1 being the most reliable. During the assignment of initial membership, the most reliable score is chosen when more than one GO annotation evidence code is assigned to a particular gene.
The GO annotation evidence codes and their reliability scores are as follows: Inferred from Electronic Annotation (IEA: 0.5), Non-traceable Author Statement (NAS: 0.6), Inferred from Reviewed Computational Analysis (RCA: 0.7), Inferred from Sequence or Structural Similarity (ISS: 0.7), Inferred by Curator (IC:

0.7), Inferred from Expression Pattern (IEP: 0.7), Inferred from Physical Interaction (IPI: 0.8), Inferred from Mutant Phenotype (IMP: 0.8), Inferred from Genetic Interaction (IGI: 0.8), Inferred from Direct Assay (IDA: 0.9), and Traceable Author Statement (TAS: 0.9). Next, the functional annotation databases are applied: we checked the GO annotation evidence codes in SGD, MIPS, and Entrez Gene separately. From these three databases, we matched each gene with its respective $rs_{ij}$. An example of initial membership assignment for these datasets is shown in Table 1.

TABLE 2
THE DISTRIBUTION OF THE NUMBER OF GENES WITH AND WITHOUT GO ANNOTATION EVIDENCE CODES.
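The initial membership assignment of Section 2.2.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that formula (1) reads rs(1 - alpha) + alpha * r and that the no-evidence case reduces to alpha * r; the operators are an interpretation of the printed formulas. The function name `initial_membership`, the treatment of unlisted evidence codes as "no evidence", and the example inputs are hypothetical. The reliability scores are the ones listed in the text.

```python
# Reliability scores for GO annotation evidence codes, as listed in the text.
RELIABILITY = {
    "IEA": 0.5, "NAS": 0.6, "RCA": 0.7, "ISS": 0.7, "IC": 0.7, "IEP": 0.7,
    "IPI": 0.8, "IMP": 0.8, "IGI": 0.8, "IDA": 0.9, "TAS": 0.9,
}


def initial_membership(evidence_codes, alpha, r):
    """Initial membership of one gene in one GO slim cluster.

    evidence_codes: evidence codes supporting the gene in this cluster
                    (hypothetical input format, may be empty).
    alpha, r:       constants, assumed here to lie in [0, 1].
    """
    if not (0.0 <= alpha <= 1.0 and 0.0 <= r <= 1.0):
        raise ValueError("alpha and r must lie in [0, 1]")

    # Assumption: codes missing from the table count as "no evidence".
    scores = [RELIABILITY[c] for c in evidence_codes if c in RELIABILITY]
    if not scores:
        # No evidence code: assumed form of the no-evidence definition.
        return alpha * r

    # The most reliable score is used when several codes are present.
    rs = max(scores)
    # Assumed form of formula (1).
    return rs * (1.0 - alpha) + alpha * r


# Hypothetical example with alpha = 0.2 and r = 0.3.
with_evidence = initial_membership(["IEA", "IDA"], alpha=0.2, r=0.3)
without_evidence = initial_membership([], alpha=0.2, r=0.3)
print(with_evidence, without_evidence)
```

With these inputs the gene supported by IDA receives a membership of about 0.78, while a gene without any evidence code receives about 0.06, so annotated genes start much closer to their GO slim cluster than unannotated ones.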

TABLE 1
AN EXAMPLE OF $U_{ij}^{(0)}$ INITIAL MEMBERSHIP ASSIGNMENT WHEN $\alpha = 0.2$, $r = 0.3$ FOR YLR194C

In the absence of a GO annotation evidence code, the following definition is used:
$$U_{ij}^{(0)} = \alpha r, \qquad (2)$$
In formulas (1) and (2), both $\alpha$ and $r$ need to be assigned within the range $0 \le \alpha, r \le 1$. The value of $\alpha$ is assigned to check the reliability of $U_{ij}^{(0)}$ in both gene expression and GO annotation. Meanwhile, $r$ is the reliability score when $g$ does not have a GO annotation evidence code. The assignment of $r$ is needed in order to assign the unknown genes to $t$ based on their expression patterns. The initial membership values $U_{ij}^{(0)}$ will influence the final results, thus improving the final membership values when $g$ has a GO annotation evidence code. The distribution of the number of genes with or without GO annotation evidence codes for the SGD, MIPS, and Entrez datasets is shown in Table 2.

2.2.3 Centroid calculation
Next, we computed the centroid of each cluster to denote the most similar characteristics among data points, based on Bezdek et al. [6]. Assuming $x_i$ is the vector of expression values for $g_i$, the fuzzy centroid set $C^{(k)} = [c_j]$, for iteration number $k$ and $j = 1, \ldots, cl_l$, is quantified as:
$$c_j = \frac{\sum_{i=1}^{n} (u_{ij})^m x_i}{\sum_{i=1}^{n} (u_{ij})^m}, \qquad (3)$$
where $m$ is the fuzzy parameter and $n$ is the number of genes.

2.2.4 Fuzzy membership update
The initial membership based on a pre-defined minimal value was given to each cluster member whose GO annotation evidence code is unknown. Membership initialization of $u_{ij}$ of $g_i$ in $cl_j$ is done so that $U^{(0)} = [u_{ij}]$. Then $U^{(k)} = [u_{ij}]$ is updated, where $u_{ij}$ denotes the probability of pattern $x_i$ belonging to $cl$. The formula to update the fuzzy membership [6] is given by:
$$u_{ij} = \frac{u_{ij}}{\sum_{k=1}^{nc} u_{ik}}, \qquad (4)$$
where
$$u_{ij} = \frac{\left(\frac{1}{\|x_i - c_j\|}\right)^{\frac{1}{m-1}}}{\sum_{j=1}^{nc} \left(\frac{1}{\|x_i - c_j\|}\right)^{\frac{1}{m-1}} u_{ij}^{(0)}}, \qquad (5)$$
in which the role of $u_{ij}^{(0)}$ in the denominator of equation (5) is to normalize the value of the membership update

with the initial membership to handle a large amount of numeric data. The optimal cluster is determined by the compactness and separation (CS) function, taken as the minimum CS value after a pre-defined number of iterations. The formula is:
$$CS = \frac{\sum_{i=1}^{n} \sum_{j=1}^{ncl} u_{ij}^{2} \, \|c_j - x_i\|^2}{n \, \min_{i \ne j} \|c_j - c_i\|^2}, \qquad (6)$$
Subsequently, if the current CS is less than the minimum CS ($CS^*$), then $CS^* = CS$, the optimal cluster $C^* = C^k$, and the optimal membership $U^* = U^k$. These steps are repeated until the algorithm reaches the end of the iterations, leaving no empty clusters. $C^*$ and $U^*$ are the output of the algorithm.

2.3 Analysis of results
The analysis of the results is divided into computational evaluation and biological validation. In the computational evaluation, these measurements are considered: (i) compactness and separation, (ii) consistency, and (iii) accuracy (precision, recall, and F-measure). For CS evaluation, the quality of the cluster is measured in terms of the distribution of each gene in the cluster (compactness) and the distribution of clusters among clusters (separation). Meanwhile, the biological information significance is evaluated using consistency and accuracy measurements. Therefore, the cluster output is evaluated for both cluster quality and biological significance.
Compactness and separation: This measure determines the ratio of compactness within the cluster to separation of the cluster from other clusters [56, 17, 55], as defined by Equation 6, in which the smallest value of CS denotes the minimum intra-cluster distance and the maximum inter-cluster distance.
Consistency: This measure checks the consistency of the annotation in the output cluster by the definition below:
$$CT = 1 - \frac{m}{n}, \qquad (7)$$
where, for every $cl$, $m$ refers to the number of genes most frequently annotated in $cl$ while $n$ refers to the total number of genes in $cl$. By this definition, a smaller CT represents a more consistent cluster.
Precision: This measure is the ratio of the number of genes in the final cluster to the total of the number of genes in the final cluster and the number of genes that are not in the initial cluster:
$$precision = \frac{rl}{ir + rl}, \qquad (8)$$
where $rl$ refers to the number of genes in the final cluster, while $ir$ refers to the number of genes that are not in the initial cluster.
Recall: This measure is the ratio of the number of genes in the final cluster to the total number of genes in the final cluster added with the number of genes not in the final cluster:
$$recall = \frac{rl}{ar + rl}, \qquad (9)$$
where $ar$ refers to the number of genes in the initial cluster.
F-measure: This measure is defined as the harmonic mean of pairwise precision and recall, where the traditional information retrieval measures are adapted for evaluating the accuracy of the clustering algorithm:
$$F\text{-}measure = \frac{2 \times precision \times recall}{precision + recall}, \qquad (10)$$
For biological validation, the prediction of gene functions takes place. To do this, the predicted genes are thoroughly cross-checked against the latest 2009 annotation database from SGD to predict the unknown genes, eventually confirming our initial predictions with the current GO annotation.

3 RESULTS AND DISCUSSION

3.1 Evaluation on the Impact of the GO Term Categories
We tested different categories of GO terms (biological process, molecular function, and cellular component) with the SGD functional annotation using the Eisen data to assess the comprehensiveness of the cluster output. Our main aim was to look into the effect of using different combinations of GO term categories. Eisen was used as the testing dataset, while the SGD functional annotation database was used to examine the impact of the different combinations of GO term categories. Fig. 3 shows that by using the combination of the biological process, molecular function, and cellular component GO term categories, the clustering produced the lowest CS and hence the minimum intra-cluster and maximum inter-cluster distances. This showed that combining all GO term categories produced the best results. Based on this result, we assumed that the MIPS and Entrez functional annotation databases would also return the best results with the combination of all GO term categories and, at the same time, produce consistent clustering results. This is the reason why we used the combination of all GO term categories in the next evaluation.

3.2 Evaluation on the Compactness and Separation of the Cluster
We used the fuzzy validity criterion, CS, which was adopted from Xie and Beni [50], to measure our clustering results in terms of compactness (intra-cluster) and separation (inter-cluster). Table 3 and Table 4 show the values of CS for the clusters using different values of $\alpha$ and $r$. The $\alpha$ and $r$ values are applied to determine the gene membership values in each cluster. The lowest value of CS represents the best combination of compactness and separation in the final result. As presented in both tables, the clustering results show the most compact clusters with the furthest separation between the clusters when $\alpha$

= 0.3 and $r$ = 0.2 using the SGD functional annotation for both the Eisen and Gasch datasets. For the Entrez functional annotation, the most compact clusters with the furthest separation between the clusters are obtained when $\alpha$ = 0.3 and $r$ = 0.2 for the Eisen dataset and when $\alpha$ = 0.2 and $r$ = 0.2 for the Gasch dataset. Meanwhile, the clustering results for the MIPS functional annotation database show the most compact clusters with the furthest separation between the clusters when $\alpha$ = 0.1 and $r$ = 0.1 for both the Eisen and Gasch datasets. The results also showed that, on average, only 2.5% of the genes changed cluster in comparison with their initial cluster. The results confirmed that applying different values of $\alpha$ and $r$ influences the reliability score and gene expression values during the calculation of membership assignment, which eventually produces highly compact clusters with maximum separation between clusters.

TABLE 3
RESULT OF THE EVALUATION ON THE COMPACTNESS AND SEPARATION OF THE CLUSTER USING EISEN DATASET.
The value in bold denotes the optimal CS for SGD (10.10), Entrez (11.30), and MIPS (62.41).

TABLE 4
RESULT OF THE EVALUATION ON THE COMPACTNESS AND SEPARATION OF THE CLUSTER USING GASCH DATASET.
The value in bold denotes the optimal CS for SGD (19.02), Entrez (19.14), and MIPS (73.20).
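The CS criterion of Equation (6), together with the centroid of Equation (3), can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code used to produce Tables 3 and 4. The function names `centroids` and `cs_index`, the array layout (genes in rows), the toy data, and the choice to take the minimum over distinct pairs of centroids are all assumptions.

```python
import numpy as np


def centroids(x, u, m=2.0):
    """Equation (3): membership-weighted mean of the expression vectors.

    x: (n_genes, n_samples) expression matrix.
    u: (n_genes, n_clusters) membership matrix; every cluster is assumed
       to have nonzero total membership.
    m: fuzzy parameter.
    """
    w = u ** m                                  # (n_genes, n_clusters)
    return (w.T @ x) / w.sum(axis=0)[:, None]   # (n_clusters, n_samples)


def cs_index(x, u, c):
    """Equation (6): compactness divided by n times the minimum separation."""
    n = x.shape[0]
    # Squared distance from every gene to every centroid.
    d2 = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    compactness = (u ** 2 * d2).sum()
    # Squared distance between every pair of centroids.
    cc2 = ((c[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    # Assumption: the minimum runs over distinct centroid pairs only.
    off_diagonal = ~np.eye(c.shape[0], dtype=bool)
    separation = cc2[off_diagonal].min()
    return compactness / (n * separation)


# Toy data (hypothetical): four genes, two samples, two crisp clusters.
x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
u = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
c = centroids(x, u)
cs = cs_index(x, u, c)
print(c, cs)
```

On the toy data the centroids are (0, 0.5) and (10, 0.5), the compactness term is 1, and the squared centroid distance is 100, so CS = 1 / (4 × 100) = 0.0025. Tight, well-separated clusters drive the value toward zero, which is why the lowest CS is taken as optimal.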

Fig. 4. The performance of single, double and multiple combination of GO term categories of SGD functional annotation database in Eisen dataset.

TABLE 5
RESULT OF EVALUATION ON THE CONSISTENCY OF THE CLUSTERS USING EISEN DATASET.
The value in bold denotes the optimal consistency for SGD (0.19), Entrez (0.08), and MIPS (0.24).

TABLE 6
RESULT OF EVALUATION ON THE CONSISTENCY OF THE CLUSTERS USING GASCH DATASET.
The value in bold denotes the optimal consistency for SGD (0.12), Entrez (0.08), and MIPS (0.22).

3.3 Evaluation on the Biological Significance of the Cluster
To further evaluate the reliability of our clustering results with the functional annotation, we use our own CT measure. The consistency, as well as the inconsistency, of the number of genes annotated by GO terms in a cluster can be determined using this measurement. The smallest value of CT represents a more powerful and consistent result. As presented in Table 5 (using the Eisen dataset), our clustering result achieved the lowest CT value when $\alpha$ = 0.5 and $r$ = 0.3 for the SGD and Entrez functional annotations. Meanwhile, for the MIPS functional annotation, our clustering result produced the lowest CT value when $\alpha$ = 0.7 and $r$ = 0.3. In Table 6 (using the Gasch dataset), our clustering result achieved the lowest CT value when $\alpha$ = 0.5 and $r$ = 0.4 for the SGD functional annotation, while for the Entrez functional annotation the lowest CT value is achieved when $\alpha$ = 0.5 and $r$ = 0.3. For the MIPS functional annotation, our clustering result produced the lowest CT value when $\alpha$ = 0.7 and $r$ = 0.3. The dashes ('-') in Table 5 and Table 6 are due to clusters containing zero genes. This happens when the membership values are beyond the threshold range and the CT values cannot be calculated. Based on this evaluation, the differences of the CT values for the various $\alpha$ and $r$

after each run are very small, which is evidence of consistency in the result. To further evaluate the result, we also conducted tests for precision, recall, and F-measure in comparison with the Fang and Tari algorithms, based on the application of biological knowledge (GO) and random initialization. Our method, which reduces the randomness of gene assignment, was compared to the random method used by Fang. We also compared it with the Tari method, which used only one GO term category and one functional annotation database. Both the Fang and Tari methods used the same benchmark datasets, which enables us to properly interpret the results. In Table 7, we present the performance of CluFA in precision, recall, and F-measure with the SGD, MIPS, and Entrez functional annotation databases for the Eisen and Gasch datasets. We used the optimal CS values from Table 3 and Table 4, and they produced better results for recall and F-measure compared to Fang and Tari. In comparing with the Tari algorithm, we used their optimal values for the Eisen dataset ($\alpha$ = 0.3 and $r$ = 0.2), while for the Fang algorithm the default threshold value of 0.21 was used. The comparative results of precision, recall, and F-measure obtained for the different algorithms support the accuracy of CluFA. This leads us to conclude that our cluster initialization using the functional annotation databases with the fuzzy c-means clustering algorithm gives better biological significance compared to the other algorithms.

TABLE 7
COMPARISON ON THE PRECISION, RECALL, AND F-MEASURE OF THE OTHER CLUSTERING ALGORITHMS.

3.4 Evaluation on the Impact of the GO annotation
In this study, our method incorporated the GO annotations in the SGD, Entrez, and MIPS functional annotation databases in order to reduce random initialization during the membership assignment, thus producing better results. Therefore, we tested different percentages of sample annotations in order to assess the impact of the gene assignment on the original sample annotations. This is done by calculating the average number of genes assigned to the optimal clusters with the original annotations as our reference. We used a random number of type double in the range of 0 to 1 as a probability value for each gene in the expression list. If the random number for a gene is higher than the sampling value, then we include the gene in the sampling list. For example, if we choose 0.25 as the sampling value for 75% sample annotation, any gene with a random number higher than 0.25 will be included in the sampling list.

TABLE 8
AVERAGE NUMBER OF ANNOTATIONS RANDOMLY SAMPLED IN 30 REPETITIONS FROM THE ORIGINAL 21,615 ANNOTATIONS FROM EISEN DATASET.

TABLE 9
AVERAGE NUMBER OF ANNOTATIONS RANDOMLY SAMPLED IN 30 REPETITIONS FROM THE ORIGINAL 21,615 ANNOTATIONS FROM GASCH DATASET.

Fig. 5. The comparison results of the different annotation percentage from the functional annotation databases on Eisen dataset.

As illustrated in Fig. 5 and Fig. 6, we studied the number of genes assigned to the individual clusters when 25%, 50%, and 75% of the original annotations were used. The number of annotations for the different percentages of annotations is shown in Table 8 and Table 9. Sampling of annotations taken from the Eisen and Gasch datasets depicted a uniform pattern between the number of genes assigned and the percentage of GO annotations used. The results also show that consistency in the assignment of cluster members to the correct GO clusters according to their original full annotation can be achieved despite the various degrees of annotation used. Although yeast has bountiful annotations compared to other species, CluFA can also be used for gene expression analysis involving organisms whose annotations are less bountiful. Furthermore, this experiment covers multiple functional annotation databases with a separate run for each database. This shows that our method can support different types of functional
database formats.

Fig. 6. The comparison results of the different annotation percentages from the functional
annotation databases on the Gasch dataset.

3.5 The Unknown Gene Functions Prediction
For the SGD functional annotation database collected in 2005, 4,799 genes are marked as not
annotated in both the Eisen and Gasch datasets. To validate the predictions, some of the
predicted results from CluFA were compared against the SGD, Entrez, and MIPS functional
annotations for the unknown genes. These databases provide literature from manually curated,
high-throughput, and computational annotation to support the evidence assigned to the genes. One
finding is that the biological process category for gene ‘YIL064W’ had not been annotated and was
assigned No Biological Data Available (ND). However, CluFA assigned this gene to three terms for
both the Eisen and Gasch datasets: GO:0016192 (biological process; vesicle-mediated transport),
GO:0005737 (cellular component; cytoplasm), and GO:0016740 (molecular function; transferase
activity). According to Martin-Granados et al. [36], ‘YIL064W’ was assigned to GO:0016192
(biological process; vesicle-mediated transport), GO:0005737 (cellular component; cytoplasm), and
GO:0008757 (molecular function; S-adenosylmethionine-dependent methyltransferase activity). This
shows that CluFA can predict the previously unknown biological process term for gene ‘YIL064W’,
as confirmed by the current SGD functional annotation. Table 10 presents some of the results of
the predicted terms.

4 CONCLUSION
The aim of this paper is to give an overview of the weaknesses of existing clustering algorithms
for gene expression and to introduce a new computational framework named CluFA that utilizes
prior knowledge from the GO to overcome those weaknesses. To show the strength of CluFA, three
main drawbacks in the literature were discussed, namely the number of clusters, random
initialization, and comprehensive GO terms, and their solutions have been presented. The
determination of the number of clusters has been solved using the GO slim dataset, in which we
used 76 terms to form 76 clusters. The random initialization has been reduced with the
incorporation of functional annotation databases (SGD, Entrez, and MIPS). The GO annotation
evidence codes taken from the functional annotation databases were used to calculate the initial
membership, thus producing consistent results. The results show that, with the incorporation of
the functional annotation databases, the clusters are highly compact with the greatest separation
between clusters. The GO terms in the clusters are also more consistent and accurate. All GO term
categories were used to produce comprehensive results: a combination of biological process,
molecular function, and cellular component was used in our semi-supervised clustering process.
The results show that applying all the GO term categories gives better compactness and
separation. Furthermore, our newly proposed computational framework can predict gene functions by
utilizing the GO terms to group the unknown genes. There are a number of other issues that we are
aware of but that are beyond the scope of this investigation. For example, genes overlap among
clusters, which may induce confusion regarding the overall cluster assignments. This could be
improved by finding the most dominant cluster for a particular gene so that the gene appears in
only one cluster, thereby suggesting the gene’s most dominant function. In the future, we will
extract the overlapping genes and assign each to its most dominant group. Furthermore, our
current computational framework can also be extended to other biological knowledge such as
pathways and protein-protein interactions.

ACKNOWLEDGMENT
We would like to thank Rathiah Hashim from the Universiti Tun Hussein Onn Malaysia for
proofreading this paper. The authors would like to thank Universiti Tun Hussein Onn Malaysia for
supporting this research. This project is funded by the Malaysian Ministry of Science,
Technology, and Innovation (MOSTI) under ScienceFund grant no. 02-01-06-SF0068.

REFERENCES
[1] F. Angiulli, E. Cesario, C. Pizzuti, Random walk biclustering for microarray data, Information Sciences, 178 (6) (2008) 1479-1497.
[2] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G.
Sherlock, Gene ontology: tool for the unification of biology, Nature Genetics, 25 (1) (2000) 25-29.
[3] R. Balasubramaniyan, E. Hüllermeier, N. Weskamp, J. Kämper, Clustering of gene expression data using a local shape-based similarity measure, Bioinformatics, 21 (7) (2005) 1069-1077.
[4] S. Bandyopadhyay, A. Mukhopadhyay, U. Maulik, An improved algorithm for clustering gene expression data, Bioinformatics, 23 (21) (2007) 2859-2865.

TABLE 10
PREDICTED ANNOTATION BY CLUFA VERSUS THE CURRENT ANNOTATION FROM SGD, ENTREZ, AND MIPS.
The value in brackets after each GO term denotes its category: (P) biological process, (F) molecular function, and (C) cellular component.

[5] M. Bereta, T. Burczyński, Immune K-means and negative selection algorithms for data analysis, Information Sciences, 179 (10) (2009) 1407-1425.
[6] J.C. Bezdek, Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers: Norwell, MA; 1981.
[7] J. Botet, L. Mateos, J.L. Revuelta, M.A. Santos, A chemogenomic screening of sulfanilamide-hypersensitive saccharomyces cerevisiae mutants uncovers ABZ2, the gene encoding a fungal aminodeoxychorismate lyase, Eukaryot Cell, 6 (11) (2007) 2102-2111.
[8] M. Brameier, C. Wiuf, Co-clustering and visualization of gene expression data and gene ontology terms for S. cerevisiae using self-organizing maps, Journal of Biomedical Informatics, 40 (2) (2007) 160-173.
[9] Y. Cheng, R.M. Miura, B. Tian, Prediction of mRNA polyadenylation sites by support vector machine, Bioinformatics, 22 (19) (2006) 2320-2325.
[10] J.M. Cherry, C. Adler, C. Ball, S.A. Chervitz, S.S. Dwight, E.T. Hester, Y. Jia, G. Juvik, T. Roe, M. Schroeder, S. Weng, D. Botstein, SGD: saccharomyces genome database, Nucleic Acids Research, 26 (1) (1998) 73-80.
[11] E. Choi, J.M. Dial, D.E. Jeong, M.C. Hall, Unique D box and KEN box sequences limit ubiquitination of Acm1 and promote pseudosubstrate inhibition of the anaphase-promoting complex, The Journal of Biological Chemistry, 283 (35) (2008) 23701-23710.
[12] A. Clare, R.D. King, Predicting gene function in Saccharomyces cerevisiae, Bioinformatics, 19 (Suppl 2) (2003) 42-49.
[13] I.G. Costa, R. Krause, L. Opitz, A. Schliep, Semi-supervised learning for the identification of syn-expressed genes from fused microarray and in situ image data, BMC Bioinformatics, 8 (2007) S3.
[14] G. Cui, X. Cao, Y. Wang, L. Cao, B. Huang, C. Yang, Wavelet packet decomposition-based fuzzy clustering algorithm for gene expression data, in: Proceedings of the Asia Pacific Conference on Circuits and Systems, 2006, pp. 1027-1030.
[15] P.A. DiMaggio Jr, S.R. McAllister, C.A. Floudas, X.J. Feng, J.D. Rabinowitz, H.A. Rabitz, Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies, BMC Bioinformatics, 9 (2008) 458.
[16] M.T. Dittrich, G.W. Klau, A. Rosenwald, T. Dandekar, T. Muller, Identifying functional modules in protein-protein interaction networks: An integrated exact approach, Bioinformatics, 24 (13) (2008) i223-i231.
[17] R.O. Duda, P.E. Hart, D.G. Stork, Pattern classification. 2nd Edition. John Wiley & Sons Inc.: New York; 2001.
[18] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, in: Proceedings of the National Academy of Sciences of the United States of America, 1998, pp. 14863-14868.
[19] M. Enquist-Newman, M. Sullivan, D.O. Morgan, Modulation of the Mitotic Regulatory Network by APC-Dependent Destruction of the Cdh1 Inhibitor Acm1, Molecular Cell, 30 (4) (2008) 437-446.
[20] R.M. Ewing, P. Chu, F. Elisma, H. Li, P. Taylor, S. Climie, L. McBroom-Cerajewski, M.D. Robinson, L. O’Connor, M. Li, R. Taylor, M. Dharsee, Y. Ho, A. Heilbut, L. Moore, S. Zhang, O. Ornatsky, Y.V. Bukhman, M. Ethier, Y. Sheng, J. Vasilescu, M. Abu-Farha, J.P. Lambert, H.S. Duewel, I.I. Stewart, B. Kuehl,
K. Hogue, K. Colwill, K. Gladwish, B. Muskat, R. Kinach, S.L. Adams, M.F. Moran, G.B. Morin, T. Topaloglou, D. Figeys, Large-scale mapping of human protein-protein interactions by mass spectrometry, Molecular Systems Biology, 3 (2007) 89.
[21] Z. Fang, Y. Li, Q. Luo, L. Liu, Knowledge guided analysis of microarray data, Journal of Biomedical Informatics, 39 (4) (2006) 401-411.
[22] L. Fu, E. Medico, FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data, BMC Bioinformatics, 8 (2007) 3.
[23] G. Gamberoni, S. Storari, S. Volinia, Finding biological process modifications in cancer tissues by mining gene expression correlations, BMC Bioinformatics, 7 (2006) 6.
[24] A.P. Gasch, P.T. Spellman, C.M. Kao, O. Carmel-Harel, M.B. Eisen, G. Storz, D. Botstein, P.O. Brown, Genomic expression programs in the response of yeast cells to environmental changes, Molecular Biology of the Cell, 11 (12) (2000) 4241-4257.
[25] A. Ghouila, S.B. Yahia, D. Malouche, H. Jmel, D. Laouini, F.Z. Guerfali, S. Abdelhak, Application of multi-SOM clustering approach to macrophage gene expression analysis, Infection, Genetics and Evolution, 9 (3) (2009) 328-336.
[26] D. Hanisch, A. Zien, R. Zimmer, T. Lengauer, Co-clustering of biological networks and gene expression data, Bioinformatics, 18 (Suppl 1) (2002) 145-154.
[27] T.J. Hestilow, Y. Huang, Clustering of gene expression data based on shape similarity, Journal on Bioinformatics and Systems Biology, 2009 (1) (2009) 195712-195724.
[28] C. Hlynialuk, R. Schierholtz, A. Vernooy, G. van der Merwe, Nsf1/Ypl230w participates in transcriptional activation during non-fermentative growth and in response to salt stress in saccharomyces cerevisiae, Microbiology, 154 (8) (2008) 2482-2491.
[29] J. Ihmels, G. Friedlander, S. Bergmann, O. Sarig, Y. Ziv, N. Barkai, Revealing modular organization in the yeast transcriptional network, Nature Genetics, 31 (4) (2002) 370-377.
[30] D.W. Kim, B.Y. Kang, Iterative clustering analysis for grouping missing data in gene expression profiles, in: Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2006, pp. 129-138.
[31] S.A. Krause, H. Xu, J.V. Gray, The synthetic genetic network around PKC1 identifies novel modulators and components of protein kinase C signaling in saccharomyces cerevisiae, Eukaryot Cell, 7 (11) (2008) 1880-1887.
[32] X.L. Li, Y.C. Tan, S.K. Ng, Systematic gene function prediction from gene expression data by using a fuzzy nearest-cluster method, BMC Bioinformatics, 7 (Suppl 4) (2006) S23.
[33] A. Li, D. Tuck, An effective tri-clustering algorithm combining expression data with gene regulation information, Gene Regulation and Systems Biology, 3 (2009) 49-64.
[34] G. Li, Z. Wang, Incorporating protein-protein interactions knowledge in clustering gene expression data, in: Proceedings of the International Conference on Bioinformatics and Biomedical Engineering, 2008, pp. 207-210.
[35] Z. Lu, L. Hunter, GO molecular function terms are predictive of subcellular localization, in: Proceedings of the Pacific Symposium, 2005, pp. 151-161.
[36] C. Martin-Granados, S.P. Riechers, U. Stahl, C. Lang, Absence of see1p, a widely conserved saccharomyces cerevisiae protein, confers both deficient heterologous protein production and endocytosis, Yeast, 25 (12) (2008) 871-877.
[37] K. McGarry, M. Sarfraz, J. MacIntyre, Integrating gene expression data from microarrays using the self-organising map and the gene ontology, in: Proceedings of the 4th IAPR International Conference on Pattern Recognition in Bioinformatics, 2007, pp. 206-217.
[38] H.W. Mewes, K. Heumann, A. Kaps, K. Mayer, F. Pfeiffer, S. Stocker, D. Frishman, MIPS: a database for genomes and protein sequences, Nucleic Acids Research, 30 (1) (2002) 31-34.
[39] S. Mostafavi, D. Ray, D. Warde-Farley, C. Grouios, Q. Morris, GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function, Genome Biology, 9 (2008) S4.
[40] S. Nacu, R. Critchley-Thorne, P. Lee, S. Holmes, Gene expression network analysis and applications to immunology, Bioinformatics, 23 (7) (2007) 850-858.
[41] J. Natarajan, J. Ganapathy, Functional gene clustering via gene annotation sentences, MeSH and GO keywords from biomedical literature, Bioinformation, 2 (5) (2007) 185-193.
[42] K. Ovaska, M. Laakso, S. Hautaniemi, Fast gene ontology based clustering for microarray experiments, BioData Mining, 1 (2008) 11.
[43] D. Ostapenko, J.L. Burton, R. Wang, M.J. Solomon, Pseudosubstrate inhibition of the anaphase-promoting complex by Acm1: regulation by proteolysis and Cdc28 phosphorylation, Molecular and Cellular Biology, 28 (15) (2008) 4653-4664.
[44] H. Pang, H. Zhao, Building pathway clusters from random forests classification using class votes, BMC Bioinformatics, 9 (2008) 87.
[45] S.S. Ray, S. Bandyopadhyay, S.K. Pal, Gene ordering in partitive clustering using microarray expressions, Journal Bioscience, 32 (5) (2007) 1019-1025.
[46] S.S. Ray, S. Bandyopadhyay, S.K. Pal, Combining multi-source information through functional annotation based weighting: gene function prediction in yeast, IEEE Transactions on Biomedical Engineering, 56 (2) (2009) 229-236.
[47] G.D. Schuler, J.A. Epstein, H. Ohkawa, J.A. Kans, Entrez: molecular biology database and retrieval system, Methods Enzymol, 266 (1996) 141-162.
[48] A. Sharma, R. Podolsky, J. Zhao, R.A. McIndoe, A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets, Bioinformatics, 25 (9) (2009) 1152-1157.
[49] S. Srivastava, L. Zhang, R. Jin, C. Chan, A novel method incorporating gene ontology information for unsupervised clustering and feature selection, PLoS ONE, 3 (12) (2008) e3860.
[50] C.P. Tanya, G.C. Steven, Computational methods to identify novel methyltransferases, BMC Bioinformatics, 10 (Suppl 13) (2009) P7.
[51] L. Tari, C. Baral, S. Kim, Fuzzy c-means clustering with prior biological knowledge, Journal of Biomedical Informatics, 42 (1) (2009) 74-81.
[52] G.C. Tseng, Penalized and weighted k-means for clustering with scattered objects and prior information in high-throughput biological data, Bioinformatics, 23 (17) (2007) 2247-2255.
[53] K. Tu, H. Yu, X.Y. Li, Combining gene expression profiles and protein-protein interaction data to infer gene functions, Journal of Biotechnology, 124 (3) (2006) 475-485.
[54] R.J.P. van Berlo, L.F.A. Wessels, S.D.C. Martes, M.J.T. Reinders, Predicting gene function by combining expression and interaction data, in: Proceedings of the IEEE Computational Systems Bioinformatics Conference, 2005, pp. 166-167.
[55] W. Wang, Y. Zhang, On fuzzy cluster validity indices, Fuzzy Sets and Systems, 158 (19) (2007) 2095-2117.
[56] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, 13 (8) (1991) 841-847.
[57] D. Xutao, G. Huimin, A.H. Hesham, A hidden Markov model approach to predicting yeast gene function from sequential gene expression data, International Journal of Bioinformatics Research and Applications, 4 (3) (2008) 263-273.
[58] Y. Yuan, C.T. Li, R. Wilson, Partial mixture model for tight clustering of gene expression time-course, BMC Bioinformatics, 9 (287) (2008) 1471-2105.
[59] M.L. Zhang, J.M. Peña, V. Robles, Feature selection for multi-label naive Bayes classification, Information Sciences, 179 (19) (2009) 3218-3229.
[60] M. Zhang, T. Therneau, M.A. McKenzie, P. Li, P. Yang, A fuzzy c-means algorithm using a correlation metrics and gene ontology, in: Proceedings of the International Conference on Pattern Recognition, 2008, pp. 1-4.
[61] D. Zhu, Semi-supervised gene shaving method for predicting low variation biological pathways from genome-wide data, BMC Bioinformatics, 10 (Suppl 1) (2009) S54.
[62] A. Zien, R. Küffner, R. Zimmer, T. Lengauer, Analysis of gene expression data with pathway scores, in: Proceedings of the International Conference on Intelligence System Molecular Biology, 2000, pp. 407-417.

Shahreen Kasim is a doctoral candidate at the Faculty of Computer Science and Information
Systems, the Universiti Teknologi Malaysia. She received the BSc degree in Computer Science in
2003 and the MSc degree in Information Technology in 2005, both from the Universiti Teknologi
Malaysia. She is also a reviewer at Journal of Bioinformatics. Her research interests include
bioinformatics, gene expression analysis, and preprocessing methods such as dimension reduction
and missing value algorithms. Currently, she is a tutor at the Faculty of Science Computer and
Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM).

Safaai Deris is a Professor of Artificial Intelligence and Software Engineering at the Faculty of
Computer Science and Information System, Deputy Dean at the School of Graduate Studies, and
Director of the Laboratory of Artificial Intelligence and Bioinformatics at the Universiti
Teknologi Malaysia. He received the MEng degree in Industrial Engineering and the DEng degree in
Computer and System Sciences, both from the Osaka Prefecture University, Japan, in 1989 and 1997
respectively. His recent academic interests include the application and development of
intelligent techniques in planning, scheduling, and bioinformatics.

Razib M. Othman is a Senior Lecturer at the Faculty of Computer Science and Information Systems,
the Universiti Teknologi Malaysia. He received the BSc, MSc, and PhD degrees in Computer Science
from the Universiti Teknologi Malaysia in 1999, 2003, and 2008 respectively. Currently, he is
Director of the Laboratory of Computational Intelligence and Biotechnology at the Universiti
Teknologi Malaysia. His research interests include bioinformatics, artificial intelligence,
software agents, parallel computing, and web semantics. In March 2005, he was awarded the Young
Researcher award by the Malaysian Association of Research Scientists (MARS). Two of his
inventions, software products named 2D Engineering Drawing Extractor and 2D Design Structure
Recognizer, have won 5 awards at the 21st Invention and New Product Exposition held in
Pittsburgh, USA, including the Best Invention of the Pacific Rim, and a gold medal award at the
34th International Exhibition of Inventions of New Techniques and Products held in Geneva,
Switzerland.

Rathiah Hashim received her B.Sc. degree in Computer Science from Wichita State University, USA,
and her M.Sc. degree in Computer Science from Universiti Teknologi Malaysia. She obtained her
Ph.D. degree in visualization and psychology at Swansea University, UK. She is currently a Senior
Lecturer at the Faculty of Science Computer and Information Technology, Universiti Tun Hussein
Onn Malaysia (UTHM). Her research areas include Video Visualization, Image Processing, Psychology
(Visual Perception), and HCI.
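
The annotation sampling procedure used in the sampling experiment (one uniform random number per
gene, compared against a sampling value, repeated 30 times) can be illustrated with a minimal
sketch. The function names, the placeholder gene identifiers, and the fixed seed are assumptions
made for illustration only and are not taken from the CluFA implementation.

```python
import random


def sample_genes(genes, sampling_value, rng):
    # Draw one uniform number in [0, 1) per gene and keep the gene only when
    # its draw is higher than the sampling value, as described in the text.
    return [gene for gene in genes if rng.random() > sampling_value]


def average_sample_size(genes, sampling_value, repetitions=30, seed=0):
    # Repeat the random sampling and average the number of genes kept.
    # The fixed seed is an assumption added here for reproducibility.
    rng = random.Random(seed)
    sizes = [len(sample_genes(genes, sampling_value, rng))
             for _ in range(repetitions)]
    return sum(sizes) / repetitions


if __name__ == "__main__":
    # Hypothetical placeholder identifiers, not genes from the Eisen or
    # Gasch datasets.
    genes = [f"gene_{i}" for i in range(10000)]
    for target, sampling_value in [(75, 0.25), (50, 0.50), (25, 0.75)]:
        avg = average_sample_size(genes, sampling_value)
        print(f"{target}% sample: sampling value {sampling_value}, "
              f"average kept {avg:.1f}")
```

With a sampling value s, each gene is kept with probability 1 - s, so the sampling values 0.25,
0.50, and 0.75 correspond to the 75%, 50%, and 25% samples respectively.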
