Biological-Based Semi-Supervised Clustering Algorithm To Improve Gene Function Prediction

JOURNAL OF COMPUTING, VOLUME 3, ISSUE 4, APRIL 2011, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 1
Shahreen Kasim, Safaai Deris, Razib M. Othman, and Rathiah Hashim
Abstract—Simultaneous clustering of gene expression data with biological knowledge has become an important technique and standard practice for interpreting the data and its underlying biology. However, common clustering algorithms do not provide a comprehensive approach that looks into all three categories of annotation (biological process, molecular function, and cellular component), and they have not been tested with different functional annotation database formats. Furthermore, traditional clustering algorithms use random initialization, which causes inconsistent cluster generation, and they are unable to determine the number of clusters involved. In this paper, we present a novel computational framework called CluFA (Clustering Functional Annotation) for semi-supervised clustering of gene expression data. The framework consists of three stages: (i) preparation of Gene Ontology (GO) datasets, functional annotation databases, and testing datasets; (ii) fuzzy c-means clustering to find the optimal clusters; and (iii) computational evaluation and biological validation of the results obtained. By combining the three GO term categories (biological process, molecular function, and cellular component) with functional annotation databases (Saccharomyces Genome Database (SGD), the Yeast Database at Munich Information Centre for Protein Sequences (MIPS), and Entrez), CluFA is able to determine the number of clusters and reduce random initialization. In addition, CluFA is more comprehensive in its capability to predict the functions of unknown genes. We tested the framework on yeast gene expression data with multiple functional annotation databases. Experimental results show that 76 clusters were identified via the GO slim dataset. Applying the SGD, Entrez, and MIPS functional annotation databases to reduce random initialization improved performance in both computational evaluation and biological validation. Using the comprehensive GO term categories achieved the lowest compactness and separation values. We therefore conclude that CluFA improves gene function prediction by combining GO and gene expression values in the fuzzy c-means clustering algorithm and cross-referencing the predictions with the latest SGD annotation.
Index Terms—Fuzzy c-means, Gene expression, Gene ontology, Gene function prediction, Semi-supervised clustering
——————————  ——————————
• S. Kasim is with the Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, Malaysia.
• S. Deris is with the Universiti Teknologi Malaysia, 81310 UTM Skudai, Malaysia.
• R. M. Othman is with the Universiti Teknologi Malaysia, 81310 UTM Skudai, Malaysia.
• R. Hashim is with the Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, Malaysia.

1 INTRODUCTION
Gene function prediction has become a popular and challenging research area in bioinformatics due to the rising number of complete genome sequences. This has triggered various experiments to predict the functions of unknown genes from known gene functions (for example Zhang et al. [59], Mostafavi et al. [39], and Xutao et al. [57]). These works are beneficial in areas such as disease treatment and gene and drug discovery. The two most common techniques for predicting gene functions from gene expression data are classification and clustering. The former can be regarded as supervised learning, where samples or classes are needed to predict the class label of future observations. It is a very useful and accurate technique when a large number of high-quality samples are used to train the model. The latter, on the other hand, does not require samples or classes to train the model. It can be considered more advantageous than classification, as genes are assigned randomly, without any prior knowledge, to group a set of genes into a number of mutually exclusive subsets. It can divide data into groups of genes that share patterns of co-expression across a large number of variables in gene expression data.
In recent developments, many researchers have used prior biological knowledge in semi-supervised clustering techniques for function prediction [13, 61]. Popular sources of biological knowledge include pathways [62, 26, 29, 44], protein-protein interactions [34, 53, 16, 40, 20], and GO [2], [42, 27, 49, 21, 58, 51]. After a comprehensive literature review, GO was chosen for our framework because it is well structured and its rigorously controlled vocabulary makes it applicable in in-silico analysis searches. It provides more comprehensive knowledge by supplying terms that are categorized into biological process, molecular function, and cellular component. These terms are the basic annotations of genes and gene products for all living organisms, and researchers have used them as single [35], double [41], or multiple [23, 25] GO term categories. However, the use of multiple GO term categories produces more accurate and exhaustive results.
GO has been incorporated into several algorithms for clustering expression data, such as k-means
[5, 52], Self Organizing Map (SOM) [8, 37], biclustering [1, 15], and fuzzy c-means [60, 51, 4, 30]. The k-means algorithm partitions a number of objects into k clusters in which each object belongs to the cluster with the nearest mean. SOM creates a set of prototype vectors that represent the data and visualizes the prototypes. Biclustering finds subsets of genes that behave similarly in a subset of conditions. The fuzzy c-means algorithm calculates the centroid of each cluster and gives each data point a degree of membership. Fuzzy c-means is considered superior to the other algorithms because it gives a probability of belonging to each cluster, producing more accurate clusters [14, 22].
The popular usage of GO for clustering gene expression data has resulted in many experiments on various species, for example Arabidopsis thaliana, Candida albicans, Mus musculus, and S. cerevisiae. The baker's yeast, S. cerevisiae, was the first eukaryote whose genome was completely sequenced. Several databases are specifically dedicated to functional analysis of the yeast genome, including three major ones: the Saccharomyces Genome Database (SGD) [10], the Yeast Database at Munich Information Centre for Protein Sequences (MIPS) [38], and Entrez [47]. Works related to these databases, namely SGD [8, 51, 21, 1], MIPS [12, 3, 5, 32, 45], and Entrez [9, 41, 27], provide useful and recent information to further develop yeast genome analysis, including periodically updated lists of proteins with known or predicted functions, phenotypes of mutants (if available), protein-protein interactions, and gene expression patterns.
Despite all the findings discussed above, several drawbacks remain throughout the literature:
• The number of clusters is not defined at the beginning of the experiment.
• Non-comprehensive GO term categories are used, resulting in a limited understanding of the data.
• Only one particular functional annotation database is considered [54, 46].
• Random assignment of genes in some clustering algorithms [48, 33] causes inconsistent generation of clusters.
To overcome these drawbacks, we present a computational framework named CluFA (Clustering Functional Annotation) that uses fuzzy c-means as the clustering algorithm. The clustering stage of the framework consists of four steps: (i) initialization of the cluster number, (ii) initialization of the fuzzy membership, (iii) calculation of the centroids, and (iv) fuzzy membership update. The novel contributions of CluFA are embedded in steps (i) and (ii). In step (i), the GO slim is used to define the number of clusters automatically, thus overcoming the first drawback above. Furthermore, all GO term categories are used so that the output is comprehensive and covers all terms in the GO. In step (ii), we use more than one functional annotation database for S. cerevisiae (SGD, MIPS, and Entrez) in order to reduce random initialization during the assignment of cluster membership.

2 METHODS
CluFA consists of three stages, as shown in Fig. 1. The first stage is the preparation of data, in which we used GO datasets, functional annotation databases, and testing datasets. These datasets are used in our semi-supervised clustering process. In the second stage, fuzzy c-means clustering is implemented. It assigns each gene membership values according to its biological clusters. The final stage is the analysis of the results, in which two evaluation criteria were considered: (i) computational evaluation, and (ii) biological validation. In the computational evaluation, we evaluated our results using a) compactness and separation, b) consistency, and c) accuracy. The compactness and separation evaluation measures the ratio of compact to separate clusters, while the consistency and accuracy evaluations assess the assignment of genes per annotation in each cluster. In the biological validation, the functions of unknown genes are predicted. To validate the predictions, the predicted genes were cross-checked against the SGD, Entrez, and MIPS annotations. Below, we describe each of these processes.

[Figure 1. Preparation of Data: GO datasets (GO slim, GO term), functional annotation databases (SGD, Entrez, MIPS), testing datasets (Eisen, Gasch). Clustering: number of cluster initialization, fuzzy membership initialization, centroid calculation, fuzzy membership update. Analysis of Result: computational evaluation, biological validation.]
Fig 1. The computational framework of CluFA.

2.1 Preparation of Data
2.1.1 GO Datasets
Currently, the GO website (http://www.geneontology.org) lists about 28,108 terms, which form the controlled vocabulary used to describe gene and gene product attributes in any organism. These terms are classified into one of three ontologies: cellular component, biological process, or molecular function. The terms are structured as a Directed Acyclic Graph (DAG). In this study, we used the GO slim and GO term datasets from an older version in order to predict new gene functions. We applied the GO slim dataset to form the initial clusters. The GO slim is a subset of GO terms in which some of the terms are placed at a higher level in the GO hierarchy. In GO, there are GO slims for different organisms or usages. We used the GO slim yeast dataset in OBO (Open Biomedical Ontologies) format, which was generated in September 2005. The GO slim yeast consists of 76 terms. Of these 76 terms, there are 32 terms in biological process, 21 terms in
molecular function, and 23 terms in cellular component. To identify and assign each gene to its corresponding cluster, the GO term database in MySQL format was downloaded (updated in September 2005). This dataset contains a total of 19,458 terms and their relationships.

2.1.2 Functional Annotation Databases
The functional annotation databases used in this experiment were downloaded from three different sources, namely SGD, Entrez, and MIPS. The SGD file used was compiled in September 2005 and comprises 33,651 genes. SGD is a scientific database of the molecular biology and genetics of the yeast S. cerevisiae. The Entrez 'gene2go' file was downloaded in June 2009 and consists of 52,351 genes. Entrez is an integrated search and retrieval system that provides global queries across databases. Lastly, the MIPS files provide a numeric, hierarchical system that denotes the various classes of biological functions. For MIPS, two files were downloaded, 'funcat-2.1' and 'mips2go', which were compiled in March 2007 and consist of 15,924 genes. From these databases, we also extract the GO annotation evidence codes to reduce the random initialization of fuzzy membership.

2.1.3 Testing Datasets
We tested CluFA on the yeast gene expression datasets of Eisen et al. [18] and Gasch et al. [24]. The Eisen dataset contains the expression profiles of 6,221 yeast genes with 80 samples taken during the diauxic shift, the mitotic cell division cycle, sporulation, and temperature and reducing shocks. The Gasch dataset contains 6,152 genes with 173 samples that test gene expression behaviour during various stress conditions.

2.2 Clustering
2.2.1 Number of clusters initialization
In this step, a set of GO slim terms is used to initialize the number of clusters. Denoting the GO slim as $GO_{all}$, we used all 76 terms in $GO_{all}$, with each term forming a separate cluster, $cl$. Once the clusters have been determined, we assigned the genes in the gene expression data to these clusters. The gene expression dataset is represented as $G$ and each gene in it as $g$, where $g \in G$. Let $t$ be a descendant of $cl$ in the GO hierarchy, where $cl \in GO_{all}$; each $g$ in $t$ is assigned to its corresponding $cl$(s), and there can be more than one $cl$ for each $g$. This is due to the structure of the GO, which is a DAG (Directed Acyclic Graph). For example, gene 'YLR194C' has four GO terms: GO:0009277 (cellular component; chitin- and beta-glucan-containing cell wall), GO:0031505 (biological process; chitin- and beta-glucan-containing cell wall organization and biogenesis), GO:0005199 (molecular function; structural constituent of cell wall), and GO:0046658 (cellular component; anchored to plasma membrane). GO:0046658 has two parents among the $cl$ terms, GO:0005886 (cellular component; plasma membrane) and GO:0016020 (cellular component; membrane). The rest of the GO terms have only one parent: GO:0009277 has GO:0005618 (cellular component; cell wall), GO:0031505 has GO:0007047 (biological process; cell wall organization and biogenesis), and GO:0005199 has GO:0005198 (molecular function; structural molecule activity). This example of cluster initialization is illustrated in Fig. 2.

Gene ID    Associated GO Term    Related to GO Slim cluster
YLR194C    GO:0009277            GO:0005618
           GO:0031505            GO:0007047
           GO:0005199            GO:0005198
           GO:0046658            GO:0005886, GO:0016020
Fig. 2. An example of how gene YLR194C is assigned to its corresponding clusters.

2.2.2 Fuzzy membership initialization
In the initial determination of membership, once the gene $g$ has been assigned to its corresponding $cl$(s), the initial membership value, $U_{ij}^{(0)}$, for $g$ is defined using the following formula:
\[ U_{ij}^{(0)} = [\ldots]\; rs_{ij}\; [\ldots]\; 1\; [\ldots]\; r, \quad i = 1, \ldots, g_n, \quad j = 1, \ldots, cl_l \tag{1} \]
where $rs_{ij}$ is the reliability score according to the GO annotation evidence code, taken from the GO annotation in SGD, Entrez, and MIPS, that supports $g$ being in $cl$. The constant $\rho$ gives some variation in each iteration, and $r$ is a small constant that denotes a level of reliability when $g_i$ has no GO annotation evidence code. The reliability score is obtained by assigning each GO annotation evidence code a weight between 0 and 1 based on the hierarchy of reliability from the GO, with values closest to 1 being the most reliable. During the assignment of initial membership, the most reliable score is chosen when more than one GO annotation evidence code is assigned to a particular gene.

TABLE 2
THE DISTRIBUTION NUMBER OF GENES WITH GO ANNOTATION EVIDENCE CODE AND WITHOUT GO ANNOTATION EVIDENCE CODE.

The GO annotation evidence codes and their reliability scores are as follows: Inferred from Electronic Annotation (IEA: 0.5), Non-traceable Author Statement (NAS: 0.6), Inferred from Reviewed Computational Analysis (RCA: 0.7), Inferred from Sequence or Structural Similarity (ISS: 0.7), Inferred by Curator (IC: 0.7), Inferred from Expression Pattern (IEP: 0.7), Inferred from Physical Interaction (IPI: 0.8), Inferred from Mutant Phenotype (IMP: 0.8), Inferred from Genetic Interaction (IGI: 0.8), Inferred from Direct Assay (IDA: 0.9), and Traceable Author Statement (TAS: 0.9). Next, the functional annotation databases are applied: we checked the GO annotation evidence codes in SGD, MIPS, and Entrez separately and, from these three databases, matched each gene with its respective $rs_{ij}$. An example of initial membership assignment for these datasets is shown in Table 1.
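The two initialization steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`slim_clusters`, `gene_clusters`, `reliability_score`) and the toy `PARENTS` map are hypothetical, the DAG fragment contains only the parent links quoted for YLR194C in Fig. 2, and the evidence codes used in the usage example are invented for illustration. Only the reliability weights are taken from the text.

```python
from typing import Dict, Iterable, List, Optional, Set

# Reliability weights per GO evidence code, as listed in the text.
RELIABILITY: Dict[str, float] = {
    "IEA": 0.5, "NAS": 0.6, "RCA": 0.7, "ISS": 0.7, "IC": 0.7, "IEP": 0.7,
    "IPI": 0.8, "IMP": 0.8, "IGI": 0.8, "IDA": 0.9, "TAS": 0.9,
}

def slim_clusters(term: str, parents: Dict[str, List[str]], slim: Set[str]) -> Set[str]:
    """Collect every GO slim term reachable from `term` by walking parent links.

    A term may reach several slim terms because the GO is a DAG.
    """
    found: Set[str] = set()
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
        stack.extend(parents.get(current, []))
    return found

def gene_clusters(terms: Iterable[str], parents: Dict[str, List[str]], slim: Set[str]) -> Set[str]:
    """Union of the slim clusters reached from all GO terms annotated to one gene."""
    clusters: Set[str] = set()
    for term in terms:
        clusters |= slim_clusters(term, parents, slim)
    return clusters

def reliability_score(codes: Iterable[str]) -> Optional[float]:
    """Most reliable score among a gene's evidence codes; None if it has no known code."""
    scores = [RELIABILITY[code] for code in codes if code in RELIABILITY]
    return max(scores) if scores else None

# Toy DAG fragment (assumption: only the links quoted for YLR194C in Fig. 2).
PARENTS = {
    "GO:0009277": ["GO:0005618"],
    "GO:0031505": ["GO:0007047"],
    "GO:0005199": ["GO:0005198"],
    "GO:0046658": ["GO:0005886", "GO:0016020"],
}
SLIM = {"GO:0005618", "GO:0007047", "GO:0005198", "GO:0005886", "GO:0016020"}
YLR194C_CLUSTERS = gene_clusters(PARENTS.keys(), PARENTS, SLIM)
```

With this toy input, `YLR194C_CLUSTERS` holds the five GO slim clusters shown in Fig. 2, and `reliability_score(["IEA", "IDA"])` (hypothetical codes) keeps the IDA weight because it is the more reliable of the two.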
TABLE 1
AN EXAMPLE OF $U_{ij}^{(0)}$ INITIAL MEMBERSHIP ASSIGNMENT WHEN $\rho = 0.2$, $r = 0.3$ FOR YLR194C

In the absence of a GO annotation evidence code, the following definition is used:
\[ U_{ij}^{(0)} = [\ldots]\; r \tag{2} \]
In formulas (1) and (2), both $\rho$ and $r$ need to be assigned within the range $0 \le \rho, r \le 1$. The value of $\rho$ is assigned to check the reliability of $U_{ij}^{(0)}$ in both gene expression and GO annotation. Meanwhile, $r$ is the reliability score when $g$ does not have a GO annotation evidence code. The assignment of $r$ is needed in order to assign the unknown genes to $t$ based on their expression patterns. The initial membership values $U_{ij}^{(0)}$ influence the final results, thus improving the final membership values when $g$ has a GO annotation evidence code. The distribution of the number of genes with or without a GO annotation evidence code for the SGD, MIPS, and Entrez datasets is shown in Table 2.

2.2.3 Centroid calculation
Next, we computed the centroid of each cluster, which denotes the most similar characteristics among the data points, based on Bezdek et al. [6]. Assuming $x_i$ is a vector of expression values for $g_i$, the fuzzy centroid, $C^{(k)} = [c_j]$, is a vector set for $k$ = number of iteration, $j = 1, \ldots, cl_l$, and is quantified as:
\[ c_j = \frac{\sum_{i=1}^{n} (u_{ij})^m x_i}{\sum_{i=1}^{n} (u_{ij})^m} \tag{3} \]
where $m$ is the fuzzy parameter and $n$ is the number of genes.

2.2.4 Fuzzy membership update
An initial membership based on a pre-defined minimal value was given to each cluster member whose GO annotation evidence code is unknown. Membership initialization of $u_{ij}$ of $g_i$ in $cl_j$ is done so that $U^{(0)} = [u_{ij}]$. Then, $U^{(k)} = [u_{ij}]$ is updated, where $u_{ij}$ denotes the probability of belongingness of pattern $x_i$ to $cl$. The formula to update the fuzzy membership [6] is given by:
\[ u_{ij} = \frac{u_{ij}}{\sum_{k=1}^{nc} u_{ik}} \tag{4} \]
where
\[ u_{ij} = \frac{\left( \frac{1}{\| x_i - c_j \|} \right)^{\frac{1}{m-1}}}{\sum_{j=1}^{nc} \left( \frac{1}{\| x_i - c_j \|} \right)^{\frac{1}{m-1}} \times u_{ij}^{(0)}} \tag{5} \]
in which the role of $u_{ij}^{(0)}$ in the denominator of equation (5) is to normalize the value of the membership update with the initial membership in order to handle a large amount of numeric data. The optimal cluster is determined by the compactness and separation (CS) function, taken as the minimum CS value after a pre-defined number of iterations. The formula is:
\[ CS = \frac{\sum_{i=1}^{n} \sum_{j=1}^{ncl} u_{ij}^{2} \, \| c_j - x_i \|^2}{n \, \min_{i \neq j} \| c_j - c_i \|^2} \tag{6} \]
Subsequently, if the current CS is less than the minimum CS ($CS^*$), then $CS^* = CS$, the optimal cluster $C^* = C^k$, and the optimal membership $U^* = U^k$. These steps are repeated until the algorithm reaches the end of the iterations, leaving no empty clusters. $C^*$ and $U^*$ are the output of the algorithm.

2.3 Analysis of results
The analysis of the results is divided into computational evaluation and biological validation. In the computational evaluation, the following measurements are considered: (i) compactness and separation, (ii) consistency, and (iii) accuracy (precision, recall, and F-measure). For the CS evaluation, the quality of the clusters is measured in terms of the distribution of each gene within its cluster (compactness) and the distribution of the clusters relative to one another (separation). Meanwhile, the biological significance is evaluated using the consistency and accuracy measurements. The cluster output is therefore evaluated for both cluster quality and biological significance.
Compactness and separation: This measure determines the ratio of the compactness within a cluster to the separation of that cluster from other clusters [56, 17, 55], as defined by Equation 6, in which the smallest value of CS denotes the minimum intra-cluster distance and the maximum inter-cluster distance.
Consistency: This measure checks the consistency of the annotation in the output cluster using the definition below:
\[ CT = 1 - \frac{m}{n} \tag{7} \]
where, for every $cl$, $m$ refers to the number of genes most frequently annotated in $cl$ while $n$ refers to the total number of genes in $cl$. By this definition, a smaller CT represents a more consistent cluster.
Precision: This measure is the ratio of the number of genes in the final cluster to the number of genes in the final cluster added to the number of genes that are not in the initial cluster:
\[ precision = \frac{rl}{ir + rl} \tag{8} \]
where $rl$ refers to the number of genes in the final cluster, while $ir$ refers to the number of genes that are not in the initial cluster.
Recall: This measure is the ratio of the number of genes in the final cluster to the number of genes in the final cluster added to the number of genes not in the final cluster:
\[ recall = \frac{rl}{ar + rl} \tag{9} \]
where $ar$ refers to the number of genes in the initial cluster.
F-measure: This measure is defined as the harmonic mean of pairwise precision and recall, where the traditional information retrieval measures are adapted for evaluating the accuracy of the clustering algorithm:
\[ F\text{-}measure = \frac{2 \times precision \times recall}{precision + recall} \tag{10} \]
For biological validation, gene functions are predicted. The predicted genes are cross-checked against the latest (2009) annotation database from SGD, so that the initial predictions for the unknown genes can be confirmed against the current GO annotation.

3 RESULTS AND DISCUSSION
3.1 Evaluation on the Impact of the GO Term Categories
We tested different categories of GO terms (biological process, molecular function, and cellular component) with the SGD functional annotation using the Eisen data to assess the comprehensiveness of the cluster output. Our main aim was to look into the effect of using different combinations of GO term categories. Eisen was used as the testing dataset, while the SGD functional annotation database was used to examine the impact of the different combinations. Fig. 3 shows that the combination of the biological process, molecular function, and cellular component GO term categories produced the lowest CS, and hence the minimum intra-cluster and maximum inter-cluster distances. This indicates that combining all GO term categories produced the best results. Based on this result, we assumed that the MIPS and Entrez functional annotation databases would also return their best results with the combination of all GO term categories and, at the same time, produce consistent clustering results. This is why we used the combination of all GO term categories in the subsequent evaluations.

3.2 Evaluation on the Compactness and Separation of the Cluster
We used the fuzzy validity criterion, CS, of Xie and Beni [50] to measure our clustering results in terms of compactness (intra-cluster) and separation (inter-cluster). Table 3 and Table 4 show the values of CS for the clusters using different values of $\rho$ and $r$. The $\rho$ and $r$ values are applied to determine the gene membership values in each cluster. The lowest value of CS represents the best combination of compactness and separation in the final result. As presented in both tables, with the SGD functional annotation the clustering results show the most compact clusters with the furthest separation between them when $\rho$ = 0.3 and $r$ = 0.2, for both the Eisen and Gasch datasets. With the Entrez functional annotation, the best result is obtained when $\rho$ = 0.3 and $r$ = 0.2 for the Eisen dataset and when $\rho$ = 0.2 and $r$ = 0.2 for the Gasch dataset. With the MIPS functional annotation database, the best result is obtained when $\rho$ = 0.1 and $r$ = 0.1 for both the Eisen and Gasch datasets. The results also showed that, on average, only 2.5% of the genes changed cluster in comparison with their initial cluster. The results confirmed that different values of $\rho$ and $r$ influence the reliability score and gene expression values during the calculation of membership assignment, which eventually produces highly compact clusters with maximum separation between clusters.

TABLE 3
RESULT OF THE EVALUATION ON THE COMPACTNESS AND SEPARATION OF THE CLUSTER USING EISEN DATASET.
The value in bold denotes the optimal CS for SGD (10.10), Entrez (11.30), and MIPS (62.41).

TABLE 4
RESULT OF THE EVALUATION ON THE COMPACTNESS AND SEPARATION OF THE CLUSTER USING GASCH DATASET.
The value in bold denotes the optimal CS for SGD (19.02), Entrez (19.14), and MIPS (73.20).
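As an illustration of how the CS criterion of Equation 6 behaves, the following is a minimal pure-Python sketch rather than the authors' code. The function names (`sq_dist`, `compactness_separation`) and the one-dimensional toy data are assumptions made for the example; the separation term is taken as the smallest squared distance between two distinct centroids, as in the Xie and Beni index.

```python
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]

def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def compactness_separation(
    X: Sequence[Vector], C: Sequence[Vector], U: Sequence[Sequence[float]]
) -> float:
    """CS = sum_i sum_j u_ij^2 * ||c_j - x_i||^2 / (n * min ||c_j - c_k||^2, j != k).

    X: n expression vectors, C: cluster centroids, U[i][j]: membership of gene i in cluster j.
    Lower values mean tighter clusters that lie further apart.
    """
    n = len(X)
    compactness = sum(
        U[i][j] ** 2 * sq_dist(C[j], X[i])
        for i in range(n)
        for j in range(len(C))
    )
    separation = min(sq_dist(a, b) for a, b in combinations(C, 2))
    return compactness / (n * separation)

# Toy data (assumption): four genes with one expression value each, two clusters.
X = [[0.0], [1.0], [4.0], [5.0]]
U = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
cs_good = compactness_separation(X, [[0.5], [4.5]], U)  # centroids at the group means
cs_poor = compactness_separation(X, [[0.5], [2.5]], U)  # second centroid misplaced
```

In this toy case the well-placed centroids give the smaller CS value, which is why the lowest CS across runs is kept as the optimal clustering.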
Fig. 4. The performance of single, double and multiple combination of GO term categories of SGD functional annotation database in Eisen dataset.

3.3 Evaluation on the Biological Significance of the Cluster
To further evaluate the reliability of our clustering results with the functional annotation, we use our own CT measure. This measurement determines how consistently the genes in a cluster are annotated by GO terms. The smallest value of CT represents the most consistent result. As presented in Table 5 (using the Eisen dataset), our clustering result achieved the lowest CT value when $\rho$ = 0.5 and $r$ = 0.3 for the SGD and Entrez functional annotations. For the MIPS functional annotation, the lowest CT value was produced when $\rho$ = 0.7 and $r$ = 0.3. In Table 6 (using the Gasch dataset), our clustering result achieved the lowest CT value when $\rho$ = 0.5 and $r$ = 0.4 for the SGD functional annotation, while for the Entrez functional annotation the lowest CT value is achieved when $\rho$ = 0.5 and $r$ = 0.3. For the MIPS functional annotation, the lowest CT value was produced when $\rho$ = 0.7 and $r$ = 0.3. The dashes ('-') in Table 5 and Table 6 are due to clusters containing no genes. This happens when the membership values are beyond the threshold range, so the CT values cannot be calculated. Based on this evaluation, the differences between the CT values for the various $\rho$ and $r$ after each run are very small, which is evidence of consistency in the result. To further evaluate the result, we also conducted tests for precision, recall, and F-measure in comparison with the Fang and Tari algorithms, based on the application of biological knowledge (GO) and random initialization. Our method, which reduces the randomness of the gene assignment, was compared to the random method used by Fang. We also compared it with the Tari method, which used only one GO term category and one functional annotation database. Both the Fang and Tari methods used the same benchmark dataset, which enables us to interpret the results properly. In Table 7, we present the CluFA performance in precision, recall, and F-measure with the SGD, MIPS, and Entrez functional annotation databases for the Eisen and Gasch datasets. We used the optimal CS values from Table 3 and Table 4, and they produced better results for recall and F-measure compared to Fang and Tari. In comparing with the Tari algorithm, we used their optimal values for the Eisen dataset ($\rho$ = 0.3 and $r$ = 0.2), while for the Fang algorithm the default threshold value of 0.21 was used. The comparative results of precision, recall, and F-measure obtained for the different algorithms support the accuracy of CluFA. This leads us to conclude that our cluster initialization using the functional annotation databases with the fuzzy c-means clustering algorithm gives better biological significance than the other algorithms.

TABLE 5
RESULT OF EVALUATION ON THE CONSISTENCY OF THE CLUSTERS USING EISEN DATASET.
The value in bold denotes the optimal consistency for SGD (0.19), Entrez (0.08), and MIPS (0.24).

TABLE 6
RESULT OF EVALUATION ON THE CONSISTENCY OF THE CLUSTERS USING GASCH DATASET.
The value in bold denotes the optimal consistency for SGD (0.12), Entrez (0.08), and MIPS (0.22).

TABLE 7
COMPARISON ON THE PRECISION, RECALL, AND F-MEASURE OF THE OTHER CLUSTERING ALGORITHMS.

3.4 Evaluation on the Impact of the GO annotation
In this study, our method incorporates the GO annotations in the SGD, Entrez, and MIPS functional annotation databases in order to reduce random initialization during the membership assignment, thus producing better results. We therefore tested different percentages of sample annotations in order to assess the impact on the gene assignment relative to the original sample annotations. This is done by calculating the average number of genes assigned to the optimal clusters, with the original annotations as our reference. We drew a random number of type double in the range 0 to 1 as a probability value for each gene in the expression list. If the random number for a gene is higher than the sampling value, the gene is included in the sampling list. For example, if we choose 0.25 as the sampling value for 75% sample annotation, any gene with a random number higher than 0.25 is included in the sampling list. As illustrated in Fig. 5 and Fig. 6, we studied the number of genes assigned to the individual clusters when 25%, 50%, and 75% of the original annotations were used. The number of annotations for the different percentages is shown in Table 8 and Table 9. Sampling of annotations taken from the Eisen and Gasch datasets showed a uniform pattern between the number of genes assigned and the percentage of GO annotations used. The results also show that consistency in the assignment of cluster members to the correct GO clusters, according to their original full annotation, can be achieved despite the various degrees of annotation used. Although yeast has bountiful annotations compared to other species, CluFA can also be used for gene expression analysis involving organisms whose annotations are less bountiful. Furthermore, this experiment covers multiple functional annotation databases with a separate run for each database. This shows that our method can support different types of functional database formats.

TABLE 8
AVERAGE NUMBER OF ANNOTATIONS RANDOMLY SAMPLED IN 30 REPETITIONS FROM THE ORIGINAL 21,615 ANNOTATIONS FROM EISEN DATASET.

TABLE 9
AVERAGE NUMBER OF ANNOTATIONS RANDOMLY SAMPLED IN 30 REPETITIONS FROM THE ORIGINAL 21,615 ANNOTATIONS FROM GASCH DATASET.

Fig. 5. The comparison results of the different annotation percentage from the functional annotation databases on Eisen dataset.
Fig. 6. The comparison results of the different annotation percentage from the functional annotation databases on Gasch dataset.

3.5 The Unknown Gene Functions Prediction
For the SGD functional annotation database collected in 2005, 4,799 genes are marked as not annotated in both the Eisen and Gasch datasets. To validate the predictions, some of the predicted results from CluFA were compared against the SGD, Entrez, and MIPS functional annotations for the unknown genes. These databases provide literature from manually curated, high-throughput, and computational annotation to support the evidence assigned to the genes. One of the findings shows that the biological process GO term category for gene 'YIL064W' had not been annotated and was assigned as No Biological Data Available (ND). However, CluFA managed to assign this gene to three categories for both the Eisen and Gasch datasets: GO:0016192 (biological process; vesicle-mediated transport), GO:0005737 (cellular component; cytoplasm), and GO:0016740 (molecular function; transferase activity). According to Martin-Granados et al. [36], 'YIL064W' was assigned to GO:0016192 (biological process; vesicle-mediated transport), GO:0005737 (cellular component; cy-

which we used 76 terms to form 76 clusters. The random initialization has been reduced with the incorporation of functional annotation databases (SGD, Entrez, and MIPS). The GO annotation evidence codes taken from the functional annotation databases were used to calculate the initial membership, thus producing consistent results. The results showed that, with the incorporation of the functional annotation databases, the clusters are highly compact with the furthest separation between clusters. The GO terms in the clusters are also more consistent and accurate. All GO term categories were used to produce comprehensive results, with biological process, molecular function, and cellular component combined in our semi-supervised clustering process. Applying all the GO term categories gave better results in terms of compactness and separation. Furthermore, our newly proposed computational framework can predict gene functions by using the GO terms to group the unknown genes. There are a number of other issues that we are aware of but that are beyond the scope of this investigation. For example, genes overlap among clusters, which may cause confusion regarding the overall cluster assignments. This could be improved by finding the most dominant cluster for a particular gene, so that the gene appears in only one cluster, which would suggest the gene's most dominant function. In the future, we will extract the overlapping genes and assign each to its most dominant group. Furthermore, our current
toplasm), and GO:0008757 (molecular function; S- computational framework can also be extended to oth‐
adenosylmethionine-dependent methyltransferase activity). It er biological knowledge such as pathway and protein‐
has been proved that CluFA can predict the unknown
protein interaction.
biological process term for gene ‘YIL064W’, confirmed by
the current SGD functional annotation. Table 10 presents
some of the results of the predicted terms. ACKNOWLEDGMENT
We would like to thank Rathiah Hashim from the Univer-
siti Tun Hussein Onn Malaysia for proofreading this
4 CONCLUSION
journal. The authors would like to thank Universiti Tun
The aim of this paper is to give an overview of the Hussein Onn Malaysia for supporting this research. This
weaknesses of existing clustering algorithms for gene project is funded by the Malaysian Ministry of Science,
expression, and to introduce a new computational Technology, and Innovation (MOSTI) under ScienceFund
grant no. 02-01-06-SF0068
framework named CluFA that utilizes prior know‐
ledge from the GO to overcome those weaknesses. REFERENCES
However, in order to show the strength of CluFA, [1] F. Angiulli, E. Cesario, C. Pizzuti, Random walk biclustering
three main drawbacks in the literature are discussed. for microarray data, Information Sciences, 178 (6) (2008) 1479‐
They are the number of clusters, random initialization, 1497.
and comprehensive GO terms and their solutions have [2] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler,
J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig,
been presented. The determination in the number of
M.A. Harris, D.P. Hill, L. Issel‐Tarver, A. Kasarskis, S. Lewis,
clusters has been solved using GO slim dataset in J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G.
TABLE 10 [5] M. Bereta, T. Burczyński, Immune K‐means and negative
selection algorithms for data analysis, Information Sciences,
PREDICTED ANNOTATION BY CLUFA VERSUS THE CURRENT AN- 179 (10) (2009) 1407‐1425.
NOTATION FROM SGD, ENTREZ, AND MIPS. [6] J.C. Bezdek, Pattern recognition with fuzzy objective function
algorithms. Kluwer Academic Publishers: Norwell, MA; 1981.
[7] J. Botet, L. Mateos, J.L. Revuelta, M.A. Santos, A chemoge‐
nomic screening of sulfanilamide‐hypersensitive saccharo‐
myces cerevisiae mutants uncovers ABZ2, the gene encoding
a fungal aminodeoxychorismate lyase, Eukaryot Cell, 6 (11)
(2007) 2102‐2111.
[8] M. Brameier, C. Wiuf, Co‐clustering and visualization of gene
expression data and gene ontology terms for S. cerevisiae us‐
ing self‐organizing maps, Journal of Biomedical Informatics,
40 (2) (2007) 160‐173.
[9] Y. Cheng, R.M. Miura, B. Tian, Prediction of mRNA polyade‐
nylation sites by support vector machine, Bioinformatics, 22
(19) (2006) 2320‐2325.
[10] J.M. Cherry, C. Adler, C. Ball, S.A. Chervitz, S.S. Dwight, E.T.
Hester, Y. Jia, G. Juvik, T. Roe, M. Schroeder, S. Weng, D.
Botstein, SGD: saccharomyces genome database, Nucleic Ac‐
ids Research, 26 (1) (1998) 73‐80.
[11] E. Choi, J.M. Dial, D.E. Jeong, M.C. Hall, Unique D box and
KEN box sequences limit ubiquitination of Acm1 and pro‐
mote pseudosubstrate inhibition of the anaphase‐promoting
complex, The Journal of Biological Chemistry, 283 (35) (2008)
23701‐23710.
[12] A. Clare, R.D. King, Predicting gene function in Saccharo‐
myces cerevisiae. Bioinformatics, 19 (Suppl 2) (2003) 42‐49.
[13] I.G. Costa, R. Krause, L. Optiz, A. Schliep, Semi‐supervised
learning for the identification of syn‐expressed genes from
fused microarray and in situ image data, BMC Bioinformat‐
ics, 8 (2007) S3.
[14] G. Cui, X. Cao, Y. Wang, L. Cao, B. Huang, C. Yang, Wavelet
packet decomposition‐based fuzzy clustering algorithm for
gene expression data, in: Proceedings of the Asia Pacific Con‐
ference on Circuits and Systems, 2006, pp.1027‐1030.
[15] P.A. DiMaggio Jr, S.R. McAllister, C.A. Floudas, X.J. Feng,
J.D. Rabinowitz,, H.A. Rabitz, Biclustering via optimal re‐
ordering of data matrices in systems biology: rigorous me‐
thods and comparative studies, BMC Bioinformatics, 9 (2008)
458.
[16] M.T. Dittrich, G.W. Klau, A. Rosenwald, T. Dandekar, T.
Muller, Identifying functional modules in protein‐protein in‐
teraction networks: An integrated exact approach, Bioinfor‐
matics, 24 (13) (2008) i223‐i231.

The value in the bracket for GO terms category denotes as follows; [17] R.O. Duda, P.E. Hart, D.G. Stork, Pattern classification. 2nd
(P): biological process, (F): molecular function, and (C): cellular Edition. John‐Wiley & Son Inc.: New York; 2001.
component. [18] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster
analysis and display of genome‐wide expression patterns, in:
Sherlock, Gene ontology: tool for the unification of biology, Proceedings of the National Academy of Sciences of the Unit‐
Nature Genetics, 25 (1) (2000) 25‐29. ed States of America, 1998, pp.14863‐14868.
[3] R. Balasubramaniyan, E. Hüllermeier, N. Weskamp, J. [19] M. Enquist‐Newman, M. Sullivan, D.O. Morgan, Modulation
Kämper, Clustering of gene expression data using a local of the Mitotic Regulatory Network by APC‐Dependent De‐
shape‐based similarity measure, Bioinformatics, 21 (7) (2005) struction of the Cdh1 Inhibitor Acm1, Molecular Cell, 30 (4)
1069‐1077. (2008) 437‐446.
[4] S. Bandyopadhyay, A. Mukhopadhyay, U. Maulik, An im‐ [20] R.M. Ewing, P. Chu, F. Elisma, H. Li, P. Taylor, S. Climie, L.
proved algorithm for clustering gene expression data, Bioin‐ McBroom‐Cerajewski, M.D. Robinson, L. O’Connor, M. Li, R.
formatics, 23 (21) (2007) 2859‐2865. Taylor, M. Dharsee, Y. Ho, A. Heilbut, L. Moore, S. Zhang, O.
Ornatsky, Y.V. Bukhman, M. Ethier, Y. Sheng, J. Vasilescu, M.
Abu‐Farha, J.P. Lambert, H.S. Duewel, I.I. Stewart, B. Kuehl,
K. Hogue, K. Colwill, K. Gladwish, B. Muskat, R. Kinach, S.L. tein, confers both deficient heterologous protein production

Adams, M.F. Moran, G.B. Morin, T. Topaloglou, D. Figeys, and endocytosis, Yeast, 25 (12) (2008) 871‐877.
Large‐scale mapping of human protein‐protein interactions [37] K. McGarry, M. Sarfraz, J. MacIntyre, Integrating gene ex‐
by mass spectrometry, Molecular Systems Biology, 3 (2007) pression data from microarrays using the self‐organising map
89. and the gene ontology, in: Proceedings of the 4th IAPR Inter‐
[21] Z. Fang, Y. Li, Q. Luo, L. Liu, Knowledge guided analysis of national Conference on Pattern Recognition in Bioinformatics,
microarray data. Journal of Biomedical Informatics, 39 (4) 2007, pp. 206‐217.
(2006) 401‐411. [38] H.W. Mewes, K. Heumann, A. Kaps, K. Mayer, F. Pfeiffer, S.
[22] L. Fu, E. Medico, FLAME, a novel fuzzy clustering method Stocker, D. Frishman, MIPS: a database for genomes and pro‐
for the analysis of DNA microarray data, BMC Bioinformat‐ tein sequences. Nucleic Acids Research, 30 (1) (2002) 31‐34.
ics, 8 (2007) 3. [39] S. Mostafavi, D. Ray, D. Warde‐Farley, C. Grouios, Q. Morris,
[23] G. Gamberoni, S. Storari, S. Volinia, Finding biological GeneMANIA: a real‐time multiple association network inte‐
process modifications in cancer tissues by mining gene ex‐ gration algorithm for predicting gene function, Genome Biol‐
pression correlations, BMC Bioinformatics, 7 (2006) 6. ogy, 9 (2008) S4.
[24] A.P. Gasch, P.T. Spellman, C.M. Kao, O. Carmel‐Harel, M.B. [40] S. Nacu, R. Critchley‐Thorne, P. Lee, S. Holmes, Gene expres‐
Eisen, G. Storz, D. Botstein, P.O. Brown, Genomic expression sion network analysis and applications to immunology, Bio‐
programs in the response of yeast cells to environmental informatics, 23 (7) (2007) 850‐858.
changes, Molecular Biology of the Cell, 11 (12) (2000) 4241‐ [41] J. Natarajan, J. Ganapathy, Functional gene clustering via
4257. gene annotation sentences, MeSH and GO keywords from
[25] A. Ghouila, S.B. Yahia, D. Malouche, H. Jmel, D. Laouini, F.Z. biomedical literature, Bioinformation, 2 (5) (2007) 185‐193.
Guerfali, S. Abdelhak, Application of multi‐SOM clustering [42] K. Ovaska, M. Laakso, S. Hautaniemi, Fast gene ontology
approach to macrophage gene expression analysis, Infection, based clustering for microarray experiments, BioData Mining,
Genetics and Evolution, 9 (3) (2009) 328‐336. 1 (2008) 11.
[26] D. Hanisch, A. Zien, R. Zimmer, T. Lengauer, Co‐clustering [43] D. Ostapenko, J.L. Burton, R. Wang, M.J. Solomon, Pseudo‐
of biological networks and gene expression data, Bioinfor‐ substrate inhibition of the anaphase‐promoting complex by
matics, 18 (Suppl 1) (2002) 145‐154. Acm1: regulation by proteolysis and Cdc28 phosphorylation,
[27] T.J. Hestilow, Y. Huang, Clustering of gene expression data Molecular and Cellular Biology, 28 (15) (2008) 4653‐4664.
based on shape similarity, Journal on Bioinformatics and Sys‐ [44] H. Pang, H. Zhao, Building pathway clusters from random
tems Biology, 2009 (1) (2009) 195712‐195724. forests classification using class votes, BMC Bioinformatics, 9
[28] C. Hlynialuk, R. Schierholtz, A. Vernooy, G. van der Merwe, (2008) 87.
Nsf1/Ypl230w participates in transcriptional activation dur‐ [45] S.S. Ray, S. Bandyopadhyay, S.K. Pal, Gene ordering in parti‐
ing non‐fermentative growth and in response to salt stress in tive clustering using microarray expressions, Journal Bios‐
saccharomyces cerevisiae, Microbiology, 154 (8) (2008) 2482‐ cience, 32 (5) (2007) 1019‐1025.
2491. [46] S.S. Ray, S. Bandyopadhyay, S.K. Pal, Combining multi‐
[29] J. Ihmels, G. Friedlander, S. Bergmann, O. Sarig, Y. Ziv, N. source information through functional annotation based
Barkai, Revealing modular organization in the yeast tran‐ weighting: gene function prediction in yeast, IEEE Transanc‐
scriptional network, Nature Genetics, 31 (4) (2002) 370‐377. tions on Biomedical Engineering, 56 (2) (2009) 229‐236.
[30] D.W. Kim, B.Y. Kang, Iterative clustering analysis for group‐ [47] G.D. Schuler, J.A. Epstein, H. Ohkawa, J.A. Kans, Entrez:
ing missing data in gene expression profiles, in: Proceedings molecular biology database and retrieval system, Methods
of the 10th Pacific‐Asia Conference on Knowledge Discovery Enzymol, 266 (1996) 141‐162.
and Data Mining, 2006, pp.129‐138. [48] A. Sharma, R. Podolsky, J. Zhao, R.A. McIndoe, A modified
[31] S.A. Krause, H. Xu, J.V. Gray, The synthetic genetic network hyperplane clustering algorithm allows for efficient and accu‐
around PKC1 identifies novel modulators and components of rate clustering of extremely large datasets, Bioinformatics, 25
protein kinase C signaling in saccharomyces cerevisiae, Euka‐ (9) (2009) 1152‐1157.
ryot Cell, 7 (11) (2008) 1880‐1887. [49] S. Srivastava, L. Zhang, R. Jin, C. Chan, A novel method in‐
[32] X.L. Li, Y.C. Tan, S.K. Ng, Systematic gene function predic‐ corporating gene ontology information for unsupervised
tion from gene expression data by using a fuzzy nearest‐ clustering and feature selection, PLoS ONE, 3 (12) (2008)
cluster method, BMC Bioinformatics, 7 (Suppl 4) (2006) S23. e3860.
[33] A. Li, D. Tuck, An effective tri‐clustering algorithm combin‐ [50] C.P. Tanya, G.C. Steven, Computational methods to identify
ing expression data with gene regulation information. Gene novel methyltransferases, BMC Bioinformatics, 10 (Suppl 13)
Regulation and Systems Biology, 3 (2009) 49‐64. (2009) P7.
[34] G. Li, Z Wang, Incorporating protein‐protein interactions [51] L. Tari, C. Baral, S. Kim, Fuzzy c‐means clustering with prior
knowledge in clustering gene expression data, in: Proceed‐ biological knowledge, Journal of Biomedical Informatics, 42
ings of the International Conference on Bioinformatics and (1) (2009) 74‐81.
Biomedical Engineering, 2008, pp. 207‐210. [52] G.C. Tseng, Penalized and weighted k‐means for clustering
[35] Z. Lu, L. Hunter, GO molecular function terms are predictive with scattered objects and prior information in high‐
of subcellular localization, in: Proceedings of the Pacific Sym‐ throughput biological data, Bioinformatics, 23 (17) (2007)
posium, 2005, pp. 151‐161. 2247‐2255.
[36] C. Martin‐Granados, S.P. Riechers, U. Stahl, C. Lang, Absence
of see1p, a widely conserved saccharomyces cerevisiae pro‐
[53] K. Tu, H. Yu, X.Y. Li, Combining gene expression profiles respectively. Currently, he is Director of Laboratory of Computational
and protein‐protein interaction data to infer gene functions, Intelligence and Biotechnology at the Universiti Teknologi Malaysia.
Journal of Biotechnology, 124 (3) (2006) 475‐485. His research interests include bioinformatics, artificial intelligence,
[54] R.J.P. van Berlo, L.F.A. Wessels, S.D.C. Martes, M.J.T. software agent, parallel computing, and web semantics. In March
Reinders, Predicting gene function by combining expression 2005, he was awarded the Young Researcher award by the Malay-
and interaction data, in: Proceedings of the IEEE Computa‐ sian Association of Research Scientists (MARS). Two of his inven-
tional Systems Bioinformatics Conference, 2005, pp.166‐167. tions, software products named 2D Engineering Drawing Extractor
[55] W. Wang, Y. Zhang, On fuzzy cluster validity indices, Fuzzy and 2D Design Structure Recognizer, have won 5 awards at the 21st
Sets and Systems, 158 (19) (2007) 2095‐2117. Invention and New Product Exposition held in Pittsburgh, USA in-
[56] X.L. Xie, G. Beni, A validity measure for fuzzy clustering. cluding the Best Invention of the Pacific Rim, and a gold medal
IEEE Transactions on Pattern Analysis and Machine Intelli‐ award at the 34th International Exhibition of Inventions of New Tech-
gence, 13 (8) (1991) 841‐847. niques and Products held in Geneva, Switzerland.
[57] D. Xutao, G. Huimin, A.H. Hesham, A hidden Markov model
approach to predicting yeast gene function from sequential Rathiah Hashim received her B.Sc. degree in Computer Science
gene expression data, International Journal of Bioinformatics from Wichita State University, USA, and M.Sc. degree in Computer
Research and Applications, 4 (3) (2008) 263‐273. Science from Universiti Teknologi Malaysia. She obtained her Ph.D.
[58] Y. Yuan, C.T. Li, R. Wilson, Partial mixture model for tight degree in visualization and psychology at Swansea University, UK.
clustering of gene expression time‐course, BMC Bioinformat‐ She is currently a Senior Lecturer at Faculty of Science Computer
ics, 9 (287) (2008) 1471‐2105. and Information Technology, Universiti Tun Hussein Onn Malaysia
[59] M.L. Zhang, J.M. Peña, V. Robles, Feature selection for multi‐ (UTHM). Her research area includes Video Visualization, Image
label naive Bayes classification, Information Sciences, 179 (19) Processing, Psychology (Visual Perception), and HCI.
(2009) 3218‐3229.
[60] M. Zhang, T. Therneau, M.A. McKenzie, P. Li, P. Yang, A
fuzzy c‐means algorithm using a correlation metrics and gene
ontology, in: Proceedings of the International Conference on
Pattern Recognition, 2008, pp.1‐4.
[61] D. Zhu, Semi‐supervised gene shaving method for predicting
low variation biological pathways from genome‐wide data,
BMC Bioinformatics, 10 (Supppl 1) (2009) S54.
[62] A. Zien, R. Küffner, R. Zimmer, T. Lengauer, Analysis of gene
expression data with pathway scores, in: Proceedings of the
International Conference on Intelligence System Molecular
Biology, 2000, pp. 407‐417.
Shahreen Kasim is a doctoral candidate at the Faculty of Computer

Science and Information Systems, the Universiti Teknologi Malaysia.
She received BSc degrees in Computer Science in 2003, and MSc
degree in Information Technology in 2005 both from the Universiti
Teknologi Malaysia. She is also a reviewer at Journal of Bioinformat-
ics. Her research interests include bioinformatics, gene expression
analysis, and preprocessing methods such as dimension reduction
and missing value algorithms. Currently, she is a tutor at Faculty of
Science Computer and Information Technology, Universiti Tun Hus-
sein Onn Malaysia (UTHM).
Safaai Deris is a Professor of Artificial Intelligence and Software

Engineering at the Faculty of Computer Science and Information
System, Deputy Dean at the School of Graduate Studies, and Direc-
tor of Laboratory of Artificial Intelligence and Bioinformatics at the
Universiti Teknologi Malaysia. He received the MEng degree in In-
dustrial Engineering, and the DEng degree in Computer and System
Sciences, both from the Osaka Prefecture University, Japan, in 1989
and 1997 respectively. His recent academic interests include the
application and development of intelligent techniques in planning,
scheduling, and bioinformatics.
Razib M. Othman is a Senior Lecturer at the Faculty of Computer

Science and Information Systems, the Universiti Teknologi Malaysia.
He received the BSc, MSc, and PhD degrees in Computer Science
from the Universiti Teknologi Malaysia, in 1999, 2003, and 2008
