BIOINFORMATICS
Vol. 20 no. 1 2004, pages 21–28
DOI: 10.1093/bioinformatics/btg366

Prediction of protein subcellular locations using fuzzy k-NN method
Ying Huang and Yanda Li
State Key Laboratory of Intelligent Technology and Systems, Department
of Automation, Institute of Bioinformatics, Tsinghua University, Beijing 100084,
People's Republic of China
Received on December 10, 2002; revised on April 23, 2003; accepted on July 14, 2003
INTRODUCTION
With the progress of genome sequencing projects, an
enormous amount of raw sequence data has accumulated in
databanks. This raises the challenge of understanding the
functions of the many genes identified by large-scale sequencing projects.
Protein localization data are a valuable information resource
helpful in elucidating protein functions (Chou and Elrod,
1999a,b; Chou, 2000b). Experimental determination of subcellular location is mainly accomplished by three approaches:
cell fractionation, electron microscopy and fluorescence
microscopy (Murphy et al., 2000). By immunolocalization
of epitope-tagged gene products, Kumar et al. (2002) have
determined the localization of 2744 yeast proteins. However,
currently it is still time-consuming and costly to acquire the
knowledge solely through experiments. It is highly
desirable to predict a protein's subcellular locations automatically from its sequence. Since the pioneering efforts of Nakai
and Kanehisa (1991, 1992), there have been several attempts
to predict subcellular locations systematically from protein
sequence.

Bioinformatics 20(1) © Oxford University Press 2004; all rights reserved.
Most of the existing prediction methods fall into two
categories: one is based on prediction of individual sorting signals; the other is based on amino acid composition
(Nakai, 2000). Nakai and Kanehisa (1991, 1992) were the
first to propose predicting the subcellular location of
proteins from their N-terminal sorting signals. This
approach was eventually integrated into the PSORT prediction
system (Nakai and Horton, 1999). Von Heijne (1992) and
Nielsen et al. (1997, 1999) worked extensively on identifying individual sorting signals using neural networks. They then
combined these individual predictions into an integrated
system, TargetP (Emanuelsson et al., 2000), for subcellular
location prediction. A review of the prediction of protein signal
sequences can be found in Chou (2002a). However, in the systematic annotation of open reading frames found in a genome,
the assignments of 5′-regions are often unreliable. Therefore,
prediction based on sorting signals is problematic when
signals are missing or only partially included (Reinhardt and
Hubbard, 1998).
Prediction based on amino acid composition was suggested
by Nakashima and Nishikawa (1994), who proposed an
algorithm to discriminate between intracellular and extracellular proteins by amino acid composition. Subsequently,
many methods have used amino acid composition to predict subcellular location. Cedano et al. (1997) proposed an algorithm
called ProtLock using the Mahalanobis distance (Chou, 1995).
Reinhardt and Hubbard (1998) used neural networks. Chou
and Elrod (1998, 1999a,b) proposed a covariant discrimination algorithm (Zhou and Assa-Munt, 2001). Zhou and Doctor
(2003) also used it for subcellular location prediction of apoptosis proteins. Other methods were based on Markov chain
models (Yuan, 1999) and support vector machines (SVM) (Hua
and Sun, 2001).
Predictions based only on amino acid composition may
lose some sequence-order information, but incorporating
this information may improve prediction performance.
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on February 6, 2016
ABSTRACT
Motivation: Protein localization data are a valuable
information resource helpful in elucidating protein functions.
It is highly desirable to predict a protein's subcellular locations
automatically from its sequence.
Results: In this paper, the fuzzy k-nearest neighbors (k-NN)
algorithm is introduced to predict proteins' subcellular
locations from their dipeptide composition. The prediction is
performed with a new data set derived from version 41.0 of the
SWISS-PROT databank, and an overall predictive accuracy of about
80% is achieved in a jackknife test. The result demonstrates the applicability of this relatively simple method and
a possible improvement of prediction accuracy for protein
subcellular locations. We also applied this method to annotate
six entirely sequenced proteomes, namely Saccharomyces
cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Oryza sativa, Arabidopsis thaliana and a subset of all
human proteins.
Availability: Supplementary information and subcellular location annotations for eukaryotes are available at
http://166.111.30.65/hying/fuzzy_loc.htm
Contact: hying99@mails.tsinghua.edu.cn
Y.Huang and Y.Li
MATERIALS AND METHODS

Sequence data
The data were selected from all eukaryotic proteins with
annotated subcellular location in SWISS-PROT release 41.0
(Boeckmann et al., 2003). All proteins with ambiguous words,
such as PROBABLE, POTENTIAL, POSSIBLE and BY
SIMILARITY, and proteins with multiple annotated locations were excluded. Transmembrane proteins were also
excluded because they can be predicted quite reliably by
known methods (Rost et al., 1996; Hirokawa et al., 1998;
Lio and Vannucci, 2000; Krogh et al., 2001). The remaining
12 865 proteins constitute our raw data set (Data_SWISS). To
reduce bias and to investigate the relation between prediction
accuracy and sequence identity in the data set, we also established two subsets of Data_SWISS using the database
search tool BLAST (Altschul et al., 1990, 1997). They
are Data_80 and Data_50, with pairwise sequence identity (the
number of identical residues in an alignment of two proteins
divided by the alignment length, which can be obtained directly
from BLAST) less than 80 and 50%, respectively. The
numbers of proteins and their distributions in 11 categories are
listed in Table 1. All the SWISS-PROT codes in the three data sets
are available at http://166.111.30.65/hying/fuzzy_loc.htm.
Here, we concentrate on the homology-reduced
Data_80 and present the results derived from Data_50 in the
Supplementary material.
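The sequence identity measure defined above can be illustrated with a short sketch. It assumes that the two sequences have already been aligned (for example by BLAST) and are supplied as equal-length strings in which "-" marks a gap; the function name `pairwise_identity` and the example fragments are hypothetical and are not taken from the data sets.

```python
def pairwise_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical residues divided by alignment length (gap columns included)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both residues match and are not gaps.
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return identical / len(aligned_a)


# Hypothetical alignment of two short fragments.
identity = pairwise_identity("MKT-AYIAK", "MKTLAYLAK")
print(f"{identity:.1%}")
```

Under this definition, a pair with identity of 80% or more could not be kept together in Data_80; the example pair (7 identical columns out of 9) falls just below that threshold.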
We also obtained all proteins belonging to six entirely
sequenced eukaryotic proteomes (Homo sapiens, Drosophila
melanogaster, Caenorhabditis elegans, Saccharomyces
cerevisiae, Arabidopsis thaliana and Oryza sativa) from the
SWISS-PROT+TREMBL databank for entire-proteome
predictions.

Table 1. Eukaryotic sequences within each subcellular location group of the data sets used in this study

Cellular location        Data_SWISS(a)   Data_80(b)   Data_50(c)
Cytoplasm                         2465         1251          622
Nuclear                           3419         2152         1188
Mitochondria                      1106          692          424
Extracellular                     4228         2135          915
Golgi apparatus                     34           31           26
Chloroplast                       1145          645          225
Endoplasmic reticulum              137           82           45
Cytoskeleton                        24           10            7
Vacuole                             54           41           29
Peroxisome                         122           81           47
Lysosome                           131           83           44
Total proteins                   12865         7203         3572

(a) Number of proteins with known localization found in version 41.0 of SWISS-PROT.
(b) Derived from Data_SWISS with pairwise sequence identity <80%.
(c) Derived from Data_SWISS with pairwise sequence identity <50%.
Algorithm
Instead of amino acid composition, we use a protein's
dipeptide composition (van Heel, 1991) to represent the protein
sequence as a fixed-length feature vector. The dipeptide composition representation can be considered a kind of n-gram
method, which was first proposed by Wu et al. (1992)
for sequence encoding. This method extracts and counts
the occurrences of n consecutive residues (n-grams) from a
sequence string in a sliding-window fashion. With 20 amino acids there are 20 × 20 = 400 possible 2-grams, so the counts of
all 2-gram patterns form a 400-dimensional vector, which can be
used to represent the protein sequence. Dipeptide composition (the 2-gram method) has been used to predict protein families
(Wang et al., 1998). By using dipeptide composition for
sequence coding, we can incorporate some sequence-order
information while keeping the dimension of the feature vector
moderate.
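The sliding-window 2-gram encoding described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `dipeptide_composition`, `DIPEPTIDES` and `INDEX` are hypothetical, windows containing non-standard residues are skipped by assumption, and dividing the counts by their total is an assumed normalization that the text does not specify.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dipeptide: i for i, dipeptide in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence: str, normalize: bool = True) -> list[float]:
    """Count overlapping 2-grams and return a 400-dimensional vector."""
    counts = [0.0] * len(DIPEPTIDES)
    residues = sequence.upper()
    for i in range(len(residues) - 1):
        position = INDEX.get(residues[i:i + 2])
        if position is not None:  # assumption: skip non-standard residues
            counts[position] += 1.0
    total = sum(counts)
    if normalize and total > 0:
        # assumption: frequencies rather than raw counts
        counts = [count / total for count in counts]
    return counts


# Hypothetical six-residue fragment with the 2-grams MK, KK, KL, LL and LA.
vector = dipeptide_composition("MKKLLA")
print(len(vector), vector[INDEX["KK"]])
```

Using frequencies instead of raw counts keeps proteins of different lengths on a comparable scale when Euclidean distances are computed between their vectors.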
The k-nearest neighbors (k-NN) algorithm is a simple
non-parametric classification algorithm (Duda et al., 2000).
Despite its simplicity, it can give competitive performance
compared with many other methods. It is widely used in machine
learning and has numerous variations. Given a test sample of
unknown label, it finds the k nearest neighbors in the training set and assigns
a label to the test sample according to the labels of those
neighbors. In biological and medical data classification problems, combining fuzzy set theory with the k-NN algorithm can
often improve classification performance (Keller et al., 1985;
Bezdek et al., 1993; Leszczynski et al., 1999). Zhang et al.
(1995) have also used fuzzy clustering to predict protein structural class. Therefore, we used the fuzzy k-NN algorithm to
predict subcellular locations.
Several approaches incorporate sequence-order information to improve prediction performance. Chou
(2000a) was the first to propose an augmented covariant
discrimination algorithm that incorporates the quasi-sequence-order
effect, and a remarkable improvement in prediction quality was achieved. Subsequently, Chou (2001)
introduced a novel concept, the so-called pseudo-amino acid
composition, to reflect the protein sequence-order effect in
terms of a set of discrete numbers. Recently, Cai et al. (2002)
used SVM incorporating the quasi-sequence-order effect. Another novel
concept, the so-called functional domain composition, was
introduced by Chou and Cai (2002) for the representation
of protein sequences.
We introduce the fuzzy k-NN method in this paper to predict
proteins' subcellular locations based on dipeptide composition. Dipeptide composition can be considered another
representation of proteins that incorporates neighborhood
information. High prediction accuracy was obtained in
a jackknife test. The current approach can not only play an
important complementary role to previous powerful methods
(Chou, 2000a, 2001; Cai et al., 2002), but also be helpful
for this new branch of proteomics (Chou, 2002b). Finally,
we applied our method to annotate six entirely sequenced
eukaryotic proteomes.
Measurement accuracy
We use the jackknife test for cross-validation. In comparison with
the subsampling test or the independent data set test, the jackknife
test is thought to be more rigorous and reliable (Mardia et al.,
1979). Chou and Zhang (1995) also provided a comprehensive
discussion of this issue. During the jackknife
test, each protein is singled out in turn as a test sample, and the
remaining proteins are used as the training set to calculate the test
sample's membership and predict its class. The prediction
quality was evaluated by the overall prediction accuracy and the
prediction accuracy for each location.
\[
\text{overall accuracy} = \frac{\sum_{s=1}^{k} p(s)}{N} \tag{2}
\]
\[
\text{accuracy}(s) = \frac{p(s)}{\text{obs}(s)} \tag{3}
\]
where N is the total number of sequences, k is the class number, obs(s) is the number of sequences observed in location
s and p(s) is the number of correctly predicted sequences in
location s.
The other measure of prediction accuracy is the Matthews correlation coefficient (MCC) (Matthews, 1975) between the
observed and predicted locations over a data set, given by:
\[
\mathrm{MCC}(s) = \frac{p(s)\,n(s) - u(s)\,o(s)}{\sqrt{(p(s) + u(s))\,(p(s) + o(s))\,(n(s) + u(s))\,(n(s) + o(s))}} \tag{4}
\]
Here, p(s) is the number of properly predicted proteins in
location s, n(s) is the number of correctly predicted proteins
not in location s, u(s) is the number of under-predicted and
o(s) is the number of over-predicted sequences.
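The jackknife procedure and Equations (2) to (4) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: all function names are hypothetical, under-predicted sequences are taken to be those observed in location s but predicted elsewhere (so that obs(s) = p(s) + u(s)), over-predicted sequences are those predicted in s but observed elsewhere, and the toy labels and the 1-NN helper exist only to exercise the code.

```python
from math import sqrt


def jackknife_predict(samples, labels, classify):
    """Leave-one-out test: predict each sample from all the others."""
    samples, labels = list(samples), list(labels)
    predictions = []
    for i, test_sample in enumerate(samples):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(classify(train_x, train_y, test_sample))
    return predictions


def location_counts(observed, predicted, s):
    """Return p(s), n(s), u(s), o(s) for location s."""
    pairs = list(zip(observed, predicted))
    p = sum(1 for obs, pred in pairs if obs == s and pred == s)
    u = sum(1 for obs, pred in pairs if obs == s and pred != s)  # under-predicted
    o = sum(1 for obs, pred in pairs if obs != s and pred == s)  # over-predicted
    n = len(pairs) - p - u - o
    return p, n, u, o


def overall_accuracy(observed, predicted):
    """Equation (2): correctly predicted sequences divided by N."""
    correct = sum(1 for obs, pred in zip(observed, predicted) if obs == pred)
    return correct / len(observed)


def accuracy(observed, predicted, s):
    """Equation (3); assumes location s occurs at least once in observed."""
    p, _, u, _ = location_counts(observed, predicted, s)
    return p / (p + u)  # obs(s) = p(s) + u(s)


def mcc(observed, predicted, s):
    """Equation (4); returns 0.0 when the denominator vanishes."""
    p, n, u, o = location_counts(observed, predicted, s)
    denominator = sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denominator if denominator else 0.0


def nearest_neighbor(train_x, train_y, x):
    """Toy 1-NN classifier on 1-D points, used only to exercise the jackknife."""
    best = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[best]


# Hypothetical observed and predicted locations for six sequences.
observed = ["nuc", "nuc", "cyt", "cyt", "ext", "ext"]
predicted = ["nuc", "cyt", "cyt", "cyt", "ext", "nuc"]
print(overall_accuracy(observed, predicted))
print(accuracy(observed, predicted, "nuc"), mcc(observed, predicted, "nuc"))

jackknife = jackknife_predict(
    [0.0, 0.1, 1.0, 1.1], ["a", "a", "b", "b"], nearest_neighbor
)
print(jackknife)
```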
RESULTS AND DISCUSSION

Prediction accuracy of the fuzzy k-NN method
Tests were carried out with various values of the fuzzy strength
parameter m and the number of nearest neighbors k. Using
leave-one-out cross-validation, we found that the best result was
achieved when m = 1.05. We then calculated the overall prediction accuracy with m fixed at 1.05. For
Data_80, the dependence of the overall prediction accuracy on
the number of nearest neighbors, k, is shown in Figure 1. The
prediction accuracy does not change significantly once k is greater than or equal to 15.
Therefore, we selected k = 15 and m = 1.05 for the subsequent analysis. A similar result was obtained on Data_50,
which can be found in Supplementary Figure 4.
Performance related to thresholds of similarity

Our method relies on sequence information, so predictive
accuracy is closely related to pairwise sequence identity in
the data set. To investigate the influence of pairwise sequence identity on prediction performance, we
applied our method to two data sets with different sequence identity thresholds,
Data_80 and Data_50.
The jackknife test results for Data_80 and Data_50 are
listed in Table 2. The two data sets
yielded different overall predictive accuracies. For Data_80,
our method achieved an overall accuracy of 80.1%. There are
3572 sequences in Data_50, which is about 50% of Data_80.
For this data set, the predictive accuracy is 58.1%. A drop in
the accuracy for cytoplasmic proteins (from 70.2 to 35.4%) is
a main reason for the decrease. Mitochondrial and chloroplast
proteins also have low predictive accuracy in Data_50 (36.6 and 32.4%, respectively). However, the predictive accuracy for extracellular proteins only fell from 93.7 to
81.6%, which is still fairly high. The prediction accuracy for nuclear
proteins also remains at 71.5% in Data_50. Therefore, the influence of pairwise similarity on predictive accuracy varies among
compartments. This may indicate that the degree of sequence conservation differs between these groups. This result merits
deeper investigation.
Confusion matrix analysis

To evaluate our approach in detail, a confusion matrix was constructed from the results of the jackknife test and is shown
in Table 3. (The confusion matrix for Data_50 can be found in
Supplementary Table 6.) Tables 2 and 3 show
that predictive accuracy varies substantially with subcellular
location. Nuclear and extracellular proteins can be inferred
more reliably than other classes. On the other hand, performance for cytoplasmic and mitochondrial proteins is not very
good.
The fuzzy k-NN method assigns a test sample fuzzy memberships in the different categories rather than a single class as in k-NN. Class memberships are assigned
to the test sample according to the following relationship:
\[
u_i(x) = \frac{\sum_{j=1}^{k} u_i\left(x^{(j)}\right) \left\| x - x^{(j)} \right\|^{-2/(m-1)}}{\sum_{j=1}^{k} \left\| x - x^{(j)} \right\|^{-2/(m-1)}}, \qquad i = 1, \ldots, c \tag{1}
\]
where m is the fuzzy strength parameter, which determines how
heavily the distance is weighted when calculating each neighbor's contribution to the membership value, and c is the number of classes. The variable k
is the number of nearest neighbors, and $u_i(x)$ is the membership
of the test sample x in class i. $\|x - x^{(j)}\|$ is the distance
between the test sample x and its j-th nearest training sample $x^{(j)}$.
Various distance measures can be used, such as the Euclidean,
absolute and Mahalanobis distance measures. In the present
study, we used the Euclidean distance measure. $u_i(x^{(j)})$ is
the membership value of the j-th neighbor in the i-th class;
it can be assigned in several ways. The crisp way is to
assign 1 if $x^{(j)}$ belongs to the i-th class and 0 otherwise. A
fuzzier alternative is to assign the training samples'
memberships based on the k-NN rule. In our analysis, we
define the memberships in the crisp way. After the memberships
for the test sample have been calculated, it is assigned to the class
with the highest membership value.
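The membership rule of Equation (1) with crisp neighbor memberships can be sketched as below. This is a minimal sketch, not the authors' code: the function names are hypothetical, and two implementation choices are assumptions made here for numerical safety. First, every distance is divided by the smallest neighbor distance before the power is taken; this common factor cancels between numerator and denominator, but it avoids overflow when m = 1.05 turns the exponent into -40. Second, if the test sample coincides with a training sample, only the zero-distance neighbors are given weight.

```python
from collections import defaultdict
from math import dist, exp, log


def fuzzy_knn_memberships(train_x, train_y, x, k=15, m=1.05):
    """Class memberships u_i(x) of test sample x following Equation (1)."""
    if m <= 1.0:
        raise ValueError("fuzzy strength parameter m must be greater than 1")
    # k nearest training samples by Euclidean distance
    neighbors = sorted(
        ((dist(x, xj), yj) for xj, yj in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    d_min = neighbors[0][0]
    exponent = -2.0 / (m - 1.0)  # -40 for m = 1.05
    weights = []
    for d, _ in neighbors:
        if d_min == 0.0:
            # x coincides with a training sample: only exact matches count
            weights.append(1.0 if d == 0.0 else 0.0)
        else:
            # (d / d_min) ** exponent, computed in log space; dividing by
            # d_min rescales every weight equally and cancels in the ratio
            weights.append(exp(exponent * log(d / d_min)))
    total = sum(weights)
    memberships = defaultdict(float)
    for weight, (_, label) in zip(weights, neighbors):
        # crisp u_i(x^(j)): 1 for the neighbor's own class, 0 otherwise
        memberships[label] += weight / total
    return dict(memberships)


def fuzzy_knn_classify(train_x, train_y, x, k=15, m=1.05):
    """Assign x to the class with the highest membership value."""
    memberships = fuzzy_knn_memberships(train_x, train_y, x, k=k, m=m)
    return max(memberships, key=memberships.get)


# Hypothetical 2-D training set; real inputs would be 400-dimensional vectors.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (3.0, 3.0)]
train_y = ["cyt", "cyt", "nuc", "nuc"]
memberships = fuzzy_knn_memberships(train_x, train_y, (0.1, 0.2), k=3, m=2.0)
predicted_class = fuzzy_knn_classify(train_x, train_y, (0.1, 0.2), k=3)
print(memberships, predicted_class)
```

With m close to 1 the nearest neighbor dominates the memberships, whereas larger values of m spread the weight more evenly over the k neighbors.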
Table 2. The predictive accuracy for subcellular locations of different data sets (corresponding to different thresholds of pairwise sequence identity)

                         Data_80                 Data_50
Cellular location        Accuracy (%)   MCC      Accuracy (%)   MCC
Cytoplasm                70.2           0.67     35.4           0.31
Nuclear                  81.9           0.78     71.5           0.58
Mitochondria             59.0           0.62     36.6           0.30
Extracellular            93.7           0.79     81.6           0.54
Golgi apparatus          16.1           0.32     15.4           0.27
Chloroplast              84.7           0.80     32.4           0.36
Endoplasmic reticulum    57.3           0.71     11.1           0.22
Cytoskeleton             40.0           0.57     28.6           0.44
Vacuole                  34.1           0.55     6.9            0.16
Peroxisome               56.8           0.68     14.9           0.27
Lysosome                 67.5           0.74     20.5           0.31
Overall accuracy         80.1                    58.1

Table 3 shows that cytoplasmic proteins
are often confused with nuclear and extracellular proteins, and that
proteins from mitochondria are most often misassigned to the extracellular space. Accuracy for the minor classes,
which contain very few proteins (Golgi, endoplasmic reticulum, cytoskeleton, vacuole, peroxisome and lysosome), is
not very good either. About 30% of vacuole proteins are
classified as extracellular; perhaps they are involved in the
secretory pathway.

Table 3. Confusion matrix for prediction results of Data_80

Actual   Predicted group
group      Cytop    Nuc    Mit    Ext    Gol    Chl   Endo  Cytos    Vac   Pero   Lyso    Sum
Cytop        878    136     51    146      1     29      2      0      0      5      3   1251
Nuc          107   1762     44    195      2     40      1      1      0      0      0   2152
Mit           71     37    408    119      0     54      0      0      0      3      0    692
Ext           43     53     19   2001      0     11      0      0      2      0      6   2135
Gol            5      7      3      9      5      0      2      0      0      0      0     31
Chl           26     11     30     30      0    546      0      0      0      2      0    645
Endo          19      6      2      5      0      2     47      0      0      0      1     82
Cytos          1      4      0      1      0      0      0      4      0      0      0     10
Vac            3      6      2     12      0      1      1      0     14      0      2     41
Pero          11      4      6      8      0      6      0      0      0     46      0     81
Lyso           5      5      2     14      0      1      0      0      0      0     56     83
SUM         1169   2031    567   2540      8    690     53      5     16     56     68   7203

Matrix delineates distribution of actual compared with predicted class membership.
Abbreviations for localizations: Cytop: Cytoplasm; Nuc: Nuclear; Mit: Mitochondria;
Ext: Extracellular; Gol: Golgi apparatus; Chl: Chloroplast; Endo: Endoplasmic reticulum; Cytos: Cytoskeleton; Vac: Vacuole; Pero: Peroxisome and Lyso: Lysosome.
Correct classifications lie on the diagonal.
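The per-location accuracies of Table 2 follow directly from the confusion matrix of Table 3, which the short check below illustrates. The matrix values are those of Table 3 (rows are actual groups, columns are predicted groups); the function names are hypothetical and chosen only for this sketch.

```python
LOCATIONS = ["Cytop", "Nuc", "Mit", "Ext", "Gol", "Chl",
             "Endo", "Cytos", "Vac", "Pero", "Lyso"]

# Rows: actual group; columns: predicted group (Table 3, Data_80).
CONFUSION = [
    [878, 136, 51, 146, 1, 29, 2, 0, 0, 5, 3],
    [107, 1762, 44, 195, 2, 40, 1, 1, 0, 0, 0],
    [71, 37, 408, 119, 0, 54, 0, 0, 0, 3, 0],
    [43, 53, 19, 2001, 0, 11, 0, 0, 2, 0, 6],
    [5, 7, 3, 9, 5, 0, 2, 0, 0, 0, 0],
    [26, 11, 30, 30, 0, 546, 0, 0, 0, 2, 0],
    [19, 6, 2, 5, 0, 2, 47, 0, 0, 0, 1],
    [1, 4, 0, 1, 0, 0, 0, 4, 0, 0, 0],
    [3, 6, 2, 12, 0, 1, 1, 0, 14, 0, 2],
    [11, 4, 6, 8, 0, 6, 0, 0, 0, 46, 0],
    [5, 5, 2, 14, 0, 1, 0, 0, 0, 0, 56],
]


def per_location_accuracy(matrix):
    """Diagonal entry divided by the row sum, in percent."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Sum of the diagonal divided by the total number of sequences, in percent."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


def most_common_error(matrix, i):
    """Index of the wrong class that actual class i is most often assigned to."""
    errors = [(count, j) for j, count in enumerate(matrix[i]) if j != i]
    return max(errors)[1]


for name, value in zip(LOCATIONS, per_location_accuracy(CONFUSION)):
    print(f"{name:6s}{value:6.1f}")
print(f"overall {overall_accuracy(CONFUSION):.1f}")
```

For the mitochondrial row, `most_common_error` points to the extracellular column, matching the observation in the text.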
Reliability index calculation

Fig. 1. The dependence of the overall prediction accuracy on the number of nearest neighbors, k, used in the fuzzy k-NN classification (fuzzy
strength parameter m = 1.05). These results were obtained on Data_80 using the Euclidean distance measure.

When a neural network is used for subcellular location prediction, the difference between the highest and the next highest
network output scores is used as a reliability index (RI) for a
prediction (Reinhardt and Hubbard, 1998; Emanuelsson et al.,
2000). As the fuzzy k-NN method assigns class memberships to
an input pattern x rather than a particular class, the membership values of an input pattern provide a level of
confidence in the resulting classification. We can define an RI
in the same way. The assignment of the RI is based on the difference (diff) between the highest and the next highest membership
values for a prediction. RI is defined as

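\[
\mathrm{RI} =
\begin{cases}
\mathrm{INTEGER}(\mathrm{diff} \times 10) + 1, & 0 \le \mathrm{diff} < 0.9 \\
10, & \mathrm{diff} \ge 0.9
\end{cases} \tag{5}
\]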
The RI assignment gives some information about the
certainty of the classification decision. Figure 2 shows the
expected prediction accuracy and the fractions of sequences
with a given RI value (a similar figure for Data_50 can be found in
Figure 5 of the Supplementary material). About 60%
of all sequences have RI = 10, with an expected prediction
accuracy >95%. The average prediction accuracy was also calculated for RI above a given threshold, as shown in Figure 3
(a similar figure for Data_50 can be found in Figure 6 of the Supplementary material). For example, about 80% of sequences
have RI ≥ 5, and of these sequences about 90% were correctly
predicted by the fuzzy k-NN method.
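The reliability index can be computed from a dictionary of class memberships as in the sketch below. It assumes that INTEGER takes the integer part of diff × 10, so that RI runs from 1 to 9 for diff below 0.9 and equals 10 otherwise; the function name and the example membership values are hypothetical.

```python
def reliability_index(memberships):
    """RI from the gap between the two highest membership values."""
    ranked = sorted(memberships.values(), reverse=True)
    second = ranked[1] if len(ranked) > 1 else 0.0
    diff = ranked[0] - second
    if diff >= 0.9:
        return 10
    return int(diff * 10) + 1  # assumption: INTEGER truncates diff * 10


# Hypothetical membership vectors for two predictions.
confident = reliability_index({"nuc": 0.97, "cyt": 0.03})
uncertain = reliability_index({"nuc": 0.60, "cyt": 0.35, "ext": 0.05})
print(confident, uncertain)
```

A prediction could then be accepted only when its RI reaches a chosen threshold, trading coverage for accuracy in the way Figures 2 and 3 describe.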
Entire proteome annotation

Using our method and sequences from the SWISS-PROT+
TREMBL databank, we obtained subcellular location
annotations for six proteomes. Because we excluded membrane proteins from our prediction, we first screened
sequences without annotated subcellular location using
TMHMM (Krogh et al., 2001, http://www.cbs.dtu.dk/services/TMHMM).
Sequences with the TMHMM prediction result
PredHel=0 were considered to be soluble proteins and
were predicted with our fuzzy k-NN classifier. Predicted distributions for six major subcellular locations are listed in
Table 4, and the annotations for individual proteins can be
found at http://166.111.30.65/hying/fuzzy_loc.htm. Because
we included chloroplast sequences in the prediction, some proteins of YEAST, HOMO, CAEEL and DROME were also
predicted to be chloroplast proteins. This minor mistake could
be corrected in further research by excluding plant proteins from the prediction.
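The PredHel=0 filter described above can be sketched as a small parser. It assumes TMHMM's one-line-per-protein short output, in which an identifier is followed by tab-separated key=value fields such as PredHel; the function names and the two example lines are hypothetical and were constructed for illustration, not taken from real TMHMM runs.

```python
def parse_tmhmm_short(line):
    """Split one short-format line into (identifier, {key: value})."""
    fields = line.strip().split("\t")
    attributes = dict(
        field.split("=", 1) for field in fields[1:] if "=" in field
    )
    return fields[0], attributes


def soluble_ids(lines):
    """Identifiers of sequences with no predicted transmembrane helix."""
    ids = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        protein_id, attributes = parse_tmhmm_short(line)
        if int(attributes["PredHel"]) == 0:
            ids.append(protein_id)
    return ids


# Hypothetical short-format lines for two made-up proteins.
example = [
    "prot_A\tlen=250\tExpAA=0.12\tFirst60=0.00\tPredHel=0\tTopology=o",
    "prot_B\tlen=120\tExpAA=22.10\tFirst60=22.10\tPredHel=1\tTopology=i12-34o",
]
print(soluble_ids(example))
```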
This prediction result gives a rough estimate of the
protein distribution in the cell. The fractions of
cytoplasmic and mitochondrial proteins in the total proteomes do
not change significantly across organisms. However, the fractions of nuclear proteins in the YEAST and DROME proteomes
are larger than in the other proteomes. Is this phenomenon just
prediction bias, or does it reflect some difference in the
organization of these proteomes? In a recent study of the YEAST proteome (Kumar et al., 2002), 2452 soluble cytoplasmic proteins
were estimated. That result differs from their previous study (Drawid and Gerstein, 2000), and also
from our prediction. The difference may be caused by the use of
different training sets and protein features. It also indicates
that such genome-wide analysis would be more reliable if
different experimental and prediction methods were integrated.
Fig. 2. Average predictive accuracy related to RI. We also give fractions of sequences with various RI values. For example, about 5% of all
sequences have RI = 9, and of these sequences about 80% are correctly classified. The figure is based on Data_80.
Table 4. Distribution of predicted subcellular localization for six proteomes(a)

Organism  Total(b)  MEMBRANE          NUCLEAR         CYTOPLA        MITOCH        EXTRACELL       CHLOROPLAST
ORYSA         9420  1662 (17.6%)(c)   2088 (22.2%)    1044 (11.1%)   513 (5.5%)    3410 (36.2%)    622 (6.6%)
ARATH        36528  8282 (22.7%)      7431 (20.3%)    5122 (14.0%)   2086 (5.7%)   10781 (29.5%)   2499 (6.8%)
YEAST         6905  1489 (21.6%)      1997 (28.9%)    876 (12.7%)    470 (6.8%)    1795 (26.0%)    221 (3.2%)
CAEEL        20887  6572 (31.5%)      4316 (20.7%)    2405 (11.5%)   898 (4.3%)    5833 (27.9%)    744 (3.6%)
DROME        19978  4294 (21.5%)      5914 (29.6%)    2558 (12.8%)   1018 (5.1%)   5380 (26.9%)    587 (2.9%)
HOMO         44402  9146 (20.6%)      10081 (22.7%)   4639 (10.5%)   1561 (3.5%)   17978 (40.5%)   615 (1.4%)

Abbreviations for organism: ORYSA: Oryza sativa; ARATH: Arabidopsis thaliana; YEAST: Saccharomyces cerevisiae; CAEEL: Caenorhabditis elegans; DROME: Drosophila
melanogaster; HOMO: Homo sapiens.
(a) Only annotations of the six major subcellular locations are listed.
(b) Number of protein sequences in the SWISS-PROT + TREMBL databank.
(c) Fraction of total proteins.
Comparison with other methods

We also applied our method to the data set used by other
groups (Reinhardt and Hubbard, 1998; Yuan, 1999; Hua and
Sun, 2001), so that we could make a direct comparison with
other methods. There are 2427 eukaryotic proteins in this
data set: 684 cytoplasmic, 325 extracellular, 1097 nuclear and
321 mitochondrial proteins. Reinhardt and Hubbard (1998)
first used a neural network approach and achieved 66% accuracy
on this data set. Yuan (1999) used Markov chain models and
achieved 73% accuracy, while Hua and Sun (2001) used an SVM
approach and achieved 79.4% accuracy. We achieved 85.2%
accuracy in a jackknife test. The details of the comparison
can be found in Table 5. The Reinhardt–Hubbard data set is
relatively old and includes only four subcellular locations. However, the
result demonstrates the applicability of this relatively simple
method and a possible improvement of prediction accuracy for
protein subcellular locations.
Fig. 3. Average prediction accuracy calculated cumulatively for RI above a given value. For example, about 75% of all sequences
have RI ≥ 6, and of these sequences about 92% are correctly predicted. The result is based on Data_80.

Table 5. Performance comparison with other methods by a jackknife test

                  Markov model          SVM                   Fuzzy k-NN(a)
Location          Accuracy (%)   MCC    Accuracy (%)   MCC    Accuracy (%)   MCC
Cytoplasmic       78.1           0.60   76.9           0.64   86.7           0.76
Extracellular     62.2           0.63   80.0           0.78   83.7           0.87
Mitochondrial     69.2           0.53   56.7           0.58   60.4           0.63
Nuclear           74.1           0.68   87.4           0.75   92.0           0.83
Total accuracy    73.0                  79.4                  85.2

(a) In this test, fuzzy strength parameter m = 1.05 and number of nearest neighbors k = 20.

CONCLUSION
In this paper, a fuzzy k-NN method based on proteins' dipeptide
composition was proposed for the prediction of subcellular locations. An advantage of the new method is that it incorporates
sequence-order effects into the prediction. The method was
applied to a new data set derived from version 41.0 of the
SWISS-PROT databank, and high predictive accuracy was
achieved in a jackknife test. This indicates that extracting
more useful information from the primary sequences can be
helpful in subcellular location prediction. The method
needs only raw sequence data, so it can be applied to infer the subcellular locations of proteins for which only sequence information is available.
As a demonstration, we have used it to annotate six eukaryotic proteomes. Integrated with other powerful algorithms
(Chou, 2000a, 2001; Cai et al., 2002), this method is anticipated to contribute to the systematic analysis of large amounts of
genome data.
ACKNOWLEDGEMENTS
The authors would like to thank Dr A. Reinhardt for providing his data set. Thanks to Jun Cai for helpful discussions and
Dr Liang Ji for valuable comments on the manuscript. We also
thank the anonymous reviewers for their helpful comments.
This work was funded by the National Natural Science Grant
in China (Nos 60171038 and 60234020) and the National
Basic Research Priorities Program of the Ministry of Science
and Technology (No. 2001CCA0). Y.H. also thanks Tsinghua
University Ph.D. Grant for the support.
REFERENCES
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bezdek,J.C., Hall,L.O. and Clarke,L.P. (1993) Review of MR image segmentation techniques using pattern recognition. Med. Phys., 20, 1033–1048.
Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I., Pilbout,S. and Schneider,M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
Cedano,J., Aloy,P., Perez-Pons,J.A. and Querol,E. (1997) Relation between amino acid composition and cellular location of proteins. J. Mol. Biol., 266, 594–600.
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002) Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J. Cell. Biochem., 84, 343–348.
Chou,K.C. (1995) A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins Struct. Funct. Genet., 21, 319–344.
Chou,K.C. (2000a) Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem. Biophys. Res. Commun., 278, 477–483.
Chou,K.C. (2000b) Review: prediction of protein structural classes and subcellular locations. Curr. Protein Peptide Sci., 1, 171–208.
Chou,K.C. (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Genet., 43, 246–255.
Chou,K.C. (2002a) Prediction of protein signal sequences. Curr. Protein Peptide Sci., 3, 615–622.
Chou,K.C. (2002b) A new branch of proteomics: prediction of protein cellular attributes. In Weinrer,P.W. and Lu,Q. (eds), Gene Cloning & Expression Technologies, Chapter 4. Eaton Publishing, Westborough, MA, pp. 57–70.
Chou,K.C. and Cai,Y.D. (2002) Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem., 277, 45765–45769.
Chou,K.C. and Elrod,D.W. (1998) Using discriminant function for prediction of subcellular location of prokaryotic proteins. Biochem. Biophys. Res. Commun., 252, 63–68.
Chou,K.C. and Elrod,D.W. (1999a) Protein subcellular location prediction. Protein Eng., 12, 107–118.
Chou,K.C. and Elrod,D.W. (1999b) Prediction of membrane protein types and subcellular locations. Proteins Struct. Funct. Genet., 34, 137–153.
Chou,K.C. and Zhang,C.T. (1995) Review: prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol., 30, 275–349.
Drawid,A. and Gerstein,M. (2000) A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J. Mol. Biol., 301, 1059–1075.
Duda,R.O., Hart,P.E. and Stork,D.G. (2000) Pattern Classification, 2nd edn. Wiley, New York.
Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne,G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005–1016.
van Heel,M. (1991) A new family of powerful multivariate statistical sequence analysis techniques. J. Mol. Biol., 220, 877–887.
von Heijne,G. (1992) Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol., 225, 487–494.
Hirokawa,T., Boon-Chieng,S. and Shigeki,M. (1998) SOSUI: classification and secondary structure prediction system for membrane proteins. Bioinformatics, 14, 378–379.
Hua,S. and Sun,Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721–728.
Keller,J.M., Gray,M.R. and Givens,J.A. (1985) A fuzzy k-nearest neighbour algorithm. IEEE Trans. Syst. Man Cybern., 15, 580–585.
Krogh,A., Larsson,B., von Heijne,G. and Sonnhammer,E.L. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol., 305, 567–580.
Kumar,A., Agarwal,S., Heyman,J.A., Matson,S., Heidtman,M., Piccirillo,S., Umansky,L., Drawid,A., Jansen,R., Liu,Y. et al. (2002) Subcellular localization of the yeast proteome. Genes Dev., 16, 707–719.
Leszczynski,K., Cosby,S., Bissett,R., Provost,D., Boyko,S., Loose,S. and Mvilongo,E. (1999) Application of a fuzzy pattern classifier to decision making in portal verification of radiotherapy. Phys. Med. Biol., 44, 253–269.
Lio,P. and Vannucci,M. (2000) Wavelet change-point prediction of transmembrane proteins. Bioinformatics, 16, 376–382.
Mardia,K.V., Kent,J.T. and Bibby,J.M. (1979) Multivariate Analysis. Academic Press, London, pp. 322 and 381.
Matthews,B.W. (1975) Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.
Murphy,R.F., Boland,M.V. and Velliste,M. (2000) Towards a systematics for protein subcellular location: quantitative description of protein localization patterns and automated analysis of fluorescence microscope images. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8, 251–259.
Nakai,K. (2000) Protein sorting signals and prediction of subcellular localization. Adv. Protein Chem., 54, 277–344.
Nakai,K. and Horton,P. (1999) PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem. Sci., 24, 34–36.
Nakai,K. and Kanehisa,M. (1991) Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins Struct. Funct. Genet., 11, 95–110.
Nakai,K. and Kanehisa,M. (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14, 897–911.
Nakashima,H. and Nishikawa,K. (1994) Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol., 238, 54–61.
Nielsen,H., Engelbrecht,J., Brunak,S. and von Heijne,G. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10, 1–6.
Nielsen,H., Brunak,S. and von Heijne,G. (1999) Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng., 12, 3–9.
Reinhardt,A. and Hubbard,T. (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res., 26, 2230–2236.
Rost,B., Fariselli,P. and Casadio,R. (1996) Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci., 5, 1704–1718.
Wang,H.C., Dopazo,J., de la Fraga,L.G., Zhu,Y.P. and Carazo,J.M. (1998) Self-organizing tree-growing network for the classification of protein sequences. Protein Sci., 7, 2613–2622.
Wu,C., Whitson,G., McLarty,J., Ermongkonchai,A. and Chang,T.C. (1992) Protein classification artificial neural system. Protein Sci., 1, 667–677.
Yuan,Z. (1999) Prediction of protein subcellular locations using Markov chain models. FEBS Lett., 451, 23–26.
Zhang,C.T., Chou,K.C. and Maggiora,G.M. (1995) Predicting protein structural classes from amino acid composition: application of fuzzy clustering. Protein Eng., 8, 425–435.
Zhou,G.P. and Assa-Munt,N. (2001) Some insights into protein structural class prediction. Proteins Struct. Funct. Genet., 44, 57–59.
Zhou,G.P. and Doctor,K. (2003) Subcellular location prediction of apoptosis proteins. Proteins Struct. Funct. Genet., 50, 44–48.
