BIOINFORMATICS
INTRODUCTION
With the progress in genome sequencing projects, an
enormous amount of raw sequence data is accumulating in
databanks. This raises the challenge of understanding the
functions of many genes from large-scale sequencing projects.
Protein localization data are a valuable information resource
helpful in elucidating protein functions (Chou and Elrod,
1999a,b; Chou, 2000b). Experimental determination of subcellular location is mainly accomplished by three approaches:
cell fractionation, electron microscopy and fluorescence
microscopy (Murphy et al., 2000). By immunolocalization
of epitope-tagged gene products, Kumar et al. (2002) have
determined the localization of 2744 yeast proteins. However,
determining subcellular location by experiment alone remains time-consuming and costly. It is therefore highly
desirable to predict a protein's subcellular location automatically from its sequence. Since the pioneering efforts of Nakai
and Kanehisa (1991, 1992), there have been several attempts
to predict subcellular locations systematically from protein
sequence.
Most of the existing prediction methods fall into two
categories: one is based on prediction of individual sorting signals; the other is based on amino acid composition
(Nakai, 2000). Nakai and Kanehisa (1991, 1992) were the
first to propose predicting the subcellular location of
proteins from their N-terminal sorting signals. This
approach was eventually integrated into the PSORT prediction
system (Nakai and Horton, 1999). Von Heijne (1992) and
Nielsen et al. (1997, 1999) worked extensively on identifying individual sorting signals using neural networks. These
individual predictions were then combined into an integrated
system, TargetP (Emanuelsson et al., 2000), for subcellular
location prediction. A review of the prediction of protein signal
sequences can be found in Chou (2002a). However, in the systematic annotation of open reading frames found in a genome,
the assignments of 5′-regions are often unreliable. Therefore,
prediction based on sorting signals is problematic when
signals are missing or only partially included (Reinhardt and
Hubbard, 1998).
Prediction based on amino acid composition was suggested
by Nakashima and Nishikawa (1994), who proposed an
algorithm to discriminate between intracellular and extracellular proteins by amino acid composition. Subsequently,
amino acid composition has been used for subcellular location prediction in many ways. Cedano et al. (1997) proposed an algorithm
called ProtLock using the Mahalanobis distance (Chou, 1995).
Reinhardt and Hubbard (1998) used neural networks. Chou
and Elrod (1998, 1999a,b) proposed a covariant discrimination algorithm (Zhou and Assa-Munt, 2001). Zhou and Doctor
(2003) also used it for subcellular location prediction of apoptosis proteins. Other methods were based on Markov chain
models (Yuan, 1999) and support vector machines (SVM) (Hua
and Sun, 2001).
Predictions based only on amino acid composition may
lose some sequence-order information, but incorporating
ABSTRACT
Motivation: Protein localization data are a valuable
information resource helpful in elucidating protein functions.
It is highly desirable to predict a protein's subcellular location
automatically from its sequence.
Results: In this paper, the fuzzy k-nearest neighbors (k-NN)
algorithm is introduced to predict proteins' subcellular
locations from their dipeptide composition. The prediction is
performed on a new data set derived from version 41.0 of the
SWISS-PROT databank, and an overall predictive accuracy of about
80% has been achieved in a jackknife test. The result demonstrates the applicability of this relatively simple method and
a possible improvement in prediction accuracy for protein
subcellular locations. We also applied this method to annotate
six entirely sequenced proteomes, namely Saccharomyces
cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Oryza sativa, Arabidopsis thaliana and a subset of all
human proteins.
Availability: Supplementary information and subcellular location annotations for eukaryotes are available at
http://166.111.30.65/hying/fuzzy_loc.htm
Contact: hying99@mails.tsinghua.edu.cn
Cellular location        Data_SWISS^a   Data_80^b   Data_50^c
Cytoplasm                        2465        1251         622
Nuclear                          3419        2152        1188
Mitochondria                     1106         692         424
Extracellular                    4228        2135         915
Golgi apparatus                    34          31          26
Chloroplast                      1145         645         225
Endoplasmic reticulum             137          82          45
Cytoskeleton                       24          10           7
Vacuole                            54          41          29
Peroxisome                        122          81          47
Lysosome                          131          83          44
Total proteins                 12 865        7203        3572
Algorithm
Instead of using amino acid composition, we use the
dipeptide composition of proteins (van Heel, 1991) to represent protein
sequences as fixed-length feature vectors. The dipeptide composition representation can be considered a kind of n-gram
method, which was first proposed by Wu et al. (1992)
for sequence encoding. This method extracts and counts
the occurrences of n consecutive residues (n-grams) from a
sequence string in a sliding-window fashion. With 20 amino acids
there are 20 × 20 = 400 possible 2-grams, so the counts of
all 2-gram patterns form a 400-dimensional vector that can be
used to represent the protein sequence. Dipeptide composition (the 2-gram method) has been used to predict protein family
(Wang et al., 1998). By using dipeptide composition for
sequence coding, we can incorporate some sequence-order
information while the dimension of the feature vector remains
moderate.
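The sliding-window 2-gram encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `AMINO_ACIDS`, `INDEX` and `dipeptide_composition` are hypothetical, the choice to skip windows containing non-standard residue symbols is an assumption, and the optional normalization to frequencies is also an assumption (the text only speaks of counts).

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence, normalize=True):
    """Return a 400-dimensional dipeptide (2-gram) vector for a sequence."""
    counts = [0.0] * len(DIPEPTIDES)
    seq = sequence.upper()
    total = 0
    # Slide a window of width 2 along the sequence, one residue at a time.
    for i in range(len(seq) - 1):
        idx = INDEX.get(seq[i:i + 2])
        if idx is None:
            # Assumption: windows with non-standard symbols (e.g. X) are skipped.
            continue
        counts[idx] += 1
        total += 1
    if normalize and total > 0:
        # Assumption: divide by the number of counted windows to get frequencies.
        counts = [c / total for c in counts]
    return counts


vec = dipeptide_composition("MKVLAAGKV", normalize=False)
print(len(vec), vec[INDEX["KV"]], vec[INDEX["AA"]])
```

For the toy sequence `MKVLAAGKV` the eight overlapping windows are MK, KV, VL, LA, AA, AG, GK and KV, so the dipeptide KV is counted twice. Because the window overlaps, each residue contributes to two neighboring 2-grams, which is how some local sequence-order information enters the feature vector.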
The k-nearest neighbors (k-NN) algorithm is a simple
non-parametric classification algorithm (Duda et al., 2000).
Despite its simplicity, it can give competitive performance
compared with many other methods. It is widely used in machine
learning and has numerous variations. Given a test sample of
unknown label, it finds the k nearest neighbors in the training set and assigns
a label to the test sample according to the labels of those
neighbors. In biological and medical data classification problems, combining fuzzy set theory with the k-NN algorithm can
often improve classification performance (Keller et al., 1985;
Bezdek et al., 1993; Leszczynski et al., 1999). Zhang et al.
(1995) have also used fuzzy clustering to predict protein structural class. Therefore, we used the fuzzy k-NN algorithm to
Measurement accuracy
We use the jackknife test for cross-validation. In comparison with
the subsampling test or the independent data set test, the jackknife
test is thought to be more rigorous and reliable (Mardia et al.,
1979). Chou and Zhang (1995) also provided a comprehensive
discussion of this problem. During the jackknife
test, each protein is singled out in turn as a test sample, and the
remaining proteins are used as the training set to calculate the test
sample's membership and predict its class. The prediction
quality was evaluated by the overall prediction accuracy and
the prediction accuracy for each location:
\[ \text{overall accuracy} = \frac{\sum_{s=1}^{k} p(s)}{N} \tag{2} \]
\[ \text{accuracy}(s) = \frac{p(s)}{\text{obs}(s)} \tag{3} \]
where N is the total number of sequences, k is the class number, obs(s) is the number of sequences observed in location
s and p(s) is the number of correctly predicted sequences in
location s.
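The jackknife procedure and Equations (2) and (3) can be sketched as follows. This is an illustrative outline under stated assumptions: the function names are hypothetical, and the toy 1-nearest-neighbor classifier on one-dimensional integer features merely stands in for the actual predictor so that the example is self-contained.

```python
from collections import Counter


def jackknife_predictions(samples, labels, classify):
    """Leave-one-out: predict each sample from all the remaining samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(classify(train_x, train_y, samples[i]))
    return predictions


def accuracies(labels, predictions):
    """Overall accuracy (Eq. 2) and per-location accuracy (Eq. 3)."""
    obs = Counter(labels)  # obs(s): sequences observed in location s
    p = Counter(y for y, y_hat in zip(labels, predictions) if y == y_hat)  # p(s)
    overall = sum(p.values()) / len(labels)
    per_location = {s: p[s] / obs[s] for s in obs}
    return overall, per_location


def nearest_neighbor(train_x, train_y, x):
    """Toy 1-NN stand-in classifier (illustration only)."""
    j = min(range(len(train_x)), key=lambda idx: abs(train_x[idx] - x))
    return train_y[j]


xs = [0, 2, 4, 20, 22, 3]
ys = ["cyto", "cyto", "cyto", "nuc", "nuc", "nuc"]
preds = jackknife_predictions(xs, ys, nearest_neighbor)
overall, per_loc = accuracies(ys, preds)
print(preds, overall, per_loc)
```

Each sample is classified by a model that has never seen it, so a data set of N proteins requires N separate predictions. The per-location accuracies can differ sharply from the overall figure when class sizes are unbalanced, which is why both are reported.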
Another measure of prediction accuracy is the Matthews correlation coefficient (MCC) (Matthews, 1975) between the
observed and predicted locations over a data set, given by:
\[ \mathrm{MCC}(s) = \frac{p(s)\,n(s) - u(s)\,o(s)}{\sqrt{[p(s)+u(s)]\,[p(s)+o(s)]\,[n(s)+u(s)]\,[n(s)+o(s)]}} \tag{4} \]
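Equation (4) can be checked numerically with a short sketch. The definitions of n(s), u(s) and o(s) do not survive in the text here, so the code assumes the usual reading: n(s) counts sequences correctly predicted as not belonging to location s, u(s) counts under-predicted sequences (observed in s but assigned elsewhere) and o(s) counts over-predicted sequences (assigned to s but observed elsewhere). The function name `mcc` is hypothetical. The counts are the cytoplasm figures for Data_80 taken from the confusion matrix reported with the results.

```python
from math import sqrt


def mcc(p, n, u, o):
    """Matthews correlation coefficient for one location (Eq. 4)."""
    denom = sqrt((p + u) * (p + o) * (n + u) * (n + o))
    # Convention (an assumption): return 0 when the denominator vanishes.
    return (p * n - u * o) / denom if denom > 0 else 0.0


# Cytoplasm on Data_80.
N = 7203          # total number of sequences
obs = 1251        # sequences observed in the cytoplasm
p = 878           # correctly predicted cytoplasmic sequences
# Sequences predicted as cytoplasmic, summed over all observed locations.
predicted = sum([878, 107, 71, 43, 5, 26, 19, 1, 3, 11, 5])

u = obs - p           # under-predicted (assumed definition)
o = predicted - p     # over-predicted (assumed definition)
n = N - p - u - o     # correctly predicted as non-cytoplasmic (assumed definition)

print(round(mcc(p, n, u, o), 2))
```

Under these assumed definitions the value rounds to 0.67, which agrees with the MCC listed for cytoplasm on Data_80. Unlike the accuracy of Equation (3), the MCC penalizes over-prediction as well as under-prediction, so a location that attracts many false assignments scores lower even if most of its own sequences are recovered.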
predict subcellular locations. This method assigns fuzzy memberships of samples to different categories rather than a particular class as in k-NN. Here class memberships are assigned
to the test sample, according to the following relationship:
\[ u_i(x) = \frac{\sum_{j=1}^{k} u_i(x^{(j)})\, \|x - x^{(j)}\|^{-2/(m-1)}}{\sum_{j=1}^{k} \|x - x^{(j)}\|^{-2/(m-1)}}, \qquad i = 1, \ldots, c \tag{1} \]
where m is a fuzzy strength parameter, which determines how
heavily the distance is weighted when calculating each neighbor's contribution to the membership value, k
is the number of nearest neighbors and $u_i(x)$ is the membership
of the test sample x in class i. $\|x - x^{(j)}\|$ is the distance
between the test sample x and its j-th nearest training sample $x^{(j)}$.
Various distance measures can be used, such as the Euclidean,
absolute and Mahalanobis distances. In the present
study, we used the Euclidean distance. $u_i(x^{(j)})$ is
the membership value of the j-th neighbor in the i-th class,
which can be assigned in several ways. The crispest way is to
assign 1 if $x^{(j)}$ belongs to the i-th class and 0 otherwise. A
fuzzier alternative is to assign the training samples'
memberships based on the k-NN rule. In our analysis, we
define the memberships in the crisp way. After the memberships
for the test sample have been calculated, it is assigned to the class
with the highest membership value.
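The membership rule of Equation (1) with crisp training memberships and the Euclidean distance can be sketched as below. This is a minimal illustration rather than the authors' code: the function name `fuzzy_knn` and the toy data are hypothetical. One implementation detail is an assumption added for numerical safety: all distances are rescaled by the smallest neighbor distance before the exponent is applied. The rescaling multiplies numerator and denominator of Equation (1) by the same constant, so the memberships are unchanged, but it avoids overflow when m is close to 1 (for m = 1.05 the exponent is -40) and gives a defined result when a neighbor coincides with the test sample.

```python
from math import dist


def fuzzy_knn(train_x, train_y, x, k=3, m=1.05):
    """Return (predicted_class, memberships) for test sample x, following Eq. (1).

    Requires m > 1. Training memberships are crisp: 1 for the sample's own
    class and 0 for every other class.
    """
    # The k nearest training samples by Euclidean distance.
    neighbors = sorted((dist(t, x), y) for t, y in zip(train_x, train_y))[:k]
    d_min = neighbors[0][0]
    exponent = 2.0 / (m - 1.0)

    weights = []
    for d, _ in neighbors:
        if d_min == 0.0:
            # Exact matches take all the weight (assumed tie-handling rule).
            weights.append(1.0 if d == 0.0 else 0.0)
        else:
            # Proportional to d ** (-2 / (m - 1)), rescaled so the nearest is 1.
            weights.append((d_min / d) ** exponent)

    total = sum(weights)
    memberships = {}
    for w, (_, y) in zip(weights, neighbors):
        memberships[y] = memberships.get(y, 0.0) + w / total

    predicted = max(memberships, key=memberships.get)
    return predicted, memberships


train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
train_y = ["cyto", "cyto", "nuc", "nuc", "nuc"]
label, mu = fuzzy_knn(train_x, train_y, (0.2, 0.1), k=3, m=2.0)
print(label, mu)
```

Classes with no representative among the k neighbors are simply absent from the returned dictionary, which corresponds to a membership of zero. As m approaches 1 the nearest neighbor dominates the sum, while larger values of m spread the membership more evenly over the k neighbors.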
                         Data_80                 Data_50
Cellular location        Accuracy (%)    MCC     Accuracy (%)    MCC
Cytoplasm                        70.2   0.67             35.4   0.31
Nuclear                          81.9   0.78             71.5   0.58
Mitochondria                     59.0   0.62             36.6   0.30
Extracellular                    93.7   0.79             81.6   0.54
Golgi apparatus                  16.1   0.32             15.4   0.27
Chloroplast                      84.7   0.80             32.4   0.36
Endoplasmic reticulum            57.3   0.71             11.1   0.22
Cytoskeleton                     40.0   0.57             28.6   0.44
Vacuole                          34.1   0.55              6.9   0.16
Peroxisome                       56.8   0.68             14.9   0.27
Lysosome                         67.5   0.74             20.5   0.31
Overall accuracy                 80.1                    58.1
         Cytop    Nuc    Mit    Ext   Gol   Chl   Endo   Cytos   Vac   Pero   Lyso    Sum
Cytop      878    136     51    146     1    29      2       0     0      5      3   1251
Nuc        107   1762     44    195     2    40      1       1     0      0      0   2152
Mit         71     37    408    119     0    54      0       0     0      3      0    692
Ext         43     53     19   2001     0    11      0       0     2      0      6   2135
Gol          5      7      3      9     5     0      2       0     0      0      0     31
Chl         26     11     30     30     0   546      0       0     0      2      0    645
Endo        19      6      2      5     0     2     47       0     0      0      1     82
Cytos        1      4      0      1     0     0      0       4     0      0      0     10
Vac          3      6      2     12     0     1      1       0    14      0      2     41
Pero        11      4      6      8     0     6      0       0     0     46      0     81
Lyso         5      5      2     14     0     1      0       0     0      0     56     83
Sum       1169   2031    567   2540     8   690     53       5    16     56     68   7203
Fig. 1. The dependence of the overall prediction accuracy on the number of nearest neighbors, k, used in the fuzzy k-NN classification (fuzzy
strength parameter m = 1.05). These results were obtained on Data_80 using the Euclidean distance measure.
Fig. 2. Average predictive accuracy related to RI. We also give fractions of sequences with various RI values. For example, about 5% of all
sequences have RI = 9, and of these sequences about 80% are correctly classified. The figure is based on Data_80.
Organism   Total^b   MEMBRANE         NUCLEAR          CYTOPLA        MITOCH        EXTRACELL        CHLOROPLAST
ORYSA         9420   1662 (17.6%^c)   2088 (22.2%)     1044 (11.1%)   513 (5.5%)    3410 (36.2%)     622 (6.6%)
ARATH       36 528   8282 (22.7%)     7431 (20.3%)     5122 (14.0%)   2086 (5.7%)   10 781 (29.5%)   2499 (6.8%)
YEAST         6905   1489 (21.6%)     1997 (28.9%)     876 (12.7%)    470 (6.8%)    1795 (26.0%)     221 (3.2%)
CAEEL       20 887   6572 (31.5%)     4316 (20.7%)     2405 (11.5%)   898 (4.3%)    5833 (27.9%)     744 (3.6%)
DROME       19 978   4294 (21.5%)     5914 (29.6%)     2558 (12.8%)   1018 (5.1%)   5380 (26.9%)     587 (2.9%)
HOMO        44 402   9146 (20.6%)     10 081 (22.7%)   4639 (10.5%)   1561 (3.5%)   17 978 (40.5%)   615 (1.4%)
Abbreviations for organisms: ORYSA: Oryza sativa; ARATH: Arabidopsis thaliana; YEAST: Saccharomyces cerevisiae; CAEEL: Caenorhabditis elegans; DROME: Drosophila melanogaster; HOMO: Homo sapiens.
a Only annotations of six major subcellular locations are listed.
b Number of protein sequences in the SWISS-PROT + TrEMBL databank.
c Fraction of total proteins.
CONCLUSION
In this paper, a fuzzy k-NN method based on the dipeptide
composition of proteins was proposed for the prediction of subcellular locations. An advantage of the new method is that it incorporates
sequence-order effects into the prediction. The method was
applied to a new data set derived from version 41.0 of the
SWISS-PROT databank, and high predictive accuracy was
achieved in a jackknife test. This indicates that extracting
Fig. 3. Average prediction accuracy was also calculated cumulatively with RI above a given value. For example, about 75% of all sequences
have RI ≥ 6, and of these sequences about 92% are correctly predicted. The result is based on Data_80.
                 Markov model           SVM                    Fuzzy k-NN^a
Location         Accuracy (%)    MCC    Accuracy (%)    MCC    Accuracy (%)    MCC
Cytoplasmic              78.1   0.60            76.9   0.64            86.7   0.76
Extracellular            62.2   0.63            80.0   0.78            83.7   0.87
Mitochondrial            69.2   0.53            56.7   0.58            60.4   0.63
Nuclear                  74.1   0.68            87.4   0.75            92.0   0.83
Total accuracy           73.0                   79.4                   85.2
a In this test, the fuzzy strength parameter m = 1.05 and the number of nearest neighbors k = 20.
ACKNOWLEDGEMENTS
The authors would like to thank Dr A. Reinhardt for providing his data set. Thanks to Jun Cai for helpful discussions and
Dr Liang Ji for valuable comments on the manuscript. We also
thank the anonymous reviewers for their helpful comments.
This work was funded by the National Natural Science Grant
in China (Nos 60171038 and 60234020) and the National
Basic Research Priorities Program of the Ministry of Science
and Technology (No. 2001CCA0). Y.H. also thanks Tsinghua
University Ph.D. Grant for the support.
REFERENCES
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bezdek,J.C., Hall,L.O. and Clarke,L.P. (1993) Review of MR image segmentation techniques using pattern recognition. Med. Phys., 20, 1033–1048.
Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I., Pilbout,S. and Schneider,M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
Cedano,J., Aloy,P., Perez-Pons,J.A. and Querol,E. (1997) Relation between amino acid composition and cellular location of proteins. J. Mol. Biol., 266, 594–600.
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002) Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J. Cell. Biochem., 84, 343–348.
Chou,K.C. (1995) A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins Struct. Funct. Genet., 21, 319–344.
Chou,K.C. (2000a) Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem. Biophys. Res. Commun., 278, 477–483.
Chou,K.C. (2000b) Review: prediction of protein structural classes and subcellular locations. Curr. Protein Peptide Sci., 1, 171–208.
Chou,K.C. (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Genet., 43, 246–255.
Chou,K.C. (2002a) Prediction of protein signal sequences. Curr. Protein Peptide Sci., 3, 615–622.
Chou,K.C. (2002b) A new branch of proteomics: prediction of protein cellular attributes. In Weinrer,P.W. and Lu,Q. (eds), Gene Cloning & Expression Technologies, Chapter 4. Eaton Publishing, Westborough, MA, pp. 57–70.
Chou,K.C. and Cai,Y.D. (2002) Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem., 277, 45765–45769.
Chou,K.C. and Elrod,D.W. (1998) Using discriminant function for prediction of subcellular location of prokaryotic proteins. Biochem. Biophys. Res. Commun., 252, 63–68.
Chou,K.C. and Elrod,D.W. (1999a) Protein subcellular location prediction. Protein Eng., 12, 107–118.
Chou,K.C. and Elrod,D.W. (1999b) Prediction of membrane protein types and subcellular locations. Proteins Struct. Funct. Genet., 34, 137–153.
Chou,K.C. and Zhang,C.T. (1995) Review: prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol., 30, 275–349.
Drawid,A. and Gerstein,M. (2000) A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J. Mol. Biol., 301, 1059–1075.
Duda,R.O., Hart,P.E. and Stork,D.G. (2000) Pattern Classification, 2nd edn. Wiley, New York.
Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne,G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005–1016.
van Heel,M. (1991) A new family of powerful multivariate statistical sequence analysis techniques. J. Mol. Biol., 220, 877–887.
von Heijne,G. (1992) Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol., 225, 487–494.
Hirokawa,T., Boon-Chieng,S. and Shigeki,M. (1998) SOSUI: classification and secondary structure prediction system for membrane proteins. Bioinformatics, 14, 378–379.
Hua,S. and Sun,Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721–728.
Keller,J.M., Gray,M.R. and Givens,J.A. (1985) A fuzzy k-nearest neighbour algorithm. IEEE Trans. Syst. Man Cybern., 15, 580–585.
Nakashima,H. and Nishikawa,K. (1994) Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol., 238, 54–61.
Nielsen,H., Engelbrecht,J., Brunak,S. and von Heijne,G. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10, 1–6.
Nielsen,H., Brunak,S. and von Heijne,G. (1999) Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng., 12, 3–9.
Reinhardt,A. and Hubbard,T. (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res., 26, 2230–2236.
Rost,B., Fariselli,P. and Casadio,R. (1996) Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci., 5, 1704–1718.
Yuan,Z. (1999) Prediction of protein subcellular locations using Markov chain models. FEBS Lett., 451, 23–26.
Wang,H.C., Dopazo,J., de la Fraga,L.G., Zhu,Y.P. and Carazo,J.M. (1998) Self-organizing tree-growing network for the classification of protein sequences. Protein Sci., 7, 2613–2622.
Wu,C., Whitson,G., McLarty,J., Ermongkonchai,A. and Chang,T.C. (1992) Protein classification artificial neural system. Protein Sci., 1, 667–677.
Zhang,C.T., Chou,K.C. and Maggiora,G.M. (1995) Predicting protein structural classes from amino acid composition: application of fuzzy clustering. Protein Eng., 8, 425–435.
Zhou,G.P. and Assa-Munt,N. (2001) Some insights into protein structural class prediction. Proteins Struct. Funct. Genet., 44, 57–59.
Zhou,G.P. and Doctor,K. (2003) Subcellular location prediction of apoptosis proteins. Proteins Struct. Funct. Genet., 50, 44–48.