
BIOINFORMATICS APPLICATIONS NOTE

Sequence analysis

Vol. 21 no. 7 2005, pages 1269-1270 doi:10.1093/bioinformatics/bti130

NetAcet: prediction of N-terminal acetylation sites


Lars Kiemer, Jannick Dyrløv Bendtsen and Nikolaj Blom
Center for Biological Sequence Analysis, BioCentrum-DTU, Building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark
Received on September 03, 2004; revised and accepted on October 28, 2004
Advance Access publication November 11, 2004

Downloaded from http://bioinformatics.oxfordjournals.org/ at Makerere University on November 25, 2011

ABSTRACT
Summary: We present here a neural network-based method for prediction of N-terminal acetylation, by far the most abundant post-translational modification in eukaryotes. The method was developed on a yeast dataset for N-acetyltransferase A (NatA) acetylation, which is the type of N-acetylation for which most examples are known and for which orthologs have been found in several eukaryotes. We obtain correlation coefficients close to 0.7 on yeast data and a sensitivity of up to 74% on mammalian data, suggesting that the method is valid for eukaryotic NatA orthologs.
Availability: The NetAcet prediction method is available as a public web server at http://www.cbs.dtu.dk/services/NetAcet/
Contact: nikob@cbs.dtu.dk
Supplementary information: http://www.cbs.dtu.dk/services/NetAcet/

INTRODUCTION

Most proteins undergo post-translational modifications (PTMs), for example the addition of chemical groups, as in acetylation or glycosylation, or the removal of one or more amino acids by maturation or signal peptide cleavage. N-terminal acetylation is one of the most common modifications found in eukaryotes and is also found, although less frequently, in archaea and bacteria. N-terminal acetylation occurs co-translationally on eukaryotic cytoplasmic proteins, and its prevalence is estimated at 80-90% in mammals and 50% in yeast (Polevoda and Sherman, 2000, 2003). Although the modification is common, its prediction has been extremely difficult to approach owing to the lack of data and of a clear consensus motif (Polevoda and Sherman, 2003). Almost 20 years ago, an attempt was made to predict N-terminal acetylation in general, in which the predicted protein secondary structure was used as input to a linear neural network (Augen and Wold, 1986). The performance of that model cannot be evaluated, however, as it was constructed and tested on the same dataset. With yeast being one of the most thoroughly studied eukaryotes, a sufficient amount of data has accumulated to allow for the training of a prediction method for NatA acetylation. Yeast is sufficiently important to warrant the construction of a prediction server, but additionally, the predictor seems to obtain comparable performance values on mammalian proteins. This supports the idea that the N-terminal acetylation systems are similar in all eukaryotes.

The method presented here only deals with NatA N-terminal acetylation and not acetylation on the ε-amino group of internal lysine residues by other acetyltransferases.

METHODOLOGY

All training data were extracted from Table 2 in Polevoda and Sherman (2003) and combined with data from the Yeast Protein Map (YPM) resource (Perrot et al., 1999). Any inconsistencies between the two datasets were removed to obtain the highest quality data possible. Furthermore, we extracted only substrates reported to be acetylated by NatA, as this is the only transferase on which a sufficient amount of data has been accumulated so far. This resulted in 61 positive and 76 negative training sequences (Fig. 1). Sequences were truncated to their N-terminal 40 residues and subsequently homology reduced by visual inspection of a neighbour-joining tree generated from a ClustalW multiple alignment (Thompson et al., 1994). Four sequences from the positive training data and four sequences from the negative training data were removed due to homology. Following this reduction, the two closest homologues were 52% identical, although the average homology was much lower (data files and trees are available as Supplementary Material from http://www.cbs.dtu.dk/services/NetAcet/).

An artificial neural network was trained using three-fold cross-validation. As negative examples, all positions in the dataset, except those known to be acetylated, were used. For evaluation purposes, however, only negative examples having either serine, threonine, alanine or glycine in the first position of the network window were used, as the other types were trivial. The neural network used in this work was of the standard feed-forward type, and sparse encoding was used for translating the amino acids to data input for the networks, as has been described previously (Blom et al., 1996; Nielsen et al., 1997).
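The sparse encoding mentioned above can be illustrated with a minimal sketch. It assumes a 20-letter amino acid alphabet with one binary input per letter and position, and uses the seven-residue window reported in the Results; the function names, the alphabet ordering, the all-zero treatment of unknown residues and the example peptide are hypothetical and are not taken from the NetAcet implementation.

```python
from typing import List

# Assumption: 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 7  # window size reported in the Results section


def sparse_encode(window: str) -> List[int]:
    """Sparse (one-hot) encode a peptide window into a flat 0/1 vector.

    Each position contributes 20 inputs, exactly one of which is 1.
    Residues outside the alphabet (e.g. X) are left as all zeros,
    which is an assumption made for this sketch.
    """
    if len(window) != WINDOW:
        raise ValueError(f"expected a window of {WINDOW} residues")
    vector: List[int] = []
    for residue in window:
        bits = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            bits[index] = 1
        vector.extend(bits)
    return vector


def is_evaluation_candidate(window: str) -> bool:
    """Keep only windows starting with S, T, A or G, as done for evaluation."""
    return window[0] in "STAG"


if __name__ == "__main__":
    example = "SDKTEAR"  # hypothetical peptide, not from the training data
    encoded = sparse_encode(example)
    print(len(encoded), sum(encoded), is_evaluation_candidate(example))
```

With this layout a seven-residue window yields 7 x 20 = 140 network inputs, of which at most seven are set.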

RESULTS AND DISCUSSION

To whom correspondence should be addressed.

Using a network window size of seven amino acids (corresponding to positions 1-7 in Fig. 1) and eight hidden neurons we were able to obtain a Matthews correlation coefficient (Matthews, 1975) of 0.69 when using a threshold of 0.5. This correlation coefficient reflects a sensitivity of 75% and a specificity of 92% (Fig. 2). A smaller or larger window size gave lower specificity. As expected, specificity on negative examples with a serine residue at position 1 is lower (60%) than average, reflecting that these sequences are more difficult due to the bias of the positive examples (Fig. 1). Although a linear neural network (i.e. without hidden neurons) obtained a

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org


[Figure 2 plot: correlation coefficient, sensitivity and specificity (y-axis, 0 to 0.8) versus threshold (x-axis, 0.45 to 0.52).]

Fig. 2. Method performance, showing specificity, sensitivity and Matthews correlation coefficient on yeast test data. Values were plotted versus neural network output threshold.

website further to include prediction methods for other types of acetylations as sufficient data become available.
Fig. 1. Shannon information (Shannon, 1948) sequence logo of 57 yeast NatA acetylation sites (upper) and a Kullback-Leibler (Cover and Thomas, 1991) logo (lower) constructed from both the 57 known sites and from 55 known non-acetylated N-terminal sites containing S/T/A/G in the first position (used as background distribution). Acetylation is reported on position 1 in the logos. Data were extracted as stated in the Methodology section. The height of the columns of letters reflects the degree of sequence conservation for the positive examples in the upper logo. In the lower logo the column height reflects the discrepancy between the positive and negative examples. Note that the bit scale differs in the two logos. Sequence logos were constructed as described by Schneider and Stephens (1990).
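The two column heights described in the caption can be sketched as follows. This is an illustration of the standard formulas only: the Shannon height is taken as log2(20) minus the column entropy without any small-sample correction, and the Kullback-Leibler height uses a pseudocount to avoid division by zero. The function names and the pseudocount are assumptions and do not describe how the published logos were produced.

```python
import math
from collections import Counter
from typing import Dict

# Assumption: columns contain only the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(column: str, pseudocount: float = 0.0) -> Dict[str, float]:
    """Relative frequency of each amino acid in one alignment column."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def shannon_height(column: str) -> float:
    """Information content in bits: log2(20) minus the Shannon entropy."""
    freqs = column_frequencies(column)
    entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return math.log2(len(AMINO_ACIDS)) - entropy


def kullback_leibler_height(positives: str, background: str,
                            pseudocount: float = 1.0) -> float:
    """Relative entropy in bits between positive and background columns."""
    p = column_frequencies(positives, pseudocount)
    q = column_frequencies(background, pseudocount)
    return sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS)


if __name__ == "__main__":
    # Hypothetical columns, one letter per sequence at a given position.
    print(shannon_height("SSSSAT"))
    print(kullback_leibler_height("SSSSAT", "SATGGA"))
```

A fully conserved column reaches the maximum Shannon height of log2(20), about 4.32 bits, whereas the Kullback-Leibler height is zero whenever the positive and background columns have the same composition, which is why the two logos use different bit scales.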

ACKNOWLEDGEMENTS
This work was supported by grants from the Danish National Research Foundation, the Danish Natural Science Research Council, the Danish Center for Scientific Computing, the European Union BioSapiens Network of Excellence (to J.D.B.), and NeuroSearch A/S (to L.K.).

REFERENCES
Apweiler,R., Bairoch,A., Wu,C.H., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M. et al. (2004) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 32(Database issue), D115-D119.
Augen,J. and Wold,F. (1986) How much sequence information is needed for the regulation of amino-terminal acetylation of eukaryotic proteins? Trends Biochem. Sci., 11, 494-497.
Blom,N., Hansen,J., Blaas,D. and Brunak,S. (1996) Cleavage site analysis in picornaviral polyproteins: discovering cellular targets by neural networks. Protein Sci., 5, 2203-2216.
Cover,T.M. and Thomas,J.A. (1991) Elements of Information Theory. John Wiley and Sons, Inc, New York.
Matthews,B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442-451.
Nielsen,H., Engelbrecht,J., Brunak,S. and von Heijne,G. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10, 1-6.
Perrot,M., Sagliocco,F., Mini,T., Monribot,C., Schneider,U., Shevchenko,A., Mann,M., Jeno,P. and Boucherie,H. (1999) Two-dimensional gel protein database of Saccharomyces cerevisiae (update 1999). Electrophoresis, 20, 2280-2298.
Polevoda,B. and Sherman,F. (2000) N-terminal acetylation of eukaryotic proteins. J. Biol. Chem., 275, 36479-36482.
Polevoda,B. and Sherman,F. (2003) N-terminal acetyltransferases and sequence requirements for N-terminal acetylation of eukaryotic proteins. J. Mol. Biol., 325, 595-622.
Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097-6100.
Shannon,C.E. (1948) A mathematical theory of communication. Bell System Tech. J., 27, 379-423/623-656.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.

comparable correlation coefficient, a more sophisticated network containing eight hidden neurons was preferred as this performed far better on negative examples containing a serine residue at position 1 (the number of false-positive predictions on serine residues dropped from 12 to 6). On an independent test set of mammalian N-acetylated proteins, which was created by extraction from UniProt (Apweiler et al., 2004), we obtained a sensitivity of 74% on acetylated serines (71 were found of 96 possible). While the figures for serine acetylation prediction are comparable in yeast and mammalian data, we obtained a lower performance on other types of substrates, which we attribute to the relatively few examples of such sites available in the yeast training data. However, it does seem that yeast NatA and mammalian NatA orthologs share properties in their substrate specificity.
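The performance measures quoted in this section can be reproduced from confusion counts, as in the following minimal sketch. It assumes that a site is predicted positive when the network output is at or above the threshold, and that specificity is the fraction of negative examples predicted correctly; the text does not spell out either convention, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(scores: Sequence[float], labels: Sequence[int],
                     threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for network outputs at a given threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold  # assumption: ties count as positive
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and not label:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true sites that are predicted."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of negative examples predicted as negative (assumed definition)."""
    return tn / (tn + fp)


def matthews_cc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient (Matthews, 1975); 0.0 if undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


if __name__ == "__main__":
    # Mammalian serine result from the text: 71 of 96 acetylated serines found.
    print(f"{sensitivity(71, 96 - 71):.0%}")
```

For the mammalian test set, 71 correctly predicted sites out of 96 give 71/96 ≈ 0.74, matching the reported sensitivity of 74%.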

CONCLUSION

The method presented here predicts acetylation sites of NatA in yeast with high performance and also, to a certain extent, those in mammals. We believe that the method will be highly useful to researchers working with acetylation and will facilitate the ongoing work on proteome annotation. The prediction server and additional information are available at http://www.cbs.dtu.dk/services/NetAcet/. We plan to evolve this
