
Multiclass Protein Fold Recognition


SMAI Final Project Report

Achyuth Narayan S, Sri Harsha Vogeti, Sricharan Thota
201164157, 201064120, 201256001

Abstract: Protein fold recognition is an important approach to structure discovery. It gives us a way of classifying protein functions without relying on sequence similarity. We study this approach with new multiclass classification methods and examine how accurately they determine the functions of the proteins being tested.

In this paper we predict protein folds using a variety of features and the one-vs-all method; both improve prediction accuracy to a good extent. We used a dataset containing 27 SCOP folds, with the Support Vector Machine (SVM) as the base classifier: SVMs converge fast in training and achieve high accuracy. When the scores of multiple parameter datasets are combined, majority voting reduces noise and increases recognition accuracy. We examined several issues that arise with a large number of classes, including the dependence of prediction accuracy on the number of folds and on predicted secondary structures, and we used new features, such as unigrams and bigrams, to improve the prediction accuracy of the system. Overall, our recognition system achieved 50.65% fold prediction accuracy on a protein test dataset in which most proteins have below 25% sequence identity with the proteins used in training.

The protein parameter datasets used in this paper are available online at:

Keywords: SVM; dataset; protein; classification; function

I. INTRODUCTION

Computational analysis of biological data obtained in genome sequencing and other projects is essential for understanding cellular function and for the discovery of new drugs and therapies. Sequence-sequence and sequence-structure comparisons play a critical role in predicting a possible function for new sequences. When sequence similarity is low, however, sequence-sequence comparison is of no use; this paper addresses that problem using protein fold recognition. Pairwise sequence alignment is accurate in detecting close evolutionary relationships between proteins (Holm and Sander, 1999), but it is not effective when two proteins are structurally similar yet have no significant sequence similarity. The threading (protein fold recognition) approach has demonstrated promising results in detecting the latter type of relationship (Jones, 1999).

In this paper, we focus on the taxonometric approach to determining structural similarity without sequence similarity, using machine learning methods (Baldi and Brunak, 1998; Durbin et al., 1998), in particular Support Vector Machines (SVMs). This approach has achieved some success, mostly through recognition of protein folds; a fold is a common three-dimensional pattern with the same major secondary structure elements in the same arrangement and with the same topological connections (Craven et al., 1995). The taxonometric approach presumes that the number of folds is restricted, so the focus is on structural predictions in the context of a particular classification of 3-D folds. Detailed, comprehensive protein classifications such as SCOP (Lo Conte et al., 2000) and CATH (Pearl et al., 2000) identify more than 600 3-D protein folding patterns. This high number of classes makes fold recognition an extremely challenging problem: the more classes are involved, the more difficult it is to accurately predict the fold of a query sequence. We use a one-vs-all method to compare between the protein classes. Methods based on all pairs of individual classes, by contrast, require building a very large number of discriminative classifiers (about 84 000 in our database of 27 folds). We overcome this difficulty by using the recently developed SVM. SVM is a discriminative method (Vapnik, 1995) that has demonstrated high classification accuracy in protein family (evolutionary relationship) prediction (Jaakkola et al., 1999), gene expression classification (Brown et al., 2000), and many other areas beyond molecular biology. An advantage of SVM is its fast convergence in training, roughly 10-100 times faster than neural networks. Using SVM training we were able to classify the protein classes and accurately infer the function of a novel protein by comparison with the known proteins.
II. APPROACH & METHODOLOGY

A. Support Vector Machine

SVM was used as the classifier in our project. SVM is a promising binary classification method developed by Vapnik and colleagues at Bell Laboratories (Vapnik, 1995; Burges, 1998), with algorithmic improvements by others (Osuna et al., 1997; Joachims, 1998). SVM is a margin classifier: it draws an optimal hyperplane in a high-dimensional feature space.


Figure 1: 1CGT protein; 1XYZ protein

SVM defines a boundary that maximizes the margin between data samples of two classes, which gives it good generalization properties. In our project, experiments were conducted with RBF and linear kernels; we found that the RBF kernel performs better than the linear kernel in all of the experiments conducted.
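As an illustrative sketch (not the project's exact code), the kernel comparison can be reproduced with scikit-learn; here X_train, X_test, y_train, and y_test are assumed to already hold the feature vectors and fold labels described below, and the C and gamma values are placeholders rather than tuned parameters.

```python
# Illustrative kernel comparison; X_train/X_test are feature matrices and
# y_train/y_test are fold labels (1-27), assumed to be prepared already.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=10.0, gamma="scale")  # placeholder parameters
    clf.fit(X_train, y_train)
    print(kernel, accuracy_score(y_test, clf.predict(X_test)))
```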

We used the one-vs-all method for classification. The one-vs-one approach to multiclass classification takes time of order O(c^2) for both training and testing, where c is the number of classes. The time taken by the one-vs-one approach is therefore very high, although it can have improved accuracy over one-vs-all. The one-vs-all approach, on the other hand, takes time of order O(c) for both training and testing, which makes it well suited to fold prediction systems.
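The trade-off can be sketched with scikit-learn's multiclass wrappers, where OneVsRestClassifier implements the one-vs-all scheme; this is a sketch under the same assumed X_train/y_train as above, not the project's exact code.

```python
# One-vs-all trains c binary classifiers; one-vs-one trains c*(c-1)/2.
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X_train, y_train)
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X_train, y_train)
print(len(ovr.estimators_), len(ovo.estimators_))  # 27 vs 351 for 27 folds
```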
B. Data Set

The dataset used in our project was provided by Professor Hong-Bin Shen of Shanghai Jiao Tong University. The training dataset consists of 317 proteins belonging to 27 different folds. In the training data, no two proteins share more than 35% sequence identity for aligned subsequences longer than 80 residues. The test data consist of 383 proteins belonging to the same 27 folds represented in the training data; the test sequences have less than 40% identity with each other, and the sequence similarity between test and training proteins is less than 25% in most cases.

The training data are very skewed: one of the labels has only 8 samples, while another has 30. The graph below shows the number of samples versus the fold number (label); the folds represented in the project, along with their corresponding labels, are tabulated in Figure 2.

Graph: Sample Number vs Folds in the Dataset

To remove the imbalance, we introduced class weights while training. Ten-fold cross-validation was also performed, and the best parameters found were used for training.
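A minimal sketch of the weighting and parameter search, assuming X_train and y_train hold the training vectors and fold labels; the grid values are illustrative, not the parameters actually selected.

```python
# class_weight='balanced' reweights each fold inversely to its frequency,
# countering the 8-vs-30 sample imbalance; GridSearchCV does the 10-fold CV.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf", class_weight="balanced"), grid, cv=10)
search.fit(X_train, y_train)
print(search.best_params_)
```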
C. Features

Amino acid composition: This feature deals with the composition of the protein and its amino acid sequence. It is the most significant feature, as the similarity of the sequence gives us a lot of detail about the probable function of the protein. Even though we have restricted the similarity of proteins to 25%, the amino acid composition of the protein is still an important feature.

Figure 2: Classes and Proteins in the respective classes


Predicted secondary structure: Secondary structure prediction is a set of techniques in bioinformatics that aim to predict the secondary structures of proteins and nucleic acid sequences based only on knowledge of their primary structure. For proteins, this means predicting the formation of structures such as alpha helices and beta strands; for nucleic acids, it means predicting the formation of structures such as helices and stem-loops through base pairing and base stacking interactions.
Hydrophobicity: In protein folding, the hydrophobic effect is important for understanding the structure of proteins in which hydrophobic amino acids (such as alanine, valine, leucine, isoleucine, phenylalanine, tryptophan, and methionine) cluster together within the protein. Structures of water-soluble proteins have a hydrophobic core in which side chains are buried away from water, which stabilizes the folded state, while charged and polar side chains sit on the solvent-exposed surface, where they interact with surrounding water molecules. Minimizing the number of hydrophobic side chains exposed to water is the principal driving force behind the folding process, although the formation of hydrogen bonds within the protein also stabilizes protein structure.

Figure 3. Features in Set 1 along with the Dimensions


Normalized van der Waals volume: The van der Waals volume, V_w, also called the atomic or molecular volume, is the atomic property most directly related to the van der Waals radius. It is the volume "occupied" by an individual atom (or molecule). The van der Waals volume may be calculated if the van der Waals radii (and, for molecules, the inter-atomic distances and angles) are known. For a single spherical atom, it is the volume of a sphere whose radius is the van der Waals radius of the atom:

V_w = (4/3) π r_w³

For a molecule, it is the volume enclosed by the van der Waals surface. The van der Waals volume of a molecule is always smaller than the sum of the van der Waals volumes of its constituent atoms: the atoms can be said to "overlap" when they form chemical bonds.

Polarity: Proteins are polymers whose individual subunits (amino acids joined by amide linkages) may be highly polar, highly non-polar, or intermediate; the most polar or non-polar parts are the side chains. Proteins usually fold with the polar side chains facing outward toward the water and the non-polar side chains facing the interior, although there are usually polar groups in the interior as well. The main chain of amide bonds is fundamentally polar, but it can be buried in the interior because its polar groups associate to form secondary structures such as helices ("alpha") and sheets ("beta"). Some proteins have patches of non-polar surface exposed; these are often involved in binding other proteins, membranes, or non-polar molecules. Some proteins, such as certain storage proteins and intrinsic membrane proteins, have large areas of non-polar surface.

Figure 4. Features in Set 2 along with their dimensions.


Polarizability: Polarizability is the ability of a molecule to be polarized. It is a property of matter that determines the dynamical response of a bound system to external fields and provides insight into a molecule's internal structure.
Chou-Fasman method: The Chou-Fasman method is an empirical technique for the prediction of secondary structures in proteins, originally developed in the 1970s by Peter Y. Chou and Gerald D. Fasman. The method is based on analyses of the relative frequencies of each amino acid in alpha helices, beta sheets, and turns in known protein structures solved by X-ray crystallography. From these frequencies, a set of probability parameters was derived for the appearance of each amino acid in each secondary structure type, and these parameters are used to predict the probability that a given sequence of amino acids will form a helix, a beta strand, or a turn in a protein.
N-gram method: In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words, or base pairs, according to the application; n-grams are typically collected from a text or speech corpus.

An n-gram of size 1 is referred to as a "unigram", size 2 as a "bigram" (or, less commonly, a "digram"), and size 3 as a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on.

An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence, in the form of an (n-1)-order Markov model. N-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression. Two core advantages of n-gram models (and the algorithms that use them) are their relative simplicity and the ability to scale up: by simply increasing n, a model can store more context, with a well-understood space-time tradeoff.

In this paper we used unigrams, bigrams, trigrams, and combinations of the three.
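A sketch of the unigram + bigram extraction over the 20-letter amino acid alphabet; the per-length normalization shown is an assumption for illustration, not necessarily the exact scaling used in our experiments.

```python
# Unigram (20-dim) and bigram (400-dim) composition of a protein sequence.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BIGRAMS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def ngram_features(seq):
    uni = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    bi = [sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == b)
          / max(len(seq) - 1, 1) for b in BIGRAMS]
    return uni + bi  # 420-dimensional feature vector

print(len(ngram_features("ACDEFGHIKLMNPQRSTVWY")))  # 420
```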
D. Feature Extraction

Our approach uses a combination of local and global information about amino acid sequences. Since the pattern recognition methods we use require property vectors as input, the sequence of amino acids is replaced by sequences of symbols representing local physicochemical properties: a protein sequence is represented by a set of parameter vectors based on various physico-chemical and structural properties of the amino acids along the sequence.


These parameter vectors were constructed in two steps.

Step 1. The sequence of amino acids was transformed into sequences of certain physico-chemical or structural properties (attributes) of the residues. The twenty amino acids were divided into three groups for each of six amino acid attributes representing the main clusters of the amino acid indices of Tomii and Kanehisa. Thus, for each attribute, every amino acid was replaced by the index 1, 2, or 3 according to the group to which it belonged.
For each of the chosen attributes, i.e., hydrophobicity, normalized van der Waals volume, polarity, and polarizability, the 20 amino acids were divided into three groups according to the magnitudes of their numerical values. The ranges of these numerical values, and the amino acids belonging to each group, are shown in Table I.

TABLE I. RANGES OF VALUES
Step 2. Three descriptors, composition (C), transition (T), and distribution (D), were calculated for a given attribute. They describe, respectively, the global percent composition of each of the three groups in the protein, the percent frequencies with which the attribute changes its index along the entire length of the protein, and the distribution pattern of the attribute along the sequence.

Figure 5. Overall Flow Diagram

Figure 6. Chou-Fasman Parameters for all amino acids

Figure 7. A Model Sequence


Let us consider the hydrophobicity attribute as an example. All amino acids are divided into three groups: polar, neutral, and hydrophobic. The composition descriptor C consists of three numbers: the global percent compositions of polar, neutral, and hydrophobic residues in the protein. The transition descriptor T also consists of three numbers: the percent frequencies with which (1) a polar residue is followed by a neutral residue or a neutral residue by a polar residue; (2) a polar residue is followed by a hydrophobic residue or a hydrophobic residue by a polar residue; and (3) a neutral residue is followed by a hydrophobic residue or a hydrophobic residue by a neutral residue.

The distribution descriptor D consists of five numbers for each of the three groups: the fractions of the entire sequence at which the first residue of a given group is located, and at which 25%, 50%, 75%, and 100% of the residues of that group are contained. Thus, the complete parameter vector contains 21 components.

Let us take an example to understand this better. Consider the model sequence shown in Figure 7. In this sequence there are 10 residues of type A (n1 = 10) and 16 of type B (n2 = 16). The composition descriptors are therefore n1 x 100.0 / (n1 + n2) = 38.5% for type A and n2 x 100.0 / (n1 + n2) = 61.5% for type B. The transition descriptor characterizes the frequency with which A is followed by B or B is followed by A; here there are ten such transitions, so it is (10/25) x 100.0 = 40%. The third descriptor is a bit more involved. For a given property of the amino acids, the distribution of that property along the protein chain is described by five chain lengths (in percent) within which the first, 25%, 50%, 75%, and 100% of the amino acids with that property are contained. In the example of Figure 7, the first residue of group A coincides with the beginning of the chain, so the first number of the D descriptor equals 0.0. Twenty-five percent of all group A residues (rounded to 2 residues) are contained within the first 4 residues of the chain, so the second number equals (4/26) x 100.0% = 15.4%. Similarly, 50% of the group A residues lie within the first 12 residues of the chain, so the third number is (12/26) x 100.0% = 46.1%. The fourth and fifth numbers of the distribution descriptor are 73.1% and 100%, respectively. The analogous numbers for group B are 7.5%, 23.1%, 53.8%, 79.9%, and 92.3%.
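The computation can be sketched as follows. The input is the Step 1 index sequence (1, 2, or 3 per residue); the position convention in the distribution part is an assumption chosen to follow the worked example approximately (the example treats a group that starts at residue 1 as 0.0).

```python
# C/T/D descriptors for one attribute: 3 + 3 + 15 = 21 components.
def ctd(indices):
    n = len(indices)
    comp = [100.0 * indices.count(g) / n for g in (1, 2, 3)]        # C
    pairs = list(zip(indices, indices[1:]))
    trans = [100.0 * sum(1 for a, b in pairs if {a, b} == {g, h}) / (n - 1)
             for g, h in ((1, 2), (1, 3), (2, 3))]                  # T
    dist = []                                                       # D
    for g in (1, 2, 3):
        pos = [i + 1 for i, x in enumerate(indices) if x == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):  # first, 25%, 50%, 75%, 100%
            k = 1 if frac == 0.0 else max(1, round(frac * len(pos)))
            dist.append(100.0 * pos[k - 1] / n if pos else 0.0)
    return comp + trans + dist
```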

In this way, descriptors are calculated for each of the four attributes (i.e., hydrophobicity, normalized van der Waals volume, polarity, and polarizability). Along with these four, we used two more features, based on secondary structures and on Chou-Fasman parameters.


Predicted secondary structure: The attributes we used also included the predicted secondary structure. Here the indices 1, 2, and 3 correspond to helix, strand, and coil, respectively.
Chou-Fasman parameters: The Chou-Fasman method of secondary structure prediction depends on assigning a set of prediction values to a residue and then applying a simple algorithm to those numbers; the table of values is given in Figure 6. To identify a bend at residue number j, we calculate the value p(t) = f(j) x f(j+1) x f(j+2) x f(j+3), where the f(j+1) value of residue j+1, the f(j+2) value of residue j+2, and the f(j+3) value of residue j+3 are used. We calculate the sum of p(t) over the amino acids of the protein sequence and normalize it by the length of the protein so that the feature is independent of the length.
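A sketch of this length-normalized bend feature; FREQ holds the four positional turn frequencies f(j) through f(j+3) per residue, shown here with placeholder values for two residues only (the actual values appear in Figure 6).

```python
# Bend feature: sum of p(t) = f(j)*f(j+1)*f(j+2)*f(j+3), normalized by length.
FREQ = [{"A": 0.06, "G": 0.10},   # f(j)   -- placeholder values; the real
        {"A": 0.08, "G": 0.09},   # f(j+1)    tables cover all 20 residues
        {"A": 0.07, "G": 0.11},   # f(j+2)
        {"A": 0.07, "G": 0.08}]   # f(j+3)

def bend_feature(seq):
    total = sum(FREQ[0][seq[j]] * FREQ[1][seq[j + 1]] *
                FREQ[2][seq[j + 2]] * FREQ[3][seq[j + 3]]
                for j in range(len(seq) - 3))
    return total / len(seq)  # length normalization, as described above
```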


E. Ensemble

An ensemble approach was used for fold prediction with Set 1 (not with Set 2). In the ensemble approach, training and testing are done independently for each feature; the votes obtained for each label from each feature are added, and the label with the highest number of votes is taken as the predicted output.
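A minimal sketch of this voting scheme; classifiers (one fitted model per feature) and feature_views (the matching test matrices) are illustrative names, not the project's actual identifiers.

```python
# Majority vote across per-feature classifiers; each test sample gets the
# label that received the most votes across the feature-specific models.
from collections import Counter

def ensemble_predict(classifiers, feature_views):
    per_clf = [clf.predict(X) for clf, X in zip(classifiers, feature_views)]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*per_clf)]
```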
III. RESULTS

As part of our first experiment we used features 1-6 from Set 1; these are the standard features used in fold prediction. In our second experiment we added feature 7 of Set 1 (the Chou-Fasman parameters) to the six already present and repeated the experiment. Feature 7 gives more insight into the secondary structure of the protein, and our prediction accuracy improved after its addition. In our next set of experiments we used n-grams as features, going up to trigrams. The unigram + bigram feature gave us the best prediction accuracy in the entire project. The results obtained in the various experiments are shown in Table 2.
TABLE 2: FEATURES AND ACCURACIES

The Set 1 features consist of the 7 features listed in Figure 3; features 1-6 are the comparisons most commonly used for determining protein functions. The Set 2 features consist of the 3 features listed in Figure 4; these have been used in various combinations to obtain the accuracies given in the graphs.

A. Graphs

The following graphs were drawn from the results obtained with the linear kernel and the RBF kernel; the accuracy for each protein fold is compared in the graphs.

Graph 1. Linear Kernel (Percentage vs Fold Number)

In the above graph, the class-wise accuracy percentage is plotted against the fold number for folds 1-27 (the fold labels are listed in Figure 2). Fold 2 has the highest accuracy for the linear kernel.

Graph 2. RBF Kernel (Percentage vs Fold Number)

In the above graph, the class-wise accuracy percentage is plotted against the fold number for folds 1-27. Fold 1 has the highest accuracy for the RBF kernel.

In the graph below, the class-wise accuracy percentage for the Set 2 features is plotted against the fold number for folds 1-27. Fold 5 has the highest accuracy for the linear kernel.


Graph 3. Linear Kernel (Percentage vs Fold Number) - Set 2


Graph 4. RBF Kernel (Percentage vs Fold Number) - Set 2

In the above graph, the class-wise accuracy percentage for the Set 2 features is plotted against the fold number for folds 1-27. Fold 5 has the highest accuracy for the RBF kernel.
B. Discussions

All the features in Set 1 are based on the chemical and physical properties of the amino acids and are very often used in fold recognition systems; the features in Set 2, on the other hand, are purely statistical. From the results table above, it is clear that the RBF kernel performs well compared to the linear kernel: in all the experiments, the RBF kernel's prediction accuracy was higher than the linear kernel's, except with the unigram + bigram + trigram features. The Chou-Fasman parameter gives us more information about whether an amino acid is in a bend or not, and hence about the secondary structure of the protein; adding it improved our accuracy, as can be clearly seen in Table 2. The unigram + bigram feature gave the best accuracy overall, around 50.65%, which is more than 3% higher than the accuracy achieved by the Set 1 features. Even though the overall prediction accuracy is higher for unigram + bigram, the class-wise accuracy of some of the classes (for example, label 2) has come down significantly with it, as is evident when comparing graphs 1 and 3 and graphs 2 and 4.

The database we used is a standard one, but it is an old dataset. The SCOP database, which holds structural information about proteins, including fold-related information, has grown exponentially over the years, and today it is possible to build a better training dataset than the one we used. A more comprehensive training dataset, free of imbalance in the representation of the different folds, would certainly help to improve the prediction accuracy of the methods described.
IV. ACKNOWLEDGMENT

We thank our instructor, Professor Anoop Namboodiri, for allowing us to work on whatever we were interested in, which led us to choose this topic. We would also like to thank our mentor, Siddharth Goyal, for giving us valuable suggestions and for helping us whenever we were stuck.

V. REFERENCES

Ding C, Dubchak I. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 2001;17:349-358.

Chou PY, Fasman GD. Prediction of the secondary structure of proteins from their amino acid sequence. Adv Enzymol Relat Areas Mol Biol 47:45-148.

Dubchak I, Muchnik I, Holbrook SR, Kim S-H. Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci USA 1995;92:8700-8704.

Nakashima H, Nishikawa K, Ooi T. The folding type of a protein is relevant to the amino acid composition. J Biochem 1986;99:152-162.

Chou K-C, Zhang C-T. Prediction of protein structural classes. Crit Rev Biochem Mol Biol 1995;30:275-349.

Charton M, Charton BI. The structural dependence of amino acid hydrophobicity parameters. J Theor Biol 1982;99:629-644.

Tomii K, Kanehisa M. Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng 1996;9:27-36.

Chothia C, Finkelstein AV. The classification and origins of protein folding patterns. Annu Rev Biochem 1990;59:1007-1039.

Marchler-Bauer A, Bryant SH. A measure of success in fold recognition. Trends Biochem Sci 1997;22:236-240.

Bowie JU, Luthy R, Eisenberg D. A method to identify protein sequences that fold into a known three-dimensional structure. Science 1991;253:164-169.

Hubbard TJP, Bart A, Brenner SE, Murzin AG, Chothia C. SCOP: a structural classification of proteins database. Nucleic Acids Res 1999;27:254-256.

Orengo CA, Flores TP, Taylor WR, Thornton JM. Identification and classification of protein fold families. Protein Eng 1993;6:485-500.
