Sricharan Thota
201164157
201064120
201256001
I. INTRODUCTION
1 XYZ Protein
We used the one-vs-other method for classification. In multi-class
classification, the one-vs-one approach takes time of order O(c^2)
for both training and testing, where c is the number of classes.
The one-vs-one approach is therefore much slower, but it offers
improved accuracy over one-vs-other. The one-vs-other approach, on
the other hand, takes time of order O(c) for both training and
testing, which makes it well suited to fold prediction systems.
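To make the O(c^2) versus O(c) comparison concrete, the following minimal sketch counts the binary classifiers each scheme needs and shows one way a one-vs-other decision could be taken. The function names and the "highest score wins" rule are illustrative assumptions, not taken from the project code.

```python
def one_vs_one_count(c: int) -> int:
    """Number of binary classifiers in one-vs-one: one per pair of classes."""
    return c * (c - 1) // 2


def one_vs_other_count(c: int) -> int:
    """Number of binary classifiers in one-vs-other: one per class."""
    return c


def predict_one_vs_other(scores: dict) -> int:
    """Pick the label whose binary classifier gave the highest score.

    `scores` maps label -> decision value of that label's classifier
    (hypothetical input; the scoring function itself is not shown).
    """
    return max(scores, key=scores.get)


if __name__ == "__main__":
    c = 27  # number of folds in the dataset described in this report
    print(one_vs_one_count(c), one_vs_other_count(c))
```

For the 27 folds used here, one-vs-one requires 351 pairwise classifiers, while one-vs-other requires only 27.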
B. Data Set
The dataset used in our project was provided by Professor
Hong-bin Shen of Shanghai Jiao Tong University. The training
dataset consists of 317 proteins belonging to 27 different
folds. In the training data, no two proteins have more than
35% sequence identity for aligned subsequences
longer than 80 residues. The test data consists of 383 proteins
belonging to the 27 folds that are represented in the training
data. The test data contains protein sequences having less than
40% identity with each other. The sequence similarity between
test and training proteins is less than 25% in most cases.
The training data is highly skewed: one label has only 8
samples, whereas another has 30. Below is the graph of the
number of samples versus the fold number (label). The folds
represented in the project are tabulated below, along with their
corresponding labels. To compensate for the imbalance, we
introduced class weights during training. We also performed
ten-fold cross-validation and used the best parameters found
when training the classifiers.
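The report does not state how the weights were computed or how the folds were formed. As a minimal sketch, assuming the common inverse-frequency ("balanced") weighting and plain contiguous folds, the two steps could look like this. Both helper names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: inverse-frequency weighting; rare classes get larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def k_fold_indices(n_samples, k=10):
    """Split indices 0..n_samples-1 into k contiguous (train, test) folds.

    Fold sizes differ by at most one. No shuffling or stratification is
    done here; a real experiment would likely want both.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


if __name__ == "__main__":
    # Toy labels mimicking the skew described above: 8 vs 30 samples.
    labels = [1] * 8 + [2] * 30
    print(balanced_class_weights(labels))
    print(len(k_fold_indices(317, k=10)))
```

With this scheme the class with 8 samples receives a weight several times larger than the class with 30 samples, so errors on rare folds cost more during training.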
C. Features
Amino acid composition: This feature deals with the
composition of the protein and its amino acid sequence. It is
E. Ensemble
An ensemble approach was used to predict the fold for Set-1
(but not for Set-2). In the ensemble approach, training and
testing are done independently for each feature. The votes
obtained for each label are then summed across the features,
and the label with the highest total number of votes is taken
as the predicted output.
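The vote-summing step described above can be sketched as follows. The input format (one label-to-votes mapping per feature) and the tie-breaking rule are assumptions made for illustration; the report does not specify either.

```python
from collections import Counter


def ensemble_predict(votes_per_feature):
    """Sum per-label votes across features and return the winning label.

    `votes_per_feature` is a list with one dict per feature, each mapping
    label -> number of votes that feature's classifier gave the label.
    Ties are broken in favour of the smallest label (an assumption).
    """
    total = Counter()
    for votes in votes_per_feature:
        total.update(votes)  # adds the vote counts label by label
    # Iterating labels in sorted order makes max() return the smallest
    # label among those sharing the highest total.
    return max(sorted(total), key=lambda label: total[label])


if __name__ == "__main__":
    # Hypothetical votes from three features for one test protein.
    votes = [{1: 3, 2: 1}, {1: 0, 2: 2}, {3: 2, 1: 1}]
    print(ensemble_predict(votes))
```

Because each feature is trained and tested independently, adding or removing a feature only changes the list passed to the function.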
III. RESULTS
A. Graphs
The following graphs were drawn from the results we obtained
by comparing the linear kernel and the RBF kernel.
IV. ACKNOWLEDGMENT
We thank our instructor, Professor Anoop Namboodiri, for
allowing us to work on a topic of our own interest, which
led us to choose this one. We would also like to thank our
mentor, Siddharth Goyal, for giving us valuable suggestions
and helping us whenever we were stuck.
V. REFERENCES