www.seipub.org/rbb
A Survey: Evaluation of Ensemble Classifiers
and Data Level Methods to Deal with
Imbalanced Data Problem in Protein-Protein
Interactions
Seena Mary Augusty*1, Sminu Izudheen2
Department of Computer Science, Rajagiri School of Engineering and Technology, India
*1seenamaryaugusty@gmail.com; 2sminu_i@rajagiritech.ac.in
each has its own advantage over the other. The
significant difficulty and frequent occurrence of the
class imbalance problem indicate the need for extra
research efforts. This paper extensively evaluates
recent developments in the field of solving the imbalanced
data problem and subsequently classifies the new
solutions under each category. Finally, it proposes a
slight enhancement to the solution in (Yongqing et al 2012),
integrated cluster under-sampling with ensemble classifiers,
by replacing Bagging and Adaboost with Random Forest.
The combining method employed is majority voting of all
the decision trees.
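The combining step can be illustrated with a minimal sketch, assuming each decision tree is available as a callable that returns a class label. The function names and the toy threshold rules are hypothetical and are not taken from (Yongqing et al 2012).

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    if not tree_predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]


def forest_predict(trees, x):
    """Combine the outputs of all trees for sample x by majority voting."""
    return majority_vote([tree(x) for tree in trees])


# Three toy "trees" (hypothetical threshold rules on a single feature).
trees = [
    lambda x: "interacting" if x[0] > 0.3 else "non-interacting",
    lambda x: "interacting" if x[0] > 0.5 else "non-interacting",
    lambda x: "interacting" if x[0] > 0.7 else "non-interacting",
]
label = forest_predict(trees, [0.6])  # two of the three trees vote "interacting"
```

In a real Random Forest each tree would be grown on a bootstrap sample with random feature selection; only the voting step is shown here.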
Abstract
Over the past few decades, protein interactions have gained
importance in many applications of prediction and data
mining. They aid in cancer prediction and in the diagnosis of
various other diseases. The imbalanced data problem in protein
interactions can be resolved at both the data level and the
algorithmic level. This paper surveys and evaluates various
methods applicable at the data level as well as ensemble
methods at the algorithmic level. Cluster-based under-sampling
and over-sampling, along with other data-based methods, were
evaluated at the data level. Ensemble classifiers were
evaluated at the algorithmic level. Unstable base classifiers
such as SVM and ANN can be employed in ensemble
classifiers such as Bagging, Adaboost, Decorate, ensemble
non-negative matrix factorization and so on. Random Forest
can improve on Bagging and Adaboost in ensemble classification
when dealing with the imbalanced data problem for
high-dimensional data.
Keywords
Imbalanced Data Problem in Various Disciplines
Bagging; Adaboost; Decorate; Over-sampling; Under-sampling
Introduction
Protein-protein interactions play numerous important
roles in many cellular functions, including DNA
replication, transcription and translation, signal
transduction, and metabolic pathways, thereby
aiding in the diagnosis and prognosis of various diseases.
Recently, a huge increase in the number of protein-protein
interactions has made the prediction of
unknown protein-protein interactions important for
the understanding of living cells. However, the
protein-protein interactions experimentally obtained
so far are often incomplete and contradictory and,
consequently, computational methods now have the
upper hand in predictions. These prediction methods
have been reviewed for classification under which
higher misclassification occurs at the minority samples,
and minimisation of high-cost errors is sought.
FIG. 1 CLASSIFICATION HIERARCHY
1) Preprocessing
3) Under Sampling
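Under-sampling balances a data set by discarding samples from the larger class. The following is a minimal sketch of the simplest variant, random under-sampling, and not of the cluster-based variant evaluated in this survey. The function name, the seed parameter and the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict


def random_under_sample(samples, labels, seed=0):
    """Randomly drop samples so every class is as small as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    # Size of the minority class: every class is cut down to this size.
    target = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        kept = rng.sample(xs, target)  # sampling without replacement
        balanced.extend((x, y) for x in kept)
    rng.shuffle(balanced)
    new_samples = [x for x, _ in balanced]
    new_labels = [y for _, y in balanced]
    return new_samples, new_labels


# Toy data: 8 negative (non-interacting) pairs and 2 positive pairs.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_under_sample(X, y)
```

A cluster-based variant would first group the majority class and then keep representatives of each group rather than drawing uniformly at random.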
At Algorithmic Level
2) Ensemble Classifiers
In recent years there has been development in the
field of ensemble classifiers, in which the
advantages of all the single classifiers are combined
to yield a better prediction. Ensemble
methods are widely used in various disciplines,
such as in (Larry Shoemaker et al 2008), where
classifier ensembles are used to label spatially
disjoint data. The combining method employed
there is probabilistic majority voting.
A combination of ensemble learning with cost-sensitive
learning is proposed in a different realm in
(Jin Xiao et al 2012). These techniques can be
utilised in the protein interaction domain, as it also
deals with the imbalanced data problem. In this paper
(Jin Xiao et al 2012), the combination of ensemble learning
with cost-sensitive learning yields a new method
known as the dynamic classifier ensemble method for
imbalanced data (DCEID). New cost-sensitive
selection criteria for Dynamic Classifier
Selection (DCS) and Dynamic Ensemble Selection
(DES) are constructed, respectively, to enhance the
classification capability for imbalanced data. In the
pattern recognition realm, feature extraction is seen
as an imbalanced data problem for both negative and
positive features. This method (Jinghua Wang et al
2012) can be generalised to all domains. It covers two
models, in which the first model relates to candidate
extractors for minimising the other class and the
second does the reverse. This combination is less
likely to be affected by the imbalanced data
problem. Ensemble methods using the binarization
technique, focusing on the one-vs-one and one-vs-all
decomposition strategies, proved to be efficient in
(Mikel Galar et al 2011) for solving multiclass
problems. There, an empirical analysis of different
aggregations is used to combine the outputs. In the
neurocomputing domain, model parameter
selection via alternating SVM and gradient steps to
minimise generalization error is employed (Todd
W. Schiller et al 2010), which can be extended to the
protein interaction domain. An ensemble of SVMs
proved to be effective in this case. The protein
subcellular location is studied through the CE-PLoc
learning mechanism, an ensemble approach
combining the predictions of base learners such
as SVM, nearest neighbour, probabilistic neural
network and covariant discriminant, which produced a
prediction accuracy of about 81.47% using the
jackknife test. Classifier ensemble selection can be done
using a hybrid genetic algorithm (Young-Won Kim
et al 2008). An ensemble can be constructed carefully,
emphasising the accuracies of the individual
classifiers, based on the use of supervised
projections, both linear and nonlinear (Nicolás
García-Pedrajas et al 2011).
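The one-vs-one decomposition mentioned above (Mikel Galar et al 2011) can be sketched as follows: one binary classifier is trained per pair of classes, and the outputs are aggregated by voting. This is a minimal illustration under stated assumptions. The nearest-class-mean base learner on one-dimensional features is a hypothetical stand-in for any binary classifier, and plain voting is only one of the aggregations studied in that work.

```python
from collections import Counter
from itertools import combinations


def train_pair_classifier(samples, labels, class_a, class_b):
    """Toy binary learner for one class pair: nearest class mean on 1-D data."""
    mean = {}
    for c in (class_a, class_b):
        values = [x for x, y in zip(samples, labels) if y == c]
        mean[c] = sum(values) / len(values)

    def predict(x):
        # Pick whichever of the two classes has the closer mean.
        return min((class_a, class_b), key=lambda c: abs(x - mean[c]))

    return predict


def train_one_vs_one(samples, labels):
    """Train one binary classifier for every unordered pair of classes."""
    classes = sorted(set(labels))
    return [train_pair_classifier(samples, labels, a, b)
            for a, b in combinations(classes, 2)]


def predict_one_vs_one(classifiers, x):
    """Aggregate the pairwise outputs by simple voting."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


X = [0.0, 0.2, 1.0, 1.2, 2.0, 2.2]
y = ["a", "a", "b", "b", "c", "c"]
ovo = train_one_vs_one(X, y)
prediction = predict_one_vs_one(ovo, 1.1)
```

With three classes this trains three pairwise classifiers. A one-vs-all scheme would instead train one classifier per class against all remaining classes.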
Meta Learning Regime
Protein structure classification has been performed with meta
learners, including boosted and bagged meta learners, but
Random Forest outperformed all the other meta
learners with a cross-validated accuracy of 97.0%.
Bagging and Adaboost can generally be adapted for
use in vector quantization (Noritaka Shigei et al
2009). Bagging allows weak learners to be trained in
parallel, since a random dataset is used for training.
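The point about independent training can be seen in a minimal Bagging sketch. Each learner is fitted on its own bootstrap sample and never looks at the other learners, so the training calls could run in parallel, although they run sequentially here for simplicity. The threshold "stump" learner, the function names and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(data):
    """Toy 1-D weak learner: threshold at the midpoint of the two class means.

    data is a list of (x, label) pairs with labels 0 and 1. If a bootstrap
    sample happens to contain a single class, a constant predictor is returned.
    """
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    if not xs0 or not xs1:
        constant = 1 if xs1 else 0
        return lambda x: constant
    m0 = sum(xs0) / len(xs0)
    m1 = sum(xs1) / len(xs1)
    threshold = (m0 + m1) / 2
    high_label = 1 if m1 > m0 else 0
    return lambda x: high_label if x > threshold else 1 - high_label


def bagging(data, n_learners=11, seed=0):
    """Train each learner independently on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_learners)]


def bagging_predict(learners, x):
    """Combine the learners by majority voting."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]


data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.9, 1), (1.0, 1), (1.1, 1), (1.2, 1)]
learners = bagging(data)
low = bagging_predict(learners, 0.0)
high = bagging_predict(learners, 1.5)
```

Adaboost differs in that each learner depends on the errors of the previous one, so its training rounds cannot be run independently in this way.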
Combining Methods
Combining methods are employed to evaluate and
specify one final result for the ensemble of predictions.
Various combining methods from the literature are
evaluated in (Lior Rokach et al 2010) and are as
follows.
Uniform Voting: In uniform voting, each classifier
has the same weight. An unlabeled instance is
classified according to the class that
obtains the highest number of votes. Mathematically it
can be written as:
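One standard way to write this rule is given below. The notation is assumed here rather than taken from the source: y_k(x) is the class predicted by the k-th classifier for instance x, dom(y) is the set of class labels, and g is an indicator function.

```latex
\mathrm{class}(x) = \operatorname*{arg\,max}_{c_i \in \mathrm{dom}(y)} \sum_{k} g\bigl(y_k(x), c_i\bigr),
\qquad
g(y, c) =
\begin{cases}
1 & \text{if } y = c, \\
0 & \text{otherwise.}
\end{cases}
```

Each classifier contributes exactly one vote, so the sum counts how many classifiers chose class c_i.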
where A is a normalization factor defined as:
FIG. 2 MOVING AVERAGE SCORE OF EACH LEARNING
ALGORITHM AGAINST DIMENSIONS
Conclusion
Numerous solutions to the imbalanced data problem are
thoroughly studied in this paper. These solutions have
been classified under various levels, namely the data level
and the algorithmic level. A detailed study of one paper led to
the conclusion that there is scope for replacing
Bagging and Adaboost with the Random Forest method, as
it can deal with high-dimensional data very well, based
on the extensive study made of this domain. As
future work, a comparative evaluation of an ensemble of
ensemble classifiers on high-dimensional data can be
studied.
where:
where α_k denotes the weight of the k-th classifier, such
that:
REFERENCES
Comparative Study
trained with data resampling strategy to improve cardiac
Medicine, Volume 41, Issue 5, Pages 265-271.
Akin Ozcift, Arif Gulten, December 2011, Classifier
ensemble construction with rotation forest to improve
medical diagnosis performance of machine learning
algorithms, Computer Methods and Programs in
Biomedicine, Volume 104, Issue 3, Pages 443-451.
Alberto Fernández, María José del Jesus, Francisco Herrera,
August 2009, On the influence of an adaptive inference
system in fuzzy rule based classification systems for
imbalanced data sets, Expert Systems with Applications,
Volume 36, Issue 6, Pages 9805-9812.
Volume 15, Issue 8, Pages 797-801.
Asifullah Khan, Abdul Majid, Maqsood Hayat, August 2011,
Hemant Ishwaran, Udaya B. Kogalur, July 2010, Consistency
of random survival forests, Statistics & Probability Letters,
subcellular locations by fusing different modes of pseudo
Volume 80, Issues 13-14, 1-15, Pages 1056-1064.
Hung-Yi Lin, June 2012, Efficient classifiers for multiclass
Chemistry, Volume 35, Issue 4, Pages 218-229.
Volume 53, Issue 3, Pages 473-481.
learning, Neurocomputing, Volume 73, Issues 1-3, Pages
new weighted approach to imbalanced data classification
397-411.
problem via support vector machine with quadratic cost
function, Expert Systems with Applications, Volume 38,
Issue 7, Pages 8580-8585.
Joint Conference on Artificial Intelligence (IJCAI-01).
customer
Soft Computing, Volume 12, Issue 8, Pages 2481-2485.
Recognition, Volume 45, Issue 3, Pages 1136-1145.
Martínez-Muñoz,
bagging
ensembles,
class
Gonzalo
imbalanced
39, Issue 3, Pages 3668-3675.
Applications, Volume 34, Issue 4, Pages 3021-3032.
Hernández-Lobato,
with
Daniel
classification
Using classifier ensembles to label spatially disjoint data,
Neurocomputing,
Information Fusion, Volume 9, Issue 1, Pages 120-133.
Volume 74, Issues 12-13, Pages 2250-2264.
Mining and Knowledge Discovery Handbook.
Volume 40, Issue 5, Pages 509-518.
Systems, Volume 53, Issue 1, April 2012, Pages 226-233.
Planning and Inference, Volume 141, Issue 1, Pages 597-601.
CO2RBFN, for imbalanced data sets, Pattern Recognition
Letters, Volume 31, Issue 15, Pages 2375-2388.
Pedro Antonio Gutiérrez, August 2011, A dynamic over
Humberto Bustince, Francisco Herrera, August 2011, An
problems, Pattern Recognition, Volume 44, Issue 8, Pages
1821-1833.
multiclass problems: Experimental study on one-vs-one
and one-vs-all schemes, Pattern Recognition, Volume 44,
Issue 8, Pages 1761-1776.
2011, A combined SMOTE and PSO based RBF classifier
2008, An Empirical Evaluation of Supervised Learning in
Volume 74, Issue 17, Pages 3456-3466.
Conference on Machine Learning, Helsinki, Finland, 2008.
3214-3227.
Salvador García, Joaquín Derrac, Isaac Triguero, Cristóbal J.
Carmona,
Evolutionary-based selection of generalized instances for
3750.
Francisco
Herrera,
February
2012,
Volume 25, Issue 1, Pages 3-12.
distributions, Expert Systems with Applications, Volume
Pages 343-359.
36, Issue 3, Part 1, Pages 5718-5727.
Issues 1-3, Pages 106-114.
Shuxue Zou, Yanxin Huang, Yan Wang, Chunguang Zhou,
Recognition, Volume 44, Issue 8, Pages 1801-1810.
Protein Domain Using Distance-Based Maximal Entropy,
Pilsung Kang, Sungzoon Cho, Douglas L. MacLachlan, June
215-223.
Simon Bernard, Sébastien Adam, Laurent Heutte, September
2012, Dynamic Random Forests, Pattern Recognition
Applications, Volume 39, Issue 8, Pages 6738-6753.
Letters, Volume 33, Issue 12, Pages 1580-1586.
and Chemistry, Volume 33, Issue 3, Pages 216-223.
injury risk with an ensemble of support vector machines,
Prem Melville, Nishit Shah, Lilyana Mihalkova, Raymond J.
1867.
Missing and Noisy Data, Proceedings of 5th International
Springer-Verlag.
6585-6608.
Newsletter 6, 60-69.
imbalanced data sets, Expert Systems with Applications,
Volume 37, Issue 12, Pages 8303-8312, December 2010.
Applications of Artificial Intelligence, Volume 21, Issue 5,
Pages 785-795.
Yang Liu, Xiaohui Yu, Jimmy Xiangji Huang, Aijun An, July
2011, Combining integrated sampling with SVM
imbalanced
Volume 36, Pages 36-41.
Issue 4, Pages 617-631.
data
in
predicting
protein-protein
Young-Won Kim, Il-Seok Oh, April 2008, Classifier ensemble
Recognition Letters, Volume 29, Issue 6, Pages 796-802.
Part A, Pages 164-170.
Yanmin Sun, Mohamed S. Kamel, Andrew K. C. Wong, Yang
criterion
for
classification
of
imbalanced
data,