A Survey: Evaluation of Ensemble Classifiers and Data Level Methods To Deal With Imbalanced Data Problem in Protein-Protein Interactions

Review of Bioinformatics and Biometrics (RBB) Volume 2 Issue 1, March 2013
www.seipub.org/rbb
Seena Mary Augusty*1, Sminu Izudheen2
Department of Computer Science, Rajagiri School of Engineering and Technology, India
*1seenamaryaugusty@gmail.com; 2sminu_i@rajagiritech.ac.in
Abstract
Over the past few decades, protein interactions have gained importance in many applications of prediction and data mining. They aid in cancer prediction and in the diagnosis of various other diseases. The imbalanced data problem in protein interactions can be resolved at both the data level and the algorithmic level. This paper surveys and evaluates various methods applicable at the data level as well as ensemble methods at the algorithmic level. Cluster-based under-sampling and over-sampling, along with other data-based methods, were evaluated at the data level. Ensemble classifiers were evaluated at the algorithmic level. Unstable base classifiers such as SVM and ANN can be employed in ensemble classifiers such as Bagging, Adaboost, Decorate, ensemble non-negative matrix factorization and so on. Random Forest can improve ensemble classification on the imbalanced data problem over the Bagging and Adaboost methods for high-dimensional data.

Keywords
Bagging; Adaboost; Decorate; Over Sampling; Under Sampling

Introduction
Protein-protein interactions play numerous important roles in many cellular functions, including DNA replication, transcription and translation, signal transduction, and metabolic pathways, thereby aiding in the diagnosis and prognosis of various diseases. Recently, a huge increase in the number of known protein-protein interactions has made the prediction of unknown protein-protein interactions important for the understanding of living cells. However, the protein-protein interactions experimentally obtained so far are often incomplete and contradictory and, consequently, computational methods now have the upper hand in prediction. These prediction methods have been reviewed for classification, under which each has its own advantage over the other. The significant difficulty and frequent occurrence of the class imbalance problem indicate the need for extra research efforts. This paper extensively evaluates recent developments in the field of solving the imbalanced data problem and subsequently classifies the new solutions under each category. Finally, it proposes a slight enhancement to the solution of (Yongqing et al 2012), which integrates cluster under-sampling with ensemble classifiers, by replacing Bagging and Adaboost with Random Forest. The combining method employed is majority voting over all the decision trees.

Reviews under each Category
The insights gained from a comprehensive analysis of various solutions for handling the imbalanced data problem are reviewed in this paper.

Imbalanced Data Problem in Various Disciplines
The imbalanced data problem arises when the number of interacting pairs is very much smaller than the number of non-interacting pairs. The former are known as positive samples and the latter as negative samples. Proteins in the same sub-cellular location are seen as positive samples, and proteins not in the same sub-cellular location are seen as negative samples. Various methods for solving the imbalanced data problem exist across a wide range of applications, and they can be checked for compatibility with the protein interaction domain. One such generalisation of binary cases is described in (Victoria López et al 2012). It focuses on the intrinsic behaviours of the imbalanced data problem, such as class overlap and dataset shift. It is a cost-sensitive learning solution that integrates models at both the data level and the algorithmic level, under the assumption that higher misclassification costs occur at the minority samples, and it seeks to minimise high-cost errors.
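The cost-sensitive assumption can be made concrete with a small sketch. It is not taken from (Victoria López et al 2012); it assumes the standard expected-cost decision rule, and the cost values, the class labels and the function name `min_expected_cost_class` are hypothetical choices made for illustration.

```python
def min_expected_cost_class(probs, cost):
    """Return the class with the lowest expected misclassification cost.

    probs: dict mapping class label -> estimated P(class | x)
    cost:  dict mapping (predicted, actual) -> cost of that outcome
    """
    expected = {}
    for predicted in probs:
        # Expected cost of predicting `predicted`, averaged over the true class.
        expected[predicted] = sum(
            probs[actual] * cost[(predicted, actual)] for actual in probs
        )
    return min(expected, key=expected.get)


# Hypothetical costs: missing an interacting (positive) pair is assumed to be
# ten times worse than a false alarm on a non-interacting (negative) pair.
COST = {
    ("pos", "pos"): 0.0, ("pos", "neg"): 1.0,
    ("neg", "pos"): 10.0, ("neg", "neg"): 0.0,
}

if __name__ == "__main__":
    # A pair with only a 20% estimated probability of interacting.
    print(min_expected_cost_class({"pos": 0.2, "neg": 0.8}, COST))
```

With these assumed costs the minority class is chosen even though it is less probable, because the expected cost of predicting "neg" (0.2 x 10) exceeds that of predicting "pos" (0.8 x 1).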
The imbalanced data problem is relaxed by unsupervised self-organising learning with support vector ranking, as described in (Yok-Yen Nguwi et al 2010). In this method, variables are selected by a model adopted from support vector machines. The Emergent Self-Organising Map (ESOM) is used to cluster the ranked features so as to provide unsupervised cluster classification. A Kolmogorov-Smirnov statistic based decision tree method (K-S tree) (Rongsheng Gong et al 2012) is a recent method in which a complex problem is divided into several easier sub-problems, so that the imbalanced distribution becomes less daunting. This method is also used for feature selection, removing redundant features. After division, two-way resampling is employed to determine the optimal sampling criteria, and the rebalanced data is used to build logistic regression models. The distribution of the dataset is thus used to the method's advantage. Recently, information granulation based data mining (Mu-Chen Chen et al 2008), which draws on the human ability to process information, has gained wide acceptance for tackling the imbalanced data problem. Balancing the accuracies over the classes may increase the accuracy on the minority class while decreasing it on the other. The multi-objective optimisation approach for class imbalance learning (Paolo Soda et al 2011) therefore pursues global accuracy by choosing, through a parameter estimated on the validation set, between the output of a classifier trained on the original skewed distribution and the output of a classifier trained according to a learning method addressing the course of imbalanced data. Figure 1 shows the classification under which each solution can be categorised.

At Data Level
Sampling is done at the data level, in which either the minority sample size is increased, as in over-sampling, or the majority sample size is reduced, as in under-sampling. Methods utilising these two techniques are reviewed under each category. Both under-sampling and over-sampling can be combined with an ensemble of SVMs, which can improve prediction as mentioned in (Yang Liu et al 2011). Pre-processing is an important tool for dealing with the uneven distribution of a dataset. The paper (Alberto Fernández et al 2009) revisits the concept of an adaptive inference system with parametric conjunction operators for fuzzy rule based classification systems.

FIG. 1 CLASSIFICATION HIERARCHY

1) Preprocessing
Another way of tackling the inequitable distribution of a dataset is to pre-process the data before the learning process. The paper (Salvador García et al 2012) presents exemplar-based learning, which accomplishes the learning process by storing entities in Euclidean n-space. Prediction for incoming data is performed by computing the distance to the nearest exemplar. The exemplars are chosen by evolutionary algorithms. An analysis of an evolutionary RBFN design algorithm, CO2RBFN, an evolutionary cooperative-competitive model for imbalanced data sets, is made in (María Dolores Pérez-Godoy et al 2010). It can work well with pre-processing methods such as SMOTE. In (Francisco Fernández-Navarro et al 2011), in the first stage an over-sampling procedure is applied to the minority class to partially balance the size of the classes. Then a memetic algorithm (MA) is run, and the data are over-sampled again in different generations of the evolution, generating new patterns of the minimum-sensitivity class. The MA optimises radial basis function neural networks (RBFNNs). The methods compared include different over-sampling procedures in the pre-processing stage, a threshold-moving method in which the output threshold is moved toward inexpensive classes, and ensemble approaches combining the models obtained with these techniques, which overcome the imbalanced data problem to a great extent.
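The over-sampling step used by several of the methods above can be sketched as follows. This is a simplified, SMOTE-style interpolation written for illustration, assuming numeric feature vectors and Euclidean distance; the function name `smote_like_oversample` and its parameters are hypothetical, and it is not the exact procedure of any cited paper.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the core idea of SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


if __name__ == "__main__":
    minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
    for point in smote_like_oversample(minority, n_new=3):
        print(point)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stay inside the region already occupied by the minority class, which is the property that distinguishes this scheme from plain duplication.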
Preprocessing of unbalanced data using SVM (M.A.H. Farquad et al 2012) first employs an SVM as a pre-processor; the target values are then replaced by the predictions of the trained SVM, and the resulting data are in turn used to train a Multilayer Perceptron (MLP), Logistic Regression (LR), and Random Forest (RF). This method efficiently tackles the uneven distribution of a dataset.

2) Over Sampling
The approach proposed in (Yang Yong et al 2012), in which minority samples are clustered using K-means and a genetic algorithm is subsequently used to generate new samples that carry valid information, could be employed to enhance the performance on the minority class in an imbalanced data set. A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems (Ming Gao et al 2011) is a powerful technique that integrates the synthetic minority over-sampling technique (SMOTE), particle swarm optimisation (PSO) and a radial basis function (RBF) classifier. Synthetic instances of the positive class are generated by SMOTE in order to balance the training data set. The RBF classifier is then constructed on the over-sampled training data. Cluster-based under-sampling is demonstrated to be effective in (Show-Jane Yen et al 2009) for solving imbalanced distributions by removing the clusters of the majority class that are nearer to the minority class. Over-sampling can be done by simple random sampling, in which the high variance produced by the Horvitz-Thompson estimator is used as the paramount characteristic for resampling. In (Nicolás García-Pedrajas et al 2011), misclassified instances are used to find supervised projections, and over-sampling concepts are also defined.

3) Under Sampling
Cluster-based under-sampling is prominent in (Show-Jane Yen et al 2009), which aims at resolving imbalanced data distributions. Training data selection needs to be handled with care, because a classifier will tend to predict that incoming data belong to the majority class if most of the representative data are taken from the majority class. This is where under-sampling becomes relevant to imbalanced data distributions. Protein domain detection (Shuxue Zou et al 2008) was first treated as an imbalanced data learning problem, with a method based on analysing multiple sequence alignments. An under-sampling method based on distance-based maximal entropy in the feature space of SVM is put forward. Consequently, it helps in predicting the 3D structure of a protein as well as in machine learning systems on imbalanced datasets. The imbalanced data problem is dealt with in (Der-Chiang Li et al 2010) by applying a mega-trend diffusion membership function to the minority class, and by building up a Gaussian-type fuzzy membership function and α-cut to reduce the data size. Clustering-based under-sampling followed by ensembling is found to be effective for unbalanced data, as discussed in (Pilsung Kang et al 2012). A novel approach, inverse random under-sampling (IRUS), is proposed in (Muhammad Atif Tahir et al 2012). A composite decision boundary is constructed between the majority class and the minority class based on training sets produced by extensively under-sampling the majority class. Promising results have been reported for this technique, which outperforms the other classical under-sampling techniques. The condensed nearest neighbour rule stores a subset of the dataset that allows an efficient implementation of the nearest neighbour decision rule. Tomek found yet another subset that keeps the training set consistent, known as Tomek links (in the Gabriel graph). A counterexample to Tomek's consistency theorem is given in (Godfried T Toussaint et al 1994). This paves yet another path to solving the imbalanced data problem at the data level. Cost-sensitive learning (Charles Elkan et al 2001) can be applied for optimal cost-sensitive classification, which changes the proportion of negative samples and is thus generalised under the under-sampling technique.

At Algorithmic Level
Learning and building of models are accomplished at the algorithmic level. Either a single classifier or an ensemble of classifiers can be employed. Algorithms are classified according to these criteria.
1) Single Classifiers and Computational Methods
Margin calibration in SVM class-imbalanced learning (Chan-Yun Yang et al 2009) uses the identification of a reference cost-sensitive prototype as a penalty-regularised model. This method adopts an inversely proportional regularised penalty to re-weight the imbalanced classes. The two regularisation factors, penalty and margin, then together yield an unbiased classification.
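The inversely proportional re-weighting idea can be illustrated with a short sketch. It assumes the simplest form of such a penalty, where each class weight is proportional to the inverse of its frequency; the exact penalty used in (Chan-Yun Yang et al 2009) may differ, and `inverse_frequency_weights` is a hypothetical helper name.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so that
    rarer classes receive proportionally larger misclassification penalties."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


if __name__ == "__main__":
    # 90 non-interacting (negative) pairs and 10 interacting (positive) pairs.
    labels = ["neg"] * 90 + ["pos"] * 10
    print(inverse_frequency_weights(labels))
```

Weights of this kind could then be supplied as per-class penalties to a cost-sensitive learner, so that an error on a minority sample costs more than an error on a majority sample.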
Imbalanced learning tasks cannot be handled well by a conventional SVM, as it tends to classify entities into the majority class, which is the less important class. To solve this problem, a method known as learning SVM with weighted maximum margin criterion for classification of imbalanced data (Zhuangyuan Zhao et al 2011) is exploited. Here a weighted maximum margin criterion is used to optimise the data-dependent kernel, giving the minority class the chance of being more clustered. In the approach to the imbalanced data classification problem via SVM with quadratic cost function (Jae Pil Hwang et al 2011), the weight parameters are embedded in the Lagrangian SVM formulation. When protein data sets are stored in a multi-relational database (Chien-I Lee et al 2008), a multi-relational g-mean decision tree algorithm is used to solve the imbalanced data problem. Multivariate statistical analyses are used to improve the efficiency of classifiers in (Hung-Yi Lin et al 2012). These analyses solve problems that are stalled by high dimensionality and hence improve classification training time. A novel approach combining ANOVA (analysis of variance), FCM (fuzzy clustering algorithm) and BFO (bacterial foraging optimisation) is put forward as a new computational method for unbalanced data (Chou-Yuan Lee et al 2012): beneficial feature subsets are first selected (by ANOVA), the data are then clustered into membership degrees (by FCM), and finally convergence to a global optimum is provided (by BFO). Two-class learning for SVM (Raskutti B. et al 2004) is investigated, in which aggressive dimensionality reduction is done to improve the classification.
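The g-mean criterion mentioned above can be sketched as follows, assuming the usual definition for binary problems: the geometric mean of the accuracy on the positive class and the accuracy on the negative class. How (Chien-I Lee et al 2008) use it inside tree induction is not reproduced here, and the function name `g_mean` is a hypothetical choice.

```python
import math


def g_mean(y_true, y_pred, positive="pos"):
    """Geometric mean of sensitivity (true positive rate) and
    specificity (true negative rate) for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    n_pos = sum(t == positive for t in y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present in y_true")
    return math.sqrt((tp / n_pos) * (tn / n_neg))


if __name__ == "__main__":
    y_true = ["pos"] * 10 + ["neg"] * 90
    # A classifier that always answers "neg" is 90% accurate overall but
    # never finds a minority sample, which the g-mean exposes.
    always_neg = ["neg"] * 100
    print(g_mean(y_true, always_neg))
```

Unlike plain accuracy, this measure collapses to zero whenever one class is never predicted correctly, which is why it suits imbalanced data.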
2) Ensemble Classifiers
In recent years there has been development in the field of ensemble classifiers, in which the advantages of single classifiers are combined to yield a better prediction. Ensemble methods are widely used in various disciplines, such as in (Larry Shoemaker et al 2008), where classifier ensembles are used to label spatially disjoint data. The combining method employed there is probabilistic majority voting. A combination of ensemble learning with cost-sensitive learning is proposed in a different realm in (Jin Xiao et al 2012). These techniques can be utilised in the protein interaction domain, as it too deals with the imbalanced data problem. In (Jin Xiao et al 2012), the combination of ensemble learning with cost-sensitive learning yields a new method known as the dynamic classifier ensemble method for imbalanced data (DCEID). New cost-sensitive selection criteria for Dynamic Classifier Selection (DCS) and Dynamic Ensemble Selection (DES) are constructed respectively to enhance the classification capability for imbalanced data. In the pattern recognition realm, feature extraction is seen as an imbalanced data problem for both negative and positive features. The method of (Jinghua Wang et al 2012) can be generalised to all domains. It covers two models: the first relates to candidate extractors that minimise one class, and the second does the reverse for the other class. This combination is less likely to be affected by the imbalanced data problem. Ensemble methods using binarization techniques, focusing on one-vs-one and one-vs-all decomposition strategies, proved to be efficient in (Mikel Galar et al 2011) for solving multi-class problems. There, an empirical analysis of different aggregations is used to combine the outputs. In the neurocomputing domain, model parameter selection via alternating SVM and gradient steps to minimise generalisation error is employed in (Todd W. Schiller et al 2010), which can be extended to the protein interaction domain. An ensemble of SVMs proved to be effective in this case. The protein sub-cellular location is studied through the CE-PLoc learning mechanism, an ensemble approach combining the predictions of base learners such as SVM, nearest neighbour, probabilistic neural network and covariant discriminant, which produced a prediction accuracy of about 81.47% using the jackknife test. Classifier ensemble selection can be done using a hybrid genetic algorithm (Young-Won Kim et al 2008). Ensembles can also be constructed with careful emphasis on the accuracies of the individual classifiers, based on the use of supervised projections, both linear and nonlinear (Nicolás García-Pedrajas et al 2011).
Meta Learning Regime
Protein structure classification has been carried out with boosted and bagged meta-learners, but Random Forest outperformed all the other meta-learners with a cross-validated accuracy of 97.0%. Bagging and Adaboost can also be adapted for use in vector quantization (Noritaka Shigei et al 2009). Bagging allows weak learners to be trained in parallel, since a random data set is used for training each of them, whereas Adaboost trains weak learners sequentially, since previously misclassified data are given a higher probability of being chosen in the next learning round.
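The contrast between the two sampling schemes can be sketched as follows. This is an illustrative simplification, not code from (Noritaka Shigei et al 2009): the helper names are hypothetical, and the boosting step assumes the standard AdaBoost re-weighting rule for binary labels with a weighted error strictly between 0 and 0.5.

```python
import math
import random


def bagging_samples(n_items, n_learners, seed=0):
    """Bagging: every learner gets an independent bootstrap sample, so the
    samples (and hence the learners) can be prepared in parallel."""
    rng = random.Random(seed)
    return [
        [rng.randrange(n_items) for _ in range(n_items)]
        for _ in range(n_learners)
    ]


def adaboost_reweight(weights, misclassified):
    """AdaBoost: raise the weights of items the current learner got wrong,
    so the next learner is more likely to be trained on them. Each round
    depends on the previous one, which makes training sequential."""
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - error) / error)
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated]


if __name__ == "__main__":
    print(bagging_samples(n_items=5, n_learners=3))
    # Five items with uniform weights; only the last one was misclassified.
    print(adaboost_reweight([0.2] * 5, [False, False, False, False, True]))
```

In the second call the single misclassified item ends up carrying half of the total weight, which is what steers the next weak learner toward it.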
Bagging: A newly emerging concept of ensemble-based regression analysis founded on a filtering-based ensemble is seen to be superior to bootstrap aggregating, as studied in (Wei-Liang Tay et al 2012). Bagging has its own advantages when regression ensembles are pruned, a problem whose cost is exponential in the size of the ensemble. It is solved using semi-definite programming (SDP) or by modifying the order of aggregation (Daniel Hernández-Lobato et al 2011). Sub-ensembles obtained using either SDP or ordered aggregation usually outperform sub-ensembles obtained by other ensemble pruning methods and ensembles generated by Adaboost.

Adaboost: One of the meta techniques, the Adaboost algorithm, is extended with cost terms in the learning framework of (Yanmin Sun et al 2011), leading to the exploration of three models, one of which tallies with stagewise additive modelling to minimise the cost exponential loss. It thus yields an efficient algorithm for resolving the imbalanced data problem. Adaboost can incorporate SVM as its component classifier, as seen in (Xuchun Li et al 2008); this variant, known as AdaBoostSVM, outperforms counterparts whose component classifiers are Decision Trees and Neural Networks. It rests on the notion of a sequence of trained RBF SVMs whose kernel parameter is reduced progressively as the boosting iterations proceed.

Random Forest: Random Forest has wide application, and the ensemble classifier can be learned with resampled data (Akin Özçift et al 2011). Since a random forest is a forest of decision trees, its prediction is better than that of a single decision tree. Thirty classifier ensembles constructed on the RF algorithm proved to have an accuracy of 87.13%, as illustrated in (Akin Ozcift et al 2011). A new extension of random forest known as Dynamic Random Forests (DRF) is studied in (Simon Bernard et al 2012). It is based on an adaptive tree induction procedure such that each tree complements the existing trees in the RF as much as possible. This is done through resampling of the training data and a boosting algorithm, and it is found to produce more promising results than the conventional RF. Another new version of RF is random survival forests (Hemant Ishwaran et al 2010). Consistency of the new method is proved under general splitting rules, bootstrapping and random selection of variables. It is proved that the forest ensemble survival function converges uniformly.

Decorate: The Decorate method constructs diverse learners by using artificial data. It works well in cases of missing features, classification noise and feature noise, as observed in (Prem Melville et al 2004). Decorate outperforms Bagging and Adaboost in the cases mentioned above. Decorate effectively decreases the error of the base learner.

Combining Methods
Combining methods are employed to evaluate the ensemble of predictions and specify one final result. Various combining methods from the literature are evaluated in (Lior Rokach et al 2010) and are as follows.

Uniform Voting: In uniform voting, each classifier has the same weight. An unlabeled instance is classified according to the class that obtains the highest number of votes. Mathematically it can be written as:

$$\mathrm{Class}(x) = \arg\max_{c_i \in \mathrm{dom}(y)} \sum_{\forall k:\; c_i = \arg\max_{c_j \in \mathrm{dom}(y)} \hat{P}_{M_k}(y = c_j \mid x)} 1$$

where $M_k$ denotes classifier $k$ and $\hat{P}_{M_k}(y = c \mid x)$ denotes the probability of $y$ obtaining the value $c$ given an instance $x$.

Distribution Summation: The idea behind distribution summation is to sum up the conditional probability vectors obtained from each classifier. The selected class is chosen according to the highest value in the total vector. Mathematically, it can be written as:

$$\mathrm{Class}(x) = \arg\max_{c_i \in \mathrm{dom}(y)} \sum_{k} \hat{P}_{M_k}(y = c_i \mid x)$$

Bayesian Combination: This combining method was investigated by Buntine (1990). The weight associated with each classifier is the posterior probability of the classifier given the training set.

$$\mathrm{Class}(x) = \arg\max_{c_i \in \mathrm{dom}(y)} \sum_{k} P(M_k \mid S) \cdot \hat{P}_{M_k}(y = c_i \mid x)$$

where $P(M_k \mid S)$ denotes the probability that the classifier $M_k$ is correct given the training set $S$. The estimation of $P(M_k \mid S)$ depends on the classifier's representation.
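The uniform voting, distribution summation and Bayesian combination rules described in this section can be sketched in code as follows. This is a minimal illustration, assuming that each classifier outputs a class-probability dictionary for an instance; the function names and the toy numbers are hypothetical and are not taken from (Lior Rokach et al 2010).

```python
from collections import Counter


def uniform_voting(prob_vectors):
    """Each classifier votes for its most probable class; the class with
    the most votes wins."""
    votes = Counter(max(p, key=p.get) for p in prob_vectors)
    return votes.most_common(1)[0][0]


def weighted_summation(prob_vectors, weights=None):
    """Sum the class-probability vectors and pick the largest total.

    With weights=None this is distribution summation. With weights set to
    an estimate of P(M_k | S) for each classifier it becomes the Bayesian
    combination rule."""
    if weights is None:
        weights = [1.0] * len(prob_vectors)
    totals = Counter()
    for w, p in zip(weights, prob_vectors):
        for label, prob in p.items():
            totals[label] += w * prob
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Toy outputs of three classifiers for one protein pair.
    outputs = [
        {"pos": 0.45, "neg": 0.55},
        {"pos": 0.40, "neg": 0.60},
        {"pos": 0.95, "neg": 0.05},
    ]
    print(uniform_voting(outputs))
    print(weighted_summation(outputs))
    print(weighted_summation(outputs, [0.45, 0.45, 0.10]))
```

On this toy input the rules disagree: two of the three classifiers vote "neg", while the summed probability mass favours "pos" because the third classifier is very confident. Down-weighting that third classifier, as the Bayesian rule would if its posterior were low, switches the decision back to "neg".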
Dempster-Shafer: The idea of using the Dempster-Shafer theory of evidence (Buchanan and Shortliffe, 1984) for combining models has been suggested by Shilen (1992). This method uses the notion of basic probability assignment defined for a certain class $c_i$ given the instance $x$:

$$\mathrm{bpa}(c_i, x) = 1 - \prod_{k} \left(1 - \hat{P}_{M_k}(y = c_i \mid x)\right)$$

Subsequently, the selected class is the one that maximizes the value of the belief function:

$$\mathrm{Bel}(c_i, x) = \frac{1}{A} \cdot \frac{\mathrm{bpa}(c_i, x)}{1 - \mathrm{bpa}(c_i, x)}$$

where $A$ is a normalization factor defined as:

$$A = \sum_{\forall c_i \in \mathrm{dom}(y)} \frac{\mathrm{bpa}(c_i, x)}{1 - \mathrm{bpa}(c_i, x)} + 1$$

Naive Bayes: Using Bayes' rule, one can extend the Naive Bayes idea for combining various classifiers:

$$\mathrm{Class}(x) = \arg\max_{c_j \in \mathrm{dom}(y),\; \hat{P}(y = c_j) > 0} \hat{P}(y = c_j) \cdot \prod_{k} \frac{\hat{P}_{M_k}(y = c_j \mid x)}{\hat{P}(y = c_j)}$$

Entropy Weighting: Entropy weighting gives each classifier a weight that is inversely proportional to the entropy of its classification vector.

$$\mathrm{Class}(x) = \arg\max_{c_i \in \mathrm{dom}(y)} \sum_{k:\; c_i = \arg\max_{c_j \in \mathrm{dom}(y)} \hat{P}_{M_k}(y = c_j \mid x)} E(M_k, x)$$

where:

$$E(M_k, x) = -\sum_{c_j} \hat{P}_{M_k}(y = c_j \mid x) \log \hat{P}_{M_k}(y = c_j \mid x)$$

Logarithmic Opinion Pool: According to the logarithmic opinion pool (Hansen, 2000), the selection of the preferred class is performed according to:

$$\mathrm{Class}(x) = \arg\max_{c_j \in \mathrm{dom}(y)} e^{\sum_{k} \alpha_k \cdot \log \hat{P}_{M_k}(y = c_j \mid x)}$$

where $\alpha_k$ denotes the weight of the $k$-th classifier, such that:

$$\alpha_k \ge 0; \quad \sum_{k} \alpha_k = 1$$

Comparative Study
Random Forest performs well in the case of high-dimensional data. An enhancement of (Yongqing Zhang et al 2012) can therefore be proposed, in which the under-sampling technique at the data level and Random Forest at the algorithmic level are integrated to obtain a better prediction. Feature selection can be done through the auto covariance method, and the base learners can be SVM and ANN as in (Yongqing Zhang et al 2012). The random forest is a combination of all the decision trees built after randomising the data sets. As stated in (Rich Caruana et al 2008), Figure 2 suggests the performance of Random Forest in high dimensions. The study of (Rich Caruana et al 2008) thus supports the capability and compatibility of Random Forest, so choosing it for (Yongqing Zhang et al 2012) seems a more efficient solution than the Bagging and Adaboost methods.

FIG. 2 MOVING AVERAGE SCORE OF EACH LEARNING ALGORITHM AGAINST DIMENSIONS

Conclusion
Numerous solutions to the imbalanced data problem are thoroughly studied in this paper. These solutions have been classified under various levels, such as the data level and the algorithmic level. A detailed study of one paper led to the conclusion that there is scope for replacing Bagging and Adaboost with the Random Forest method, as it can deal with high-dimensional data very well, based on the extensive study made in this domain. As future work, a comparative evaluation of ensembles of ensemble classifiers on high-dimensional data can be carried out.

REFERENCES
Akin Özçift, May 2011, Random forests ensemble classifier trained with data resampling strategy to improve cardiac
arrhythmia diagnosis, Computers in Biology and Medicine, Volume 41, Issue 5, Pages 265-271.
Akin Ozcift, Arif Gulten, December 2011, Classifier ensemble construction with rotation forest to improve medical diagnosis performance of machine learning algorithms, Computer Methods and Programs in Biomedicine, Volume 104, Issue 3, Pages 443-451.
Alberto Fernández, María José del Jesus, Francisco Herrera, August 2009, On the influence of an adaptive inference system in fuzzy rule based classification systems for imbalanced data-sets, Expert Systems with Applications, Volume 36, Issue 6, Pages 9805-9812.
Asifullah Khan, Abdul Majid, Maqsood Hayat, August 2011, CE-PLoc: An ensemble classifier for predicting protein subcellular locations by fusing different modes of pseudo amino acid composition, Computational Biology and Chemistry, Volume 35, Issue 4, Pages 218-229.
Chan-Yun Yang, Jr-Syu Yang, Jian-Jun Wang, December 2009, Margin calibration in SVM class-imbalanced learning, Neurocomputing, Volume 73, Issues 1-3, Pages 397-411.
Charles Elkan, 2001, The Foundations of Cost-Sensitive Learning, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01).
Chien-I Lee, Cheng-Jung Tsai, Tong-Qin Wu, Wei-Pang Yang, May 2008, An approach to mining the multi-relational imbalanced database, Expert Systems with Applications, Volume 34, Issue 4, Pages 3021-3032.
Chou-Yuan Lee, Zne-Jung Lee, August 2012, A novel algorithm applied to classify unbalanced data, Applied Soft Computing, Volume 12, Issue 8, Pages 2481-2485.
Daniel Hernández-Lobato, Gonzalo Martínez-Muñoz, Alberto Suárez, June 2011, Empirical analysis and evaluation of approximate techniques for pruning regression bagging ensembles, Neurocomputing, Volume 74, Issues 12-13, Pages 2250-2264.
Der-Chiang Li, Chiao-Wen Liu, Susan C. Hu, May 2010, A learning method for the class imbalance problem with medical data sets, Computers in Biology and Medicine.
Erika Antal, Yves Tillé, January 2011, Simple random sampling with over-replacement, Journal of Statistical Planning and Inference, Volume 141, Issue 1, Pages 597-601.
Francisco Fernández-Navarro, César Hervás-Martínez, Pedro Antonio Gutiérrez, August 2011, A dynamic over-sampling procedure based on sensitivity for multi-class problems, Pattern Recognition, Volume 44, Issue 8, Pages 1821-1833.
Godfried T Toussaint, August 1994, A counterexample to Tomek's consistency theorem for a condensed nearest neighbor decision rule, Pattern Recognition Letters.
Hemant Ishwaran, Udaya B. Kogalur, July 2010, Consistency of random survival forests, Statistics & Probability Letters, Volume 80, Issues 13-14, Pages 1056-1064.
Hung-Yi Lin, June 2012, Efficient classifiers for multi-class classification problems, Decision Support Systems.
Jae Pil Hwang, Seongkeun Park, Euntai Kim, July 2011, A new weighted approach to imbalanced data classification problem via support vector machine with quadratic cost function, Expert Systems with Applications, Volume 38, Issue 7, Pages 8580-8585.
Jin Xiao, Ling Xie, Changzheng He, Xiaoyi Jiang, 15 February 2012, Dynamic classifier ensemble model for customer classification with imbalanced class distribution, Expert Systems with Applications, Volume 39, Issue 3, Pages 3668-3675.
Jinghua Wang, Jane You, Qin Li, Yong Xu, March 2012, Extract minimum positive and maximum negative features for imbalanced binary classification, Pattern Recognition, Volume 45, Issue 3, Pages 1136-1145.
Larry Shoemaker, Robert E. Banfield, Lawrence O. Hall, Kevin W. Bowyer, W. Philip Kegelmeyer, January 2008, Using classifier ensembles to label spatially disjoint data, Information Fusion, Volume 9, Issue 1, Pages 120-133.
Lior Rokach, 2010, Ensemble methods for classifiers, Data Mining and Knowledge Discovery Handbook.
M.A.H. Farquad, Indranil Bose, Preprocessing unbalanced data using support vector machine, Decision Support Systems, Volume 53, Issue 1, April 2012, Pages 226-233.
María Dolores Pérez-Godoy, Alberto Fernández, Antonio Jesús Rivera, María José del Jesus, November 2010, Analysis of an evolutionary RBFN design algorithm, CO2RBFN, for imbalanced data sets, Pattern Recognition Letters, Volume 31, Issue 15, Pages 2375-2388.
Mikel Galar, Alberto Fernández, Edurne Barrenechea, Humberto Bustince, Francisco Herrera, August 2011, An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes, Pattern Recognition, Volume 44, Issue 8, Pages 1761-1776.
Ming Gao, Xia Hong, Sheng Chen, Chris J. Harris, October 2011, A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems, Neurocomputing.
Mu-Chen Chen, Long-Sheng Chen, Chun-Chin Hsu, Wei-Rong Zeng, August 2008, An information granulation based data mining approach for classifying imbalanced data, Information Sciences, Volume 178, Issue 16, Pages 3214-3227.
Muhammad Atif Tahir, Josef Kittler, Fei Yan, October 2012, Inverse random under sampling for class imbalance problem and its application to multi-label classification, Pattern Recognition, Volume 45, Issue 10, Pages 3738-3750.
Nicolás García-Pedrajas, César García-Osorio, January 2011, Constructing ensembles of classifiers using supervised projection methods based on misclassified instances, Expert Systems with Applications, Volume 38, Issue 1, Pages 343-359.
Noritaka Shigei, Hiromi Miyajima, Michiharu Maeda, Lixin Ma, December 2009, Bagging and AdaBoost algorithms for vector quantization, Neurocomputing, Volume 73, Issues 1-3, Pages 106-114.
Paolo Soda, August 2011, A multi-objective optimisation approach for class imbalance learning, Pattern Recognition, Volume 44, Issue 8, Pages 1801-1810.
Pilsung Kang, Sungzoon Cho, Douglas L. MacLachlan, June 2012, Improved response modeling based on clustering, under-sampling, and ensemble, Expert Systems with Applications, Volume 39, Issue 8, Pages 6738-6753.
Pooja Jain, Jonathan M. Garibaldi, Jonathan D. Hirst, June 2009, Supervised machine learning algorithms for protein structure classification, Computational Biology and Chemistry, Volume 33, Issue 3, Pages 216-223.
Prem Melville, Nishit Shah, Lilyana Mihalkova, Raymond J. Mooney, June 2004, Experiments on Ensembles with Missing and Noisy Data, Proceedings of 5th International Workshop on Multiple Classifier Systems (MCS-2004), LNCS Vol. 3077, pp. 293-302, Cagliari, Italy, Springer-Verlag.
Raskutti, B., Kowalczyk, A., 2004. Extreme re-balancing for SVMs: a case study. ACM SIGKDD Explorations Newsletter 6, 60-69.
Rich Caruana, Nikos Karampatziakis, Ainur Yessenalina, 2008, An Empirical Evaluation of Supervised Learning in High Dimensions, Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008.
Rongsheng Gong, Samuel H. Huang, May 2012, A Kolmogorov-Smirnov statistic based segmentation approach to learning from imbalanced datasets: With application in property refinance prediction, Expert Systems with Applications, Volume 39, Issue 6, Pages 6192-6200.
Salvador García, Joaquín Derrac, Isaac Triguero, Cristóbal J. Carmona, Francisco Herrera, February 2012, Evolutionary-based selection of generalized instances for imbalanced classification, Knowledge-Based Systems.
Show-Jane Yen, Yue-Shi Lee, April 2009, Cluster-based under-sampling approaches for imbalanced data distributions, Expert Systems with Applications, Volume 36, Issue 3, Part 1, Pages 5718-5727.
Shuxue Zou, Yanxin Huang, Yan Wang, Chunguang Zhou, September 2008, A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy, Journal of Bionic Engineering, Volume 5, Issue 3, Pages 215-223.
Simon Bernard, Sébastien Adam, Laurent Heutte, September 2012, Dynamic Random Forests, Pattern Recognition Letters, Volume 33, Issue 12, Pages 1580-1586.
Todd W. Schiller, Yixin Chen, Issam El Naqa, Joseph O. Deasy, June 2010, Modeling radiation-induced lung injury risk with an ensemble of support vector machines, Neurocomputing, Volume 73, Issues 10-12, Pages 1861-1867.
Victoria López, Alberto Fernández, Jose G. Moreno-Torres, Francisco Herrera, June 2012, Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics, Expert Systems with Applications, Volume 39, Issue 7, Pages 6585-6608.
Wei-Liang Tay, Chee-Kong Chui, Sim-Heng Ong, Alvin Choong-Meng Ng, August 2012, Ensemble-based regression analysis of multimodal medical data for osteopenia diagnosis, Expert Systems with Applications.
Xuchun Li, Lei Wang, Eric Sung, August 2008, AdaBoost with SVM-based component classifiers, Engineering Applications of Artificial Intelligence, Volume 21, Issue 5, Pages 785-795.
Yang Liu, Xiaohui Yu, Jimmy Xiangji Huang, Aijun An, July 2011, Combining integrated sampling with SVM ensembles for learning from imbalanced datasets, Information Processing & Management, Volume 47, Issue 4, Pages 617-631.
Yang Yong, 2012, The Research of Imbalanced Data Set of Sample Sampling Method Based on K-Means Cluster and Genetic Algorithm, Energy Procedia, Volume 17, Part A, Pages 164-170.
Yanmin Sun, Mohamed S. Kamel, Andrew K.C. Wong, Yang Wang, December 2011, Cost-sensitive boosting for classification of imbalanced data, Pattern Recognition.
Yok-Yen Nguwi, Siu-Yeung Cho, An unsupervised self-organizing learning with support vector ranking for imbalanced datasets, Expert Systems with Applications, Volume 37, Issue 12, Pages 8303-8312, December 2010.
Yongqing Zhang, Danling Zhang, Gang Mi, Daichuan Ma, Gongbing Li, Yanzhi Guo, Menglong Li, Min Zhu, February 2012, Using ensemble methods to deal with imbalanced data in predicting protein-protein interactions, Computational Biology and Chemistry, Volume 36, Pages 36-41.
Young-Won Kim, Il-Seok Oh, April 2008, Classifier ensemble selection using hybrid genetic algorithms, Pattern Recognition Letters, Volume 29, Issue 6, Pages 796-802.
Zhuangyuan Zhao, Ping Zhong, Yaohong Zhao, August 2011, Learning SVM with weighted maximum margin criterion for classification of imbalanced data, Mathematical and Computer Modelling, Volume 54, Issues 3-4, Pages 1093-1099.

