
Calibrating SVMs with V -fold penalization

A. Saumard, S. Clémençon
LTCI - Telecom ParisTech
46 rue Barrault
75013 Paris, France
first.last@telecom-paristech.fr
N. Baskiotis, N. Usunier
LIP6 - Université Pierre et Marie Curie
4, place Jussieu
75005 Paris, France
first.last@lip6.fr
Abstract
We tackle the problem of optimally tuning the kernel Support Vector Machine
(SVM), with respect to both the regularization and kernel parameters. In practice,
this task is traditionally solved via V-fold cross-validation (VFCV), which gives
rather efficient results for a reasonable computational cost. Recently, a new pe-
nalization procedure, called V-fold penalization, has been proposed in order to
improve on VFCV, both concerning the choice of V and the accuracy at a fixed
V. We report on experiments where V-fold penalization shows results compara-
ble to VFCV for SVM calibration, on both simulated and real-world data.
Moreover, we examine another selection rule, a BIC-like procedure built on
V-fold penalization, of a type generally used for identification. In some situations
this rule shows the best performance of the three methods considered, typically
when the approximation error of the kernel is small for a reasonable value of the
kernel parameter.
1 Introduction
The tuning of hyperparameters is a fundamental problem in machine learning. From a global per-
spective, there exist essentially two major approaches to model selection: methods related to data
splitting, such as cross-validation ([2]) and its variants, and methods related to penalization of the
empirical risk, in particular the Structural Risk Minimization principle ([10]). V-fold cross-
validation (VFCV) is widely used in machine learning practice, due to its (relative) computational
tractability and to much empirical evidence of its good behavior. However, there exist only a few
theoretical justifications of VFCV ([2]), and the question of an automatic choice of V remains
widely open ([5]).
Combining the robustness of cross-validation estimates with the theoretical guarantees of penaliza-
tion procedures, a new type of general-purpose penalization procedure, called V-fold penalization,
has recently been proposed ([1]). Both empirical and mathematical evidence of its efficiency have
been given in a heteroscedastic regression framework with random design, for the selection of
finite-dimensional histogram models.
In this paper, our aim is to study experimentally V-fold penalization techniques in the context of
hyperparameter tuning for SVMs. We first present the context and the methods we studied, and
then present some preliminary experimental results.
2 V -Fold Cross-Validation and V -Fold Penalization
Notations Let $\mathcal{X} \times \mathcal{Y}$ be a measurable space endowed with an unknown probability measure $P$,
with $\mathcal{Y} = \{-1, 1\}$ and $\mathcal{X} \subset \mathbb{R}^d$, $d \geq 1$. We assume that we observe $n$ i.i.d. labelled observations
$(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathcal{Y}$, and a generic random variable, independent of the data, is de-
noted by $(X, Y)$. The primary goal of the SVM is to find a classifier $s$, that is, a function from $\mathcal{X}$ to
$\mathcal{Y}$, with a classification error $P(Y \neq s(X))$ as small as possible.
We focus in this paper on the calibration of the soft-margin SVM algorithm with Gaussian RBF
kernels $k_\gamma(x, x') = e^{-\gamma \|x - x'\|^2}$. The Gaussian RBF kernel is certainly the most commonly used
kernel, and Boser, Guyon and Vapnik suggested its widespread use ([4]).
Now, by setting $\tilde{\mathcal{H}}_{k_\gamma} = \left\{ x \in \mathcal{X} \mapsto s(x) + b \; ; \; s \in \mathcal{H}_{k_\gamma}, \, b \in \mathbb{R} \right\}$, where real-valued biases are
added to the functions of the RKHS $\mathcal{H}_{k_\gamma}$, the soft-margin SVM algorithm can be stated as
$$\hat{s}_n(\lambda, \gamma) \in \arg\min_{f = s + b \, \in \, \tilde{\mathcal{H}}_{k_\gamma}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left(1 - Y_i f(X_i)\right)_+ + \lambda \|s\|_{k_\gamma}^2 \right\},$$
for some positive regularization parameter $\lambda$, where $(t)_+ = \max(t, 0)$. In order to predict labels in
$\mathcal{Y}$, the output $\hat{s}_n(\lambda, \gamma)$ is then transformed into a classifier $\hat{s}(\lambda, \gamma)$ defined by
$$\hat{s}(\lambda, \gamma) = \mathrm{sign}\left(\hat{s}_n(\lambda, \gamma)\right).$$
As highlighted for instance in Hastie et al. ([7]), for each choice of a kernel $k_\gamma$, the optimal value of
the regularization parameter $\lambda$ can change significantly and depends on the data at hand. To ensure
the predictive performance of the SVM algorithm, the regularization parameter, as well as the kernel,
should thus be calibrated in each application. Finding the optimal pair $(\lambda^*, \gamma^*)$ (with respect to the
classification error) is a model selection task. The method usually employed in practice for this kind
of goal is V-fold cross-validation (VFCV). We first recall the criterion of VFCV before presenting
V-fold penalization procedures.
V-Fold Cross-Validation VFCV is a data-splitting method and certainly the most commonly used
cross-validation rule in practice.
In VFCV, the data is first partitioned into $V$ subsamples $(B_1, \ldots, B_V)$ of approximately $n/V$
data points each. For each data split, the SVM estimator $\hat{s}_n^{(j)}(\lambda, \gamma)$ is learned on $\bigcup_{k \neq j} B_k$ and tested on
$B_j$. More precisely, the V-fold cross-validation criterion is
$$\mathrm{crit}_{\mathrm{VFCV}}(\lambda, \gamma) = \frac{1}{V} \sum_{j=1}^{V} \mathrm{err}^{(j)}\left(\hat{s}^{(j)}(\lambda, \gamma)\right),$$
where $\mathrm{err}^{(j)}(f) = \frac{1}{2|B_j|} \sum_{k \in B_j} |Y_k - f(X_k)|$ for any $f : \mathcal{X} \to \{-1, 1\}$. The selected hyperpa-
rameters $(\hat{\lambda}, \hat{\gamma})$ are given by
$$(\hat{\lambda}, \hat{\gamma}) = \arg\min_{(\lambda, \gamma)} \left\{ \mathrm{crit}_{\mathrm{VFCV}}(\lambda, \gamma) \right\}. \quad (1)$$
Moreover, once the parameters $(\hat{\lambda}, \hat{\gamma})$ are selected, the final output is trained on the whole
training sample with these selected parameters.
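A possible implementation of the selection rule (1) is sketched below, reusing the hypothetical fit_svm helper from above; the hyperparameter grids lams and gammas are placeholders to be chosen by the user.

    # Sketch of crit_VFCV (equation (1)): for +/-1 labels, the hold-out error
    # err^(j) is simply the misclassification rate on block B_j.
    from sklearn.model_selection import KFold

    def crit_vfcv(X, y, lam, gamma, V=5):
        errs = []
        for train_idx, test_idx in KFold(n_splits=V, shuffle=True).split(X):
            clf = fit_svm(X[train_idx], y[train_idx], lam, gamma)
            errs.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))
        return np.mean(errs)

    def select_vfcv(X, y, lams, gammas, V=5):
        """Minimize crit_VFCV over the grid, then refit on the whole sample."""
        lam, gamma = min(((l, g) for l in lams for g in gammas),
                         key=lambda p: crit_vfcv(X, y, p[0], p[1], V))
        return fit_svm(X, y, lam, gamma), (lam, gamma)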
Despite the generality of the heuristic underlying the procedure, the V-fold cross-validation method
for model selection has two main drawbacks. First, at a fixed value of $V$, the procedure is known
to be asymptotically suboptimal, in the sense that the risk of the selected estimator is not equivalent
to the risk of the oracle when the number of data points tends to infinity ([1]). In particular, the
expectation of the VFCV criterion overestimates the expectation of the true risk, and this bias
should decrease as $V$ increases. Second, the best CV estimator of the risk is not necessarily the
best model selection procedure: Breiman and Spector ([5]) highlight, in a regression framework,
that leave-one-out (LOO) is the best estimator of the risk, whereas 10-fold cross-validation is more
efficient for model selection purposes.
V-Fold Penalization Penalization is also a fully natural strategy for hyperparameter tuning.
Penalization approaches can be viewed as selecting the hyperparameters according to
$$(\hat{\lambda}, \hat{\gamma}) := \arg\min_{(\lambda, \gamma)} \left\{ \mathrm{err}_n\left(\hat{s}(\lambda, \gamma)\right) + \mathrm{pen}(\lambda, \gamma) \right\},$$
where $\mathrm{err}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} |Y_i - f(X_i)|$ for $f : \mathcal{X} \to \{-1, 1\}$ and $\mathrm{pen}(\lambda, \gamma)$ is a penalty term.
A good penalty in terms of prediction is one that gives an accurate estimate of the ideal penalty,
defined as the difference between the true error and the empirical error.
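In code, the penalized rule takes the following generic shape, again on the hypothetical fit_svm helper; `pen` stands for any penalty function of the hyperparameters, such as the V-fold penalty defined next.

    # Generic penalized selection: minimize err_n(hat{s}(lambda, gamma)) + pen.
    def err_n(clf, X, y):
        """Empirical classification error on the full sample."""
        return np.mean(clf.predict(X) != y)

    def select_penalized(X, y, lams, gammas, pen):
        def crit(l, g):
            return err_n(fit_svm(X, y, l, g), X, y) + pen(l, g)
        lam, gamma = min(((l, g) for l in lams for g in gammas),
                         key=lambda p: crit(*p))
        return fit_svm(X, y, lam, gamma), (lam, gamma)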
The central idea of the V-fold penalties recently proposed by Arlot ([1]) is to estimate the ideal
penalty directly by a subsampled version of it. For some constant $C_V \geq V - 1$, the V-fold penalty is
$$\mathrm{pen}_{\mathrm{VF}}(\lambda, \gamma) := \frac{C_V}{V} \sum_{j=1}^{V} \left[ \mathrm{err}_n\left(\hat{s}^{(j)}(\lambda, \gamma)\right) - \mathrm{err}^{(j)}\left(\hat{s}^{(j)}(\lambda, \gamma)\right) \right]. \quad (2)$$
The V-fold penalty (2) indeed mimics the structure of the ideal penalty: the quantities related to the
unknown law $P$ of the data (respectively, to the empirical measure $P_n$) are replaced by quantities
related to the empirical measure $P_n$ (respectively, to the subsampling measures), in the same
analytic manner. The design of the V-fold penalties is thus an adaptation of Efron's resampling
heuristics ([6]) to the subsampling scheme of the V-fold procedure.
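The following sketch transcribes (2) directly, with err_n and err^{(j)} computed as defined above; fit_svm and err_n are the hypothetical helpers from the previous sketches, and the result can be plugged into select_penalized, e.g. select_penalized(X, y, lams, gammas, lambda l, g: pen_vf(X, y, l, g, V=5)).

    # Sketch of the V-fold penalty (2), with the default constant C_V = V - 1.
    def pen_vf(X, y, lam, gamma, V=5, C_V=None):
        if C_V is None:
            C_V = V - 1
        total = 0.0
        for train_idx, test_idx in KFold(n_splits=V, shuffle=True).split(X):
            clf = fit_svm(X[train_idx], y[train_idx], lam, gamma)
            e_n = err_n(clf, X, y)                                  # err_n(s^(j))
            e_j = np.mean(clf.predict(X[test_idx]) != y[test_idx])  # err^(j)(s^(j))
            total += e_n - e_j
        return (C_V / V) * total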
It should be emphasized that the constant $C_V$ in the definition of the V-fold penalty can be viewed
as a degree of freedom, which potentially allows one to control the bias of the resulting risk estimate
without varying the value of $V$; in VFCV, by contrast, the single choice of $V$ fixes simultaneously,
and in an intricate manner, both the bias and the variance of the risk estimate.
Overpenalization for Identification We propose in this paragraph to take advantage of this degree
of freedom: the definition of the V-fold penalty allows one to choose $C_V$, which however has to
be at least $V - 1$. There is thus room for overpenalization compared to the standard value
$C_V = V - 1$. Such a model selection strategy is classically employed for the identification task
([11]), which roughly consists in estimating the true model among a given collection.
Moreover, when a true model does not exist, this can be extended through the notion of the bias of a
model, defined as the difference between the minimum of the risk over the functions of the model
and the risk of the target, which is minimal among all possible functions. More precisely, in order
to justify the identification task, one generally assumes that the bias behaves as follows: it first
decreases with respect to model complexity until reaching the smallest model with the smallest
bias, and then more complex models have a constant bias, equal to the minimal one. Consequently,
overpenalization allows one to identify the smallest model among those achieving the minimum of
the biases.
Concerning the SVM algorithm, it is well known that the output $\hat{s}_n(\lambda, \gamma)$ is also the minimizer of
the empirical hinge loss over a ball of the RKHS $\mathcal{H}_{k_\gamma}$ whose radius decreases with respect to the
regularization parameter $\lambda$ ([3]). Therefore, the assumption of a model achieving the minimum of
the biases reduces here to assuming that, at a fixed kernel $k_\gamma$, there exists a radius of a ball in the
RKHS that achieves the minimum of the biases.
We argue that an identification procedure applied to model selection for the SVM will thus estimate
the smallest radius achieving the minimum of the biases, whenever it exists. Moreover, if this radius
is reasonably small, this should also give a good predictor in terms of classification error. Indeed,
when the bias assumption is satisfied, identification procedures are known to be efficient in terms
of prediction as well.
Let us now define our identification procedure more precisely. We use an analogy with the
classical density estimation framework, where the AIC criterion is designed for prediction whereas
BIC is designed for identification ([8]). Hence, as the overpenalizing factor from AIC to BIC is
$\ln(n)/2$, where $n$ is the number of data points, and as V-fold penalization with $C_V = V - 1$ is here
designed for prediction, we propose a BIC-like procedure using V-fold penalization, with the
following penalty:
$$\mathrm{pen}_{\mathrm{BIC}}(\lambda, \gamma) = \frac{\ln(n)}{2} \, \mathrm{pen}_{\mathrm{VF}}(\lambda, \gamma), \quad (3)$$
where $C_V = V - 1$ in $\mathrm{pen}_{\mathrm{VF}}(\lambda, \gamma)$.
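As a sketch on the same hypothetical helpers as above, the BIC-like criterion (3) only rescales the V-fold penalty:

    # BIC-like criterion (3): V-fold penalty with C_V = V - 1, inflated by ln(n)/2.
    import math

    def crit_penbic(X, y, lam, gamma, V=5):
        n = X.shape[0]
        clf = fit_svm(X, y, lam, gamma)
        return err_n(clf, X, y) + 0.5 * math.log(n) * pen_vf(X, y, lam, gamma, V,
                                                             C_V=V - 1)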
As previously emphasized, this strategy should give efficient results in terms of prediction in certain
cases where the estimated model has a reasonable radius. Moreover, the principle underlying
identification procedures is to overpenalize an unbiased estimator of the ideal penalty by a factor
$a_n$ satisfying $a_n \to +\infty$ and $a_n / n \to 0$ as $n \to +\infty$ ([9]). Here we have taken $a_n = \ln(n)/2$,
but the theory also predicts that other values of the overpenalization factor should work.
As all identification procedures aim at estimating the same model (the smallest ball achieving
the minimum of the biases), this suggests that, when the assumption on the minimal bias is satisfied,
the norm of the selected output of the SVM algorithm should be largely stable with respect to the
choice of the overpenalization factor $a_n$. Indeed, the norm of the output roughly corresponds to
the radius of the theoretical ball to be identified, and the latter does not depend on the specific value
of the overpenalization factor $a_n$.
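This stability can be probed numerically: for a fitted SVM, the RKHS norm of the output is available from the dual coefficients, so one can re-run the selection for several factors $a_n$ and compare the norms of the selected outputs. A sketch, with helper names that are ours:

    # RKHS norm of the SVM output s: ||s||^2 = sum_{i,j} a_i a_j k(x_i, x_j),
    # where a_i = y_i * alpha_i are the signed dual coefficients of the SVC.
    from sklearn.metrics.pairwise import rbf_kernel

    def rkhs_norm(clf, gamma):
        sv = clf.support_vectors_
        a = clf.dual_coef_.ravel()
        return float(np.sqrt(a @ rbf_kernel(sv, sv, gamma=gamma) @ a))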
3 Experiments
In order to study and compare the behavior of the three methods described above (VFCV (1), V-fold
penalization (2), and BIC-like V-fold penalization (3)), we conducted a set of experiments on
both synthetic and real-world data: a synthetic dataset used in Hastie et al. ([7]), and four databases,
Banana, Waveform, Splice and Rcv1, available at http://www.keerthis.com/comparison.html.
In our experiments, we simulated 10 realizations of n = 50, 100 or 200 data points, with V =
5, 10 or 50. Table 1 shows the (theoretical) classification error of the models selected by each method
on the simulated dataset. Table 2 shows the performance on the four other datasets.
Table 1 shows that the best-performing procedure is our BIC-like penalization, closely followed
by VFCV, and then V-fold penalization (with a level of penalization $C_V = V - 1$). The only
exception is $V = 10$ and $n = 100$, where VFCV has the smallest classification error, followed by
the two other methods, which obtain similar classification errors. We conjecture that this is due to
the need for a slight overpenalization in these prediction strategies in the nonasymptotic framework
(which would not be the same kind of overpenalization as the one required in the BIC-like
identification strategy). Indeed, while this slight overpenalization naturally occurs in VFCV due to
its bias in risk estimation at fixed $V$, V-fold penalization with $C_V = V - 1$ tends to be almost
unbiased. In order to improve the performance of V-fold penalization, a solution would be to
increase $C_V$ above this default choice, but the right amount of overpenalization seems quite hard
to adjust, due to the lack of a precise understanding of this issue.
Table 2 shows the superiority of VFCV over V-fold penalization, as noticed previously. We believe
that VFCV actually benefits from a nonasymptotic bias which makes the procedure robust against
underpenalization. In order to improve the performance of V-fold penalization, some heuristic
should be found for the slight increase needed in the level of penalization $C_V$.
As expected, the BIC-like V-fold penalization is sometimes rather inaccurate compared to the two
other model selection procedures (see the Splice and Rcv1 datasets) but can also perform efficiently
(as for the Banana and Waveform datasets). For instance, in the case of Rcv1, the inaccuracy of the
identification strategy can be explained by the fact that, on this example, the linear kernel is also
known to be well adapted, so that the Gaussian RBF kernel has to be very smooth (small values of
$\gamma$), and the bias issue is more or less irrelevant in this case.
Table 1: Classification error (mean ± standard deviation over the 10 realizations) of the three
studied methods on the simulated data.

                 Optimal       CrossV        PenVF         PenBIC
  n = 50
    V = 5        25.94±1.86    30.03±3.21    31.25±4.02    29.60±4.30
    V = 10       25.94±1.86    28.74±2.80    28.75±3.23    28.25±2.54
    V = 50       25.94±1.86    28.51±1.89    29.40±2.66    28.65±2.00
  n = 100
    V = 5        24.06±2.00    26.60±2.33    27.17±2.19    26.63±2.51
    V = 10       24.06±2.00    26.19±1.77    26.53±2.31    26.47±2.63
    V = 50       24.06±2.00    26.91±2.75    27.00±2.73    26.21±2.83
  n = 200
    V = 5        22.87±1.16    24.84±2.74    26.71±3.72    24.63±2.58
    V = 10       22.87±1.16    25.45±3.20    25.66±3.30    24.80±2.42
    V = 50       22.87±1.16    25.41±3.21    25.47±3.23    24.79±2.90
Table 2: Classification test errors on the four benchmark datasets.

                     Banana          Waveform        Splice          Rcv1
  n     Method      V=5    V=10     V=5    V=10     V=5    V=10     V=5    V=10
  50    CrossV     17.18  14.48    15.81  15.69    24.86  24.39    26.03  25.83
        PenBIC     16.89  17.93    15.93  15.62    24.94  24.67    28.63  32.33
        PenVF      18.27  16.56    15.49  15.66    24.94  24.40    26.88  30.79
  100   CrossV     14.22  14.43    13.22  12.71    21.96  22.46    23.62  23.23
        PenBIC     13.73  13.80    17.91  14.16    28.24  24.58    27.12  26.93
        PenVF      15.11  15.27    20.48  12.62    26.44  21.95    23.59  23.23
  200   CrossV     11.36  12.61    11.24  11.34    16.59  16.52    15.39  15.37
        PenBIC     11.48  11.99    12.26  12.38    19.43  20.79    21.64  15.37
        PenVF      12.53  12.13    11.78  11.74    16.67  16.60    21.64  15.37
4 Conclusion
We have presented two new procedures for tuning the hyperparameters involved in the kernel
SVM, and compared them to VFCV on both simulated and benchmark data.
On the one hand, the so-called V-fold penalization originally proposed in [1] is a general-purpose
penalization procedure that aims at improving on VFCV by correcting its bias, and the procedure
is proved in [1] to be asymptotically optimal in a regression setting. However, the datasets that we
consider have small or medium numbers of observations, and in this case the bias of VFCV can be
viewed as a slight overpenalization that helps avoid underpenalization. Thus, VFCV almost
systematically performs better than V-fold penalization in our experiments. Despite this advantage
for VFCV, it should be noted that V-fold penalization also provides a rather efficient calibration
of the SVM hyperparameters, and its performance generally stays within 1% of the VFCV perfor-
mance for the same choice of V. One can therefore hope to improve the performance of V-fold
penalization, possibly making it competitive with VFCV in a nonasymptotic framework, through a
better understanding of the right level of penalization it requires.
On the other hand, we have designed a BIC-like procedure based on V-fold penalization, and we
highlight that such a procedure can improve on VFCV, even in terms of prediction. This is
promising, but the informal criterion we presented, which relies on the stability of the selected
estimators, still has to be investigated further in order to become a precise decision tool between
prediction and identification strategies for model selection.
References
[1] Sylvain Arlot. V-fold cross-validation improved: V-fold penalization, February 2008. arXiv:0802.0566v2.
[2] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40-79, 2010.
[3] Gilles Blanchard, Olivier Bousquet, and Pascal Massart. Statistical performance of support vector machines. Annals of Statistics, 36(2):489-531, 2008.
[4] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 144-152, New York, NY, USA, 1992. ACM.
[5] Leo Breiman and Philip Spector. Submodel selection and evaluation in regression: the X-random case. International Statistical Review, 60(3):291-319, 1992.
[6] Bradley Efron. Bootstrap methods: another look at the jackknife. Annals of Statistics, 7(1):1-26, 1979.
[7] Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391-1415, 2004.
[8] Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461-464, 1978.
[9] Jun Shao. An asymptotic theory for linear model selection. Statistica Sinica, 7(2):221-264, 1997. With comments and a rejoinder by the author.
[10] V. N. Vapnik and A. Ya. Chervonenkis. Ordered risk minimization. Automation and Remote Control, 35:1226-1235, 1974.
[11] Yuhong Yang. Consistency of cross validation for comparing regression procedures. Annals of Statistics, 35(6):2450-2473, 2007.
