the test set is just a sample: to estimate err_true it must be representative (drawn i.i.d. from the problem distribution)
i.i.d. = independently, identically distributed
note: it is advisable to randomize the order of the examples, in case there is any bias/trend
Confidence intervals
how accurate is our estimate of accuracy?
best estimate of err_true = err_sample, but could vary due to sampling error
variance of the estimate should shrink as the test set grows
can use the Binomial distribution and the Law of Large Numbers to show that the standard deviation of e_t is s = sqrt(e_s(1 - e_s)/n)
thus a confidence interval can be constructed:
e_s - 1.96s ≤ e_t ≤ e_s + 1.96s   (with 95% prob.; α = 0.05)
valid when: n ≥ 30 or np(1-p) ≥ 5
example
suppose a decision tree makes mistakes on 15% of 200 independent test examples
then a 95% C.I. on accuracy is 85.0% ± 4.9, or [80.1, 89.9]
running more test examples could tighten this
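A minimal sketch of this interval in Python (the function name accuracy_ci is ours, not from the notes):

import math

def accuracy_ci(acc, n, z=1.96):
    # s = sqrt(acc * (1 - acc) / n), the std. dev. of the sample accuracy
    s = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * s, acc + z * s

lo, hi = accuracy_ci(0.85, 200)
print(lo, hi)   # ~0.801, ~0.899, i.e. 85.0% +/- 4.9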
cross-validation: partition the data into k folds B_i; train classifier T_i on everything except B_i
test each classifier T_i on the hold-out set B_i
can calculate the mean and standard error of the accuracies
this is 10-fold CV, but you could do 5x or 20x
because of the small number of folds, use t_{α/2,N-1} · S.E. for the C.I. (sketched below)
t_{α/2,N-1} from Student's t-distribution
advantage: the variance better reflects differences due to sampling
disadvantage: of course, each classifier is only trained on 90% of the data; could be sensitive to variation due to training sets (stability of the classifier)
note: test sets are distinct, but training sets overlap
variations: leave-one-out, 5x2 CV, bootstrapping...
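A minimal Python sketch of the t-based C.I. over fold accuracies (the fold accuracies below are made up for illustration; scipy is assumed available):

import math
from scipy import stats

def cv_confidence_interval(fold_accs, alpha=0.05):
    # mean accuracy across the k folds
    k = len(fold_accs)
    mean = sum(fold_accs) / k
    # sample std. dev. across folds, then standard error of the mean
    sd = math.sqrt(sum((a - mean) ** 2 for a in fold_accs) / (k - 1))
    se = sd / math.sqrt(k)
    # t_{alpha/2, k-1} replaces 1.96 because the number of folds is small
    t = stats.t.ppf(1 - alpha / 2, df=k - 1)
    return mean - t * se, mean + t * se

print(cv_confidence_interval([0.92, 0.87, 0.89, 0.90, 0.91,
                              0.88, 0.90, 0.86, 0.93, 0.89]))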
Comparing Classifiers
if you measure sample errors of La and Lb independently... hypothesis testing (statistics):
null hypothesis: La and Lb are effectively equivalent, and any performance differences are due to sampling
alternative hypothesis:
two-sided: err(La) ≠ err(Lb); one-sided: err(La) < err(Lb)
α = 0.05 is the max probability of a Type I error (rejecting the null hypothesis when it is actually true, i.e. concluding the classifiers are different when they are not)
need to calculate the difference d = err(La) - err(Lb) and estimate the variance of the difference
variance of the diff is the sum of the variances, weighted by the sizes of the test sets: σ² = err(La)(1-err(La))/n_a + err(Lb)(1-err(Lb))/n_b
the difference in accuracy is significant if 0 is not within d ± 1.96σ (or whatever z is appropriate for α)
not very sensitive for comparing classifiers
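A sketch of this unpaired z-test in Python (z_test_diff is our name for it):

import math

def z_test_diff(acc_a, n_a, acc_b, n_b):
    # variance of the difference = sum of the two variances
    d = acc_a - acc_b
    s = math.sqrt(acc_a * (1 - acc_a) / n_a + acc_b * (1 - acc_b) / n_b)
    return d / s   # compare against the z for the chosen alpha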
if you run head-to-head comparisons between La and Lb, say on the same CV folds, you can do a paired t-test
increases the sensitivity of the comparison
calculate d_i = err(La) - err(Lb) on each of the k = 10 folds
calculate the mean d̄ and std. dev. of d (below)
look up t in Student's t-distribution (see wiki) for α and dof = k - 1
the difference is significant if 0 is not within d̄ ± t_{α/2,k-1} · SE(d̄)
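A sketch of the paired test in Python, following the formulas above; here the per-fold differences are taken on accuracies, as in the table further below (scipy.stats.ttest_rel computes essentially the same statistic and can serve as a cross-check):

import math
from scipy import stats

def paired_t(acc_a, acc_b, alpha=0.05):
    # per-fold differences d_i between the two classifiers
    k = len(acc_a)
    d = [a - b for a, b in zip(acc_a, acc_b)]
    d_bar = sum(d) / k
    # standard error of d_bar: sqrt( 1/(k(k-1)) * sum (d_i - d_bar)^2 )
    se = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (k * (k - 1)))
    t = d_bar / se
    t_crit = stats.t.ppf(1 - alpha, df=k - 1)   # 1-sided critical value
    return t, t_crit   # significant if t > t_crit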
example
suppose acc(La) = 90 ± 3.0%, acc(Lb) = 88 ± 3.2%
by eye: the difference is probably not significant, because each range overlaps the other mean
hypothesis test:
suppose n_a = n_b = 100
d = 0.02, s = sqrt(0.9·0.1/100 + 0.88·0.12/100) = 0.044
z = (0.02 - 0)/0.044 ≈ 0.45 < 1.64   (1-sided, α = 0.05)
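Plugging the numbers into the z_test_diff sketch from above reproduces this:

print(z_test_diff(0.90, 100, 0.88, 100))   # ~0.45, below the 1.64 cutoff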
as paired t-test:
CV fold:     1    2    3   ...   mean
acc(La):    92   87   89   ...   90
acc(Lb):    91   87   88   ...   88
d = La-Lb:  +1    0   +1   ...   1.56

t = d̄ / sqrt[ 1/(k(k-1)) · Σ(d_i - d̄)² ] = 2.6
2.6 > 1.833, significant (1-sided, α = 0.05, dof = 9)
Lb was better only twice
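The 1.833 cutoff can be looked up directly, e.g. in Python:

from scipy import stats
print(stats.t.ppf(0.95, df=9))   # ~1.833 (1-sided, alpha=0.05, dof=9)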
Binomial Distribution
Gaussian Distribution
z-score: z = (x - μ)/σ
note: standard errors on the estimate of the mean shrink with the number of observations: SE = σ/√n
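A quick simulation illustrating the shrinking standard error (the sample sizes, repeat count, and seed are arbitrary choices of ours):

import random, statistics

random.seed(0)
for n in (10, 100, 1000):
    # std. dev. of the sample mean over 200 repeated draws of size n
    means = [statistics.fmean(random.gauss(0, 1) for _ in range(n))
             for _ in range(200)]
    print(n, statistics.stdev(means))   # shrinks roughly as 1/sqrt(n)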
note that sensitivity ≠ 1 - specificity and precision ≠ 1 - recall
they are (semi-)independent; however, they are usually inversely related...
combining measures
F1 measure
harmonic mean of precision and recall: F1 = 2 · precision · recall / (precision + recall)
Matthews coefficient
correlation coefficient of errors (approximately geometric mean of sensitivity and specificity)
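Both measures computed from a 2x2 confusion matrix, as a Python sketch (the counts are made up for illustration):

import math

def f1_and_mcc(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    # Matthews correlation coefficient from the confusion-matrix counts
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return f1, mcc

print(f1_and_mcc(tp=40, fp=10, fn=5, tn=45))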
Kruskal-Wallis: compare ranks of algorithms on multiple datasets
Bonferroni correction: use α/|tests| to avoid spurious hits
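A sketch of both ideas in Python (the per-dataset accuracies for three algorithms are invented):

from scipy import stats

a = [0.90, 0.88, 0.92, 0.85, 0.91]
b = [0.87, 0.86, 0.89, 0.84, 0.88]
c = [0.91, 0.90, 0.93, 0.86, 0.92]

h, p = stats.kruskal(a, b, c)   # rank-based comparison across datasets
alpha = 0.05 / 3                # Bonferroni: alpha / number of tests run (suppose 3)
print(p, p < alpha)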