
Empirical Methods: Evaluating and Comparing Classifiers

Reminder about test sets

a test set is just a sample, so it gives only an estimate of err_true
the sample must be representative (drawn i.i.d. from the problem distribution)
i.i.d. = independent and identically distributed
note: it is advisable to randomize the order of the examples before splitting, in case there is any bias/trend in how they were collected
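For instance, a minimal sketch of a shuffled hold-out split, assuming NumPy arrays X and y (the function name is ours):

```python
import numpy as np

def shuffled_split(X, y, test_fraction=0.2, seed=0):
    """Randomly permute the examples, then hold out a fraction as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))              # randomizing removes any ordering bias/trend
    n_test = int(len(y) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```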

Confidence intervals
how accurate is our estimate of accuracy?
the best estimate of err_true is err_sample, but it could vary due to sampling error
the variance of the estimate should shrink as the test set grows
the Binomial distribution and the Law of Large Numbers show that the standard deviation of the sample error e_s is s = √(e_s(1 − e_s)/n), valid when n ≥ 30 or np(1 − p) ≥ 5
thus a confidence interval can be constructed:
e_s − 1.96s ≤ err_true ≤ e_s + 1.96s   (with 95% prob.; α < 0.05)

1.96 comes from the Normal distribution: the area between the two tails is ~95%
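A minimal sketch of this interval (the function name and argument names are ours):

```python
import math

def error_confidence_interval(n_errors, n_test, z=1.96):
    """Approximate confidence interval for the true error rate (valid for large enough n)."""
    e = n_errors / n_test                   # sample error rate
    s = math.sqrt(e * (1 - e) / n_test)     # std. dev. of the estimate
    return e - z * s, e + z * s
```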

example
suppose a decision tree makes 15% errors on 200 independent test examples
then a 95% C.I. on accuracy is 85.0% ± 4.9, or [80.1, 89.9]
running more test examples could tighten this
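Plugging the example into the error_confidence_interval sketch above (15% of 200 is 30 mistakes):

```python
lo, hi = error_confidence_interval(n_errors=30, n_test=200)
print(1 - hi, 1 - lo)   # accuracy interval, roughly (0.801, 0.899)
```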

cross-validation is more reliable


divide the training examples into 10 bins, B1..B10
build 10 different classifiers, T1..T10
Ti is trained on {B1..B10} \ Bi

test each classifier Ti on the held-out bin Bi
calculate the mean and standard error of the accuracies
this is 10-fold CV, but you could do 5x or 20x
because of the small number of folds, use t_{α/2, N−1} · S.E. for the C.I.
t_{α/2, N−1} comes from Student's t-distribution
(a code sketch of this procedure appears after the notes below)

advantage:
the variance better reflects differences due to sampling

disadvantage:
of course, each classifier is only trained on 90% of the data
could be sensitive to variation in the training sets (stability of the classifier)

note: the test sets are distinct, but the training sets overlap
variations: leave-one-out, 5x2 CV, bootstrapping...
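A minimal sketch of the k-fold procedure, assuming a scikit-learn-style classifier with fit/score and NumPy arrays X, y (the helper name is ours):

```python
import numpy as np
from scipy import stats

def cross_val_confidence(clf, X, y, k=10, alpha=0.05, seed=0):
    """k-fold CV: mean accuracy with a Student's-t confidence interval."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)     # bins B1..Bk
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        clf.fit(X[train_idx], y[train_idx])                # Ti trained on all bins except Bi
        accs.append(clf.score(X[test_idx], y[test_idx]))   # tested on the held-out bin Bi
    accs = np.asarray(accs)
    se = accs.std(ddof=1) / np.sqrt(k)                     # standard error of the mean
    t_crit = stats.t.ppf(1 - alpha / 2, df=k - 1)          # t_{alpha/2, k-1}
    return accs.mean(), accs.mean() - t_crit * se, accs.mean() + t_crit * se
```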

Comparing Classifiers
if you measure the sample errors of La and Lb independently...

hypothesis testing (statistics):
the null hypothesis is that La and Lb are effectively equivalent, and any performance difference is due to sampling
alternative hypothesis:
two-sided: err(La) ≠ err(Lb)
one-sided: err(La) < err(Lb)

α = 0.05 is the maximum probability of a type I error (rejecting the null hypothesis when it is actually true, i.e. concluding the classifiers are different when they really are not)

need to calculate the difference and estimate the variance of the difference

the variance of the difference is the sum of the variances, weighted by the sizes of the test sets: s² = err_a(1 − err_a)/N_a + err_b(1 − err_b)/N_b
the difference in accuracy is significant if 0 is not within d ± 1.96s (or whatever z is appropriate for α)
this is not very sensitive for comparing classifiers
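A minimal sketch of the test just described (the function name is ours):

```python
import math

def unpaired_z(acc_a, acc_b, n_a, n_b, z=1.96):
    """Test whether two independently measured accuracies differ significantly."""
    d = abs(acc_a - acc_b)
    s = math.sqrt(acc_a * (1 - acc_a) / n_a + acc_b * (1 - acc_b) / n_b)
    return d - z * s > 0     # True if 0 lies outside d +/- z*s, i.e. the difference is significant
```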

if you run head-to-head comparisons between La and Lb, say on the same CV folds, you can do a paired t-test
this increases the sensitivity of the comparison
calculate d_i = err_i(La) − err_i(Lb) over all k = 10 folds
calculate the mean and standard error of d (below)
look up t in Student's t-distribution for α and dof = k − 1
the difference is significant if 0 is not within mean(d) ± t · S.E.(d)
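A minimal sketch of this paired comparison over per-fold error rates (scipy.stats.ttest_rel performs the two-sided version of the same computation):

```python
import numpy as np
from scipy import stats

def paired_cv_test(err_a, err_b, alpha=0.05):
    """One-sided paired t-test on per-fold error rates from the same CV folds."""
    d = np.asarray(err_a) - np.asarray(err_b)     # d_i = err_i(La) - err_i(Lb)
    k = len(d)
    se = d.std(ddof=1) / np.sqrt(k)               # standard error of mean(d)
    t_stat = d.mean() / se
    t_crit = stats.t.ppf(1 - alpha, df=k - 1)     # one-sided critical value
    return t_stat, t_stat > t_crit
```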

example
suppose acc(La) = 90 ± 3.0%, acc(Lb) = 88 ± 3.2%
by eye: the difference is probably not significant, because each range overlaps the other mean
hypothesis test:
suppose Na = Nb = 100
d = 0.02, s = √(0.9·0.1/100 + 0.88·0.12/100) = 0.044
z = (0.02 − 0)/0.044 ≈ 0.45 < 1.64 (1-sided, α = 0.05), so not significant

as a paired t-test:

CV fold      1    2    3   ...   mean
acc(La)     92   87   89   ...   90
acc(Lb)     91   87   88   ...   88
d = La−Lb   +1    0   +1   ...   1.56

suppose for d: mean = 1.56, standard error S.E. = √[1/(k(k−1)) · Σ(d_i − mean_d)²] = 0.6
t = 1.56 / 0.6 = 2.6
2.6 > 1.833, so the difference is significant (1-sided, α = 0.05, dof = 9)
note that Lb was better on only two folds

[figure: accuracy axis from 0.88 to 0.92 showing the overlapping confidence intervals of acc(La) and acc(Lb)]
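The 1.833 threshold is just the one-sided Student's t critical value, e.g.:

```python
from scipy import stats
print(stats.t.ppf(0.95, df=9))   # ~1.833: one-sided critical value for alpha=0.05, dof=9
```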

Binomial Distribution

mean = np; variance = np(1 − p)

Gaussian Distribution

z-score: z = (x − μ)/σ
note: the standard error of the estimated mean shrinks with the number of observations: S.E. = σ/√n
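A quick simulation illustrating that shrinking standard error (p = 0.85 and the sample sizes are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (30, 300, 3000):
    # 1000 simulated test sets of size n, each giving one accuracy estimate
    estimates = rng.binomial(n=1, p=0.85, size=(1000, n)).mean(axis=1)
    print(n, estimates.std())    # empirical standard error, roughly sqrt(p*(1-p)/n)
```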

other measures of performance


sensitivity and specificity
precision vs. recall
dividing errors into 2 types: false positives vs. false negatives

ROC curves and AUC
F-measure

note that sensitivity ≠ 1 − specificity and precision ≠ 1 − recall
they are (semi-)independent; however, they are usually inversely related...

combining measures
F1 measure
harmonic mean of precision and recall

Matthews coefficient
correlation coefficient of errors (approximately geometric mean of sensitivity and specificity)
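A minimal sketch computing these measures from binary confusion-matrix counts (names are ours; zero denominators are not handled):

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Sensitivity/specificity, precision/recall, F1, and the Matthews correlation coefficient."""
    sensitivity = recall = tp / (tp + fn)       # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))   # Matthews coefficient
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision, recall=recall, f1=f1, mcc=mcc)
```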

ROC (receiver operating characteristic) curves


different algorithms produce different curves
points along a curve represent different parameter settings, such as decision thresholds

optimal tradeoff point

AUC (area under curve)


a measure of an algorithm's performance integrated over all parameter settings
it reflects the shape of the ROC curve: how quickly it rises toward the top-left corner
an ideal classifier would have AUC = 1
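A minimal sketch that traces a ROC curve by sweeping the decision threshold over real-valued scores and integrates it with the trapezoid rule (arrays y_true and scores are assumed inputs; tied scores are ignored for simplicity):

```python
import numpy as np

def roc_auc(y_true, scores):
    """Lower the decision threshold one score at a time, collect (FPR, TPR) points, return the area."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float))       # descending score
    hits = (y_true[order] == 1)
    tpr = np.concatenate([[0.0], np.cumsum(hits) / max(hits.sum(), 1)])
    fpr = np.concatenate([[0.0], np.cumsum(~hits) / max((~hits).sum(), 1)])
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)     # trapezoid rule
```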

other statistical tests


McNemar's test: a χ² statistic on the two classifiers' errors
ANOVA: compare multiple classifiers
analysis of variance: calculate the sum of squares within vs. between groups

Kruskal-Wallis: compare the ranks of algorithms over multiple datasets
Bonferroni correction: use α/|tests| to avoid spurious hits when running many comparisons
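For instance, a sketch of McNemar's test on the disagreement counts b (examples only La gets wrong) and c (examples only Lb gets wrong); the continuity-corrected statistic is an assumption here:

```python
from scipy import stats

def mcnemar(b, c):
    """McNemar's chi-squared test (with continuity correction) on disagreement counts."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = stats.chi2.sf(chi2, df=1)     # survival function = 1 - CDF
    return chi2, p_value
```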
