
Empirical Methods: Evaluating and Comparing Classifiers

Reminder about test sets

a test set is just a sample, so it gives only an estimate of err_true
the sample must be representative (drawn i.i.d. from the problem distribution)
i.i.d. = independent and identically distributed
note: it is advisable to randomize the order of the examples before splitting, in case there is any bias/trend in how they were collected
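For instance, a minimal sketch of a shuffled hold-out split, assuming NumPy arrays X and y (the function name is ours):

```python
import numpy as np

def shuffled_split(X, y, test_fraction=0.2, seed=0):
    """Randomly permute the examples, then hold out a fraction as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))              # randomizing removes any ordering bias/trend
    n_test = int(len(y) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```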

Confidence intervals
how accurate is our estimate of accuracy?
the best estimate of err_true is err_sample, but it could vary due to sampling error
the variance of the estimate should shrink as the test set grows
the Binomial distribution and the Law of Large Numbers show that the standard deviation of the sample error e_s is s = √(e_s(1 − e_s)/n), valid when n ≥ 30 or np(1 − p) ≥ 5
thus a confidence interval can be constructed:
e_s − 1.96s ≤ err_true ≤ e_s + 1.96s   (with 95% prob.; α < 0.05)

1.96 comes from the Normal distribution: the area between the two tails is ~95%
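A minimal sketch of this interval (the function name and argument names are ours):

```python
import math

def error_confidence_interval(n_errors, n_test, z=1.96):
    """Approximate confidence interval for the true error rate (valid for large enough n)."""
    e = n_errors / n_test                   # sample error rate
    s = math.sqrt(e * (1 - e) / n_test)     # std. dev. of the estimate
    return e - z * s, e + z * s
```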

example
suppose a decision tree makes 15% errors on 200 independent test examples
then a 95% C.I. on accuracy is 85.0% ± 4.9, or [80.1, 89.9]
running more test examples could tighten this
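Plugging the example into the error_confidence_interval sketch above (15% of 200 is 30 mistakes):

```python
lo, hi = error_confidence_interval(n_errors=30, n_test=200)
print(1 - hi, 1 - lo)   # accuracy interval, roughly (0.801, 0.899)
```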

cross-validation is more reliable


divide the training examples into 10 bins, B1..B10
build 10 different classifiers, T1..T10
Ti is trained on {B1..B10} \ Bi

test each classifier Ti on the held-out bin Bi
calculate the mean and standard error of the accuracies
this is 10-fold CV, but you could do 5x or 20x
because of the small number of folds, use t_{α/2, N−1} · S.E. for the C.I.
t_{α/2, N−1} comes from Student's t-distribution
(a code sketch of this procedure appears after the notes below)

advantage:
the variance better reflects differences due to sampling

disadvantage:
of course, each classifier is only trained on 90% of the data
could be sensitive to variation in the training sets (stability of the classifier)

note: the test sets are distinct, but the training sets overlap
variations: leave-one-out, 5x2 CV, bootstrapping...
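A minimal sketch of the k-fold procedure, assuming a scikit-learn-style classifier with fit/score and NumPy arrays X, y (the helper name is ours):

```python
import numpy as np
from scipy import stats

def cross_val_confidence(clf, X, y, k=10, alpha=0.05, seed=0):
    """k-fold CV: mean accuracy with a Student's-t confidence interval."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)     # bins B1..Bk
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        clf.fit(X[train_idx], y[train_idx])                # Ti trained on all bins except Bi
        accs.append(clf.score(X[test_idx], y[test_idx]))   # tested on the held-out bin Bi
    accs = np.asarray(accs)
    se = accs.std(ddof=1) / np.sqrt(k)                     # standard error of the mean
    t_crit = stats.t.ppf(1 - alpha / 2, df=k - 1)          # t_{alpha/2, k-1}
    return accs.mean(), accs.mean() - t_crit * se, accs.mean() + t_crit * se
```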

Comparing Classifiers
if you measure the sample errors of La and Lb independently...

hypothesis testing (statistics):
the null hypothesis is that La and Lb are effectively equivalent, and any performance difference is due to sampling
alternative hypothesis:
two-sided: err(La) ≠ err(Lb)
one-sided: err(La) < err(Lb)

α = 0.05 is the maximum probability of a type I error (rejecting the null hypothesis when it is actually true, i.e. concluding the classifiers are different when they really are not)

need to calculate the difference and estimate the variance of the difference

the variance of the difference is the sum of the variances, weighted by the sizes of the test sets: s² = err_a(1 − err_a)/N_a + err_b(1 − err_b)/N_b
the difference in accuracy is significant if 0 is not within d ± 1.96s (or whatever z is appropriate for α)
this is not very sensitive for comparing classifiers
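A minimal sketch of the test just described (the function name is ours):

```python
import math

def unpaired_z(acc_a, acc_b, n_a, n_b, z=1.96):
    """Test whether two independently measured accuracies differ significantly."""
    d = abs(acc_a - acc_b)
    s = math.sqrt(acc_a * (1 - acc_a) / n_a + acc_b * (1 - acc_b) / n_b)
    return d - z * s > 0     # True if 0 lies outside d +/- z*s, i.e. the difference is significant
```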

if you run head-to-head comparisons between La and Lb, say on the same CV folds, you can do a paired t-test
this increases the sensitivity of the comparison
calculate d_i = err_i(La) − err_i(Lb) over all k = 10 folds
calculate the mean and standard error of d (below)
look up t in Student's t-distribution for α and dof = k − 1
the difference is significant if 0 is not within mean(d) ± t · S.E.(d)
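A minimal sketch of this paired comparison over per-fold error rates (scipy.stats.ttest_rel performs the two-sided version of the same computation):

```python
import numpy as np
from scipy import stats

def paired_cv_test(err_a, err_b, alpha=0.05):
    """One-sided paired t-test on per-fold error rates from the same CV folds."""
    d = np.asarray(err_a) - np.asarray(err_b)     # d_i = err_i(La) - err_i(Lb)
    k = len(d)
    se = d.std(ddof=1) / np.sqrt(k)               # standard error of mean(d)
    t_stat = d.mean() / se
    t_crit = stats.t.ppf(1 - alpha, df=k - 1)     # one-sided critical value
    return t_stat, t_stat > t_crit
```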

example
suppose acc(La) = 90 ± 3.0%, acc(Lb) = 88 ± 3.2%
by eye: the difference is probably not significant, because each range overlaps the other mean
hypothesis test:
suppose Na = Nb = 100
d = 0.02, s = √(0.9·0.1/100 + 0.88·0.12/100) = 0.044
z = (0.02 − 0)/0.044 ≈ 0.45 < 1.64 (1-sided, α = 0.05), so not significant

as a paired t-test:

CV fold      1    2    3   ...   mean
acc(La)     92   87   89   ...   90
acc(Lb)     91   87   88   ...   88
d = La−Lb   +1    0   +1   ...   1.56

suppose for d: mean = 1.56, standard error S.E. = √[1/(k(k−1)) · Σ(d_i − mean_d)²] = 0.6
t = 1.56 / 0.6 = 2.6
2.6 > 1.833, so the difference is significant (1-sided, α = 0.05, dof = 9)
note that Lb was better on only two folds

[figure: accuracy axis from 0.88 to 0.92 showing the overlapping confidence intervals of acc(La) and acc(Lb)]
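The 1.833 threshold is just the one-sided Student's t critical value, e.g.:

```python
from scipy import stats
print(stats.t.ppf(0.95, df=9))   # ~1.833: one-sided critical value for alpha=0.05, dof=9
```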

Binomial Distribution

mean = np; variance = np(1 − p)

Gaussian Distribution

z-score: z = (x − μ)/σ
note: the standard error of the estimated mean shrinks with the number of observations: S.E. = σ/√n
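A quick simulation illustrating that shrinking standard error (p = 0.85 and the sample sizes are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (30, 300, 3000):
    # 1000 simulated test sets of size n, each giving one accuracy estimate
    estimates = rng.binomial(n=1, p=0.85, size=(1000, n)).mean(axis=1)
    print(n, estimates.std())    # empirical standard error, roughly sqrt(p*(1-p)/n)
```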

other measures of performance


sensitivity and specificity
precision vs. recall
dividing errors into 2 types: false positives vs. false negatives

ROC curves and AUC
F-measure

note that sensitivity ≠ 1 − specificity and precision ≠ 1 − recall
they are (semi-)independent; however, they are usually inversely related...

combining measures
F1 measure
harmonic mean of precision and recall

Matthews coefficient
correlation coefficient of errors (approximately geometric mean of sensitivity and specificity)
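A minimal sketch computing these measures from binary confusion-matrix counts (names are ours; zero denominators are not handled):

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Sensitivity/specificity, precision/recall, F1, and the Matthews correlation coefficient."""
    sensitivity = recall = tp / (tp + fn)       # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))   # Matthews coefficient
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision, recall=recall, f1=f1, mcc=mcc)
```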

ROC (receiver operating characteristic) curves


different algorithms produce different curves
points along a curve represent different parameter settings, such as decision thresholds

optimal tradeoff point

AUC (area under curve)


a measure of an algorithm's performance integrated over all parameter settings
it reflects the shape of the ROC curve: how quickly it rises toward the top-left corner
an ideal classifier would have AUC = 1
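A minimal sketch that traces a ROC curve by sweeping the decision threshold over real-valued scores and integrates it with the trapezoid rule (arrays y_true and scores are assumed inputs; tied scores are ignored for simplicity):

```python
import numpy as np

def roc_auc(y_true, scores):
    """Lower the decision threshold one score at a time, collect (FPR, TPR) points, return the area."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float))       # descending score
    hits = (y_true[order] == 1)
    tpr = np.concatenate([[0.0], np.cumsum(hits) / max(hits.sum(), 1)])
    fpr = np.concatenate([[0.0], np.cumsum(~hits) / max((~hits).sum(), 1)])
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)     # trapezoid rule
```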

other statistical tests


McNemar's test: a χ² statistic on the two classifiers' errors
ANOVA: compare multiple classifiers
analysis of variance: calculate the sum of squares within vs. between groups

Kruskal-Wallis: compare the ranks of algorithms over multiple datasets
Bonferroni correction: use α/|tests| to avoid spurious hits when running many comparisons
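For instance, a sketch of McNemar's test on the disagreement counts b (examples only La gets wrong) and c (examples only Lb gets wrong); the continuity-corrected statistic is an assumption here:

```python
from scipy import stats

def mcnemar(b, c):
    """McNemar's chi-squared test (with continuity correction) on disagreement counts."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = stats.chi2.sf(chi2, df=1)     # survival function = 1 - CDF
    return chi2, p_value
```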
