
Maximum Expected F-Measure Training

of Logistic Regression Models


Martin Jansche, HLT 2005

presented by Philip Zigoris


Motivation

• Learning algorithms generally optimize 0-1 accuracy.
• Often this is not the performance measure we are actually concerned with.
• This tends to be the case when the dataset is heavily skewed towards one class, or when the cost of an error differs between classes.
Outline

• Review: Logistic Regression
• Review: F_α Performance Measure
• Optimizing F_α
  – Formulation and Algorithm
  – Comparison to ML
  – Experimental Results
• Conclusion
Review: Logistic Regression

Sample: $(x_i, y_i) \in \mathbb{R}^k \times \{\pm 1\}$

Model: $\Pr(+1 \mid x, \theta) = \dfrac{1}{1 + e^{-x \cdot \theta}} = g(x \cdot \theta)$

Classifier: $y_{\mathrm{MAP}}(x) = \arg\max_y \Pr(y \mid x, \theta)$

Objective: $\theta^{*} = \arg\max_{\theta} \prod_i g\bigl(y_i \, (x_i \cdot \theta)\bigr)$
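As a concrete reference, here is a minimal NumPy sketch of the pieces above; the helper names (g, predict_map, log_likelihood) are mine, not from the paper, and X is assumed to be an n-by-k matrix of feature vectors with y a vector of ±1 labels.

```python
import numpy as np

def g(z):
    """Logistic sigmoid: g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_map(theta, X):
    """MAP decision rule: +1 when Pr(+1 | x, theta) > 1/2, i.e. when x . theta > 0."""
    return np.where(X @ theta > 0, 1, -1)

def log_likelihood(theta, X, y):
    """Log of the ML objective: sum_i log g(y_i * (x_i . theta))."""
    return np.sum(np.log(g(y * (X @ theta))))
```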
Review: F-measure

Confusion counts:
• A: true positives
• B: misses (false negatives)
• C: false alarms (false positives)
• D: true negatives

                 Predicted
                 +1    -1
True   +1         A     B
       -1         C     D

Recall: $R = A/(A+B)$        Precision: $P = A/(A+C)$

$F_\alpha(R, P) = \left(\dfrac{\alpha}{R} + \dfrac{1 - \alpha}{P}\right)^{-1} = \dfrac{A}{A + \alpha B + (1 - \alpha) C} = F_\alpha(A, B, C)$
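In code, the count-based form is a one-liner (a sketch; the function name is mine):

```python
def f_alpha(A, B, C, alpha=0.5):
    """F_alpha from confusion counts: A = true positives, B = misses, C = false alarms.
    Equivalent to the weighted harmonic mean of recall A/(A+B) and precision A/(A+C)."""
    return A / (A + alpha * B + (1.0 - alpha) * C)
```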
Section 4: Relation to Expected Utility

Express F as a rational function of a vector-valued utility:

$U_S = \dfrac{1}{n} \begin{bmatrix} \sum_i I[y_{\mathrm{MAP}}(x_i) = +1]\, I[y_i = +1] \\ \sum_i I[y_{\mathrm{MAP}}(x_i) = -1]\, I[y_i = +1] \\ \sum_i I[y_{\mathrm{MAP}}(x_i) = +1]\, I[y_i = -1] \end{bmatrix} = \dfrac{1}{n} \begin{bmatrix} A \\ B \\ C \end{bmatrix}$
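A sketch of this utility vector computed from hard MAP predictions (utility_vector is a hypothetical helper name; pred and y are vectors of ±1 values):

```python
import numpy as np

def utility_vector(pred, y):
    """U_S = (1/n) * [A, B, C]^T: normalized counts of true positives,
    misses, and false alarms for predictions pred and gold labels y."""
    n = len(y)
    A = np.sum((pred == 1) & (y == 1))    # true positives
    B = np.sum((pred == -1) & (y == 1))   # misses
    C = np.sum((pred == 1) & (y == -1))   # false alarms
    return np.array([A, B, C]) / n
```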
(Approximately) Optimizing F

Similar to logistic regression, replace the hard MAP decision with the model probability:

$I[y_{\mathrm{MAP}}(x_i) = +1] \approx \Pr(+1 \mid x_i, \theta) = g(x_i \cdot \theta)$

We can then approximate A, B, C:

$\tilde{A}(\theta) = \sum_{i:\, y_i = +1} g(x_i \cdot \theta) \qquad \tilde{m}_{\mathrm{pos}}(\theta) = \sum_i g(x_i \cdot \theta)$

$\tilde{B}(\theta) = n_{\mathrm{pos}} - \tilde{A}(\theta) \qquad \tilde{C}(\theta) = \tilde{m}_{\mathrm{pos}}(\theta) - \tilde{A}(\theta)$

Substituting into $F_\alpha(A, B, C)$, the denominator simplifies, giving

$\tilde{F}_\alpha(\theta) = \dfrac{\tilde{A}(\theta)}{\alpha\, n_{\mathrm{pos}} + (1 - \alpha)\, \tilde{m}_{\mathrm{pos}}(\theta)}$
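A minimal sketch of the smoothed objective, assuming X is an n-by-k design matrix and y a vector of ±1 labels; soft_f_alpha is my name, not the paper's. Because the expression is differentiable in θ, it can be maximized with standard optimizers (for instance by handing its negative to scipy.optimize.minimize).

```python
import numpy as np

def soft_f_alpha(theta, X, y, alpha=0.5):
    """Smoothed F_alpha: replace the indicator I[y_MAP(x_i) = +1] by g(x_i . theta)."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # g(x_i . theta) for every example
    A_soft = p[y == 1].sum()                 # ~A: expected true positives
    m_soft = p.sum()                         # ~m_pos: expected number predicted positive
    n_pos = np.sum(y == 1)                   # actual number of positives
    return A_soft / (alpha * n_pos + (1.0 - alpha) * m_soft)
```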
Comparison to maximum likelihood: Toy dataset

x    y
0   +1
1   -1
2   +1
3   +1
Comparison to maximum likelihood: Toy dataset

Maximum likelihood gives the all-+1 classifier (0.35, 0.57):
• Recall is 1
• Precision is 3/4
• F.5 = 6/7 ≈ 0.86

Classifier trained with the F.5 approximation (20, 15):
• Gives the all-+1 classifier (results are the same as ML)
• F.25 = 4/5 ≈ 0.8

Classifier trained with the F.25 approximation (-20, 15) labels the first two examples negative:
• F.5 = 4/5 ≈ 0.8
• F.25 = 8/9 ≈ 0.89
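These numbers can be checked with a few lines of NumPy, assuming the (θ0, θ1) parameterization with score θ0 + θ1·x and the labels (+1, -1, +1, +1); the helper names are mine.

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias term plus x
y = np.array([1, -1, 1, 1])

def f_alpha(A, B, C, alpha):
    return A / (A + alpha * B + (1 - alpha) * C)

def confusion(theta):
    pred = np.where(X @ theta > 0, 1, -1)
    A = np.sum((pred == 1) & (y == 1))    # true positives
    B = np.sum((pred == -1) & (y == 1))   # misses
    C = np.sum((pred == 1) & (y == -1))   # false alarms
    return A, B, C

for name, theta in [("ML", (0.35, 0.57)),
                    ("F.5-trained", (20, 15)),
                    ("F.25-trained", (-20, 15))]:
    A, B, C = confusion(np.array(theta, dtype=float))
    print(f"{name:>12}: F.5 = {f_alpha(A, B, C, 0.5):.2f}, F.25 = {f_alpha(A, B, C, 0.25):.2f}")
```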
Experiments: Text Summarization

Task: classify sentences (and sentence-like units) as belonging to a summary.

Data:
• 3535 training instances, 408 test instances
• 29 features (1 binary, 28 real/integer-valued)
• All features present

Results:

Data source: Sameer Maskey and Julia Hirschberg. Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization.
Conclusions
Main idea:
Approximate the MAP classification decision with the class probability itself. This gives an objective that is continuous in the parameters and can be optimized with standard techniques.

Main criticism:
The experiments are inconclusive.