Academic documents
Professional documents
Culture documents
evaluation metrics
cross-validation
micro- and macro-averaging
Multi-Label Classification
J. Fürnkranz
Document Representation
Rows: training documents
Columns: words in the training documents
Concept Representation
Lazy Learning
Example Availability
Batch Learning
Active Learning
The learner may choose an example and ask the teacher for the relevant training information
Induction of Classifiers
The most popular learning problem:
Task:
Experience:
Performance Measure:
Induction of Classifiers
Example
Training
Classifier
An inductive learning algorithm searches in a given family of hypotheses (e.g., decision trees, neural networks) for a member that optimizes given quality criteria (e.g., estimated predictive accuracy or misclassification costs).
Classification
Induction of Classifiers
Typical Characteristics
New Example
Training
Classification
kNN Classifier
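As a minimal sketch of the kNN idea (toy tokenized documents and a simple word-overlap similarity standing in for the usual tf-idf cosine; all data below is invented):

```python
from collections import Counter

def knn_classify(doc, train_docs, k=3):
    """Classify a tokenized document by majority vote among the k
    most similar training documents (word-overlap similarity)."""
    def sim(a, b):
        return len(set(a) & set(b))
    neighbors = sorted(train_docs, key=lambda dc: sim(doc, dc[0]),
                       reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# toy training set: (tokens, class label)
train = [(["ball", "goal", "team"], "sports"),
         (["match", "goal"], "sports"),
         (["stocks", "market"], "business")]
print(knn_classify(["goal", "team", "win"], train, k=3))  # → sports
```

With k = 3 the two sports neighbors outvote the single business neighbor.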
Rocchio Classifier
assumption: documents of the same class cluster around a class prototype (centroid) vector
[Figure: positive examples clustering around prototype p+, negative examples around prototype p−]
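A toy sketch of the Rocchio idea: the prototype is the centroid of a class's term-frequency vectors, and a new document is assigned to the class of the nearest prototype under cosine similarity (documents and classes below are invented):

```python
import math
from collections import Counter

def prototype(docs):
    """Average term-frequency vector of a list of tokenized documents."""
    total = Counter()
    for d in docs:
        total.update(d)
    return {t: c / len(docs) for t, c in total.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rocchio_classify(doc, protos):
    """Assign the document to the class with the most similar prototype."""
    vec = Counter(doc)
    return max(protos, key=lambda c: cosine(vec, protos[c]))

by_class = {"sports": [["goal", "team"], ["goal", "match"]],
            "business": [["market", "stocks"], ["market", "profit"]]}
protos = {c: prototype(ds) for c, ds in by_class.items()}
print(rocchio_classify(["goal", "market", "team"], protos))  # → sports
```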
[Figures: term-frequency histograms over the WORDS of the novel "War and Peace"]
Class-conditional Probabilities
[Figure: class-conditional probabilities of terms such as "Soccer", "Crisis", and "Change" under the classes Business, Sports, and Politics]
Independence Assumption
In other words:
Important:
[Figure: the document "War and Peace" with posterior class probabilities p(Sports|d), p(Politics|d), p(Business|d)]
Bayesian Classification
Bayes Classifier: predict the class c that has the highest probability given the document d
    c = argmax_c p(c|d)
Problem: p(c|d) cannot be estimated directly
Bayes' rule:
    p(c|d) = p(d|c) · p(c) / p(d)
where equivalently
    p(d) = Σ_c p(d|c) · p(c)
p(d|c) = p(t_1, t_2, ..., t_n | c)
With the independence assumption:
    p(d|c) = Π_{i=1..|d|} p(t_i | c)
Naive Bayes classifier:
    c = argmax_c p(c) · Π_{i=1..|d|} p(t_i | c)
Straight-forward approach: relative frequencies
    p(t_i = w | c) = n_{w,c} / Σ_{w'∈W} n_{w',c}
with n_{w,c} = Σ_{d∈c} n_{d,w}
Problem: a single unseen word (n_{w,c} = 0) sets the whole product to zero
Smoothing of probabilities:
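A minimal sketch combining the relative-frequency estimate with Laplace smoothing (add 1 to every count n_{w,c}); the toy corpus is invented:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Estimate priors p(c) and Laplace-smoothed word probabilities
    p(w|c) = (n_{w,c} + 1) / (sum_w' n_{w',c} + |W|)."""
    vocab = {w for d, _ in docs for w in d}
    counts = defaultdict(Counter)   # per-class word counts n_{w,c}
    class_count = Counter()
    for d, c in docs:
        class_count[c] += 1
        counts[c].update(d)
    priors = {c: n / len(docs) for c, n in class_count.items()}
    def p(w, c):
        return (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))
    return priors, p

def classify_nb(doc, priors, p):
    """c = argmax_c  log p(c) + sum_i log p(t_i|c)"""
    return max(priors, key=lambda c: math.log(priors[c])
               + sum(math.log(p(w, c)) for w in doc))

# invented toy corpus
docs = [(["goal", "team", "goal"], "sports"),
        (["market", "stocks"], "business")]
priors, p = train_nb(docs)
print(classify_nb(["goal", "goal"], priors, p))  # → sports
```

Because of the smoothing, even a word never seen with a class keeps a small nonzero probability instead of zeroing out the product.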
better estimate: model a document of length |d| with word counts {n_{d,w} | w∈d} as a draw from a multinomial distribution over words
    p(d|c) = p(l = |d| | c) · (|d|! / Π_{w∈d} n_{d,w}!) · Π_{w∈d} p(w|c)^{n_{d,w}}
using the multinomial coefficient
    (n choose i_1, i_2, ..., i_k) = n! / (i_1! · i_2! · ... · i_k!)
Binary Model
A document is a binary vector over the vocabulary W:
    p(d|c) = Π_{t∈d} p(t|c) · Π_{t∈W, t∉d} (1 − p(t|c))
           = Π_{t∈d} [ p(t|c) / (1 − p(t|c)) ] · Π_{t∈W} (1 − p(t|c))
where the first product is divided by 1 − p(t|c) to account for t∈d
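A sketch of the binary (Bernoulli) likelihood, with hand-picked probabilities p(t|c) rather than estimated ones:

```python
import math

def bernoulli_log_likelihood(doc, vocab, p_t_given_c):
    """log p(d|c) = sum_{t in d} log p(t|c)
                  + sum_{t in W, t not in d} log(1 - p(t|c))"""
    present = set(doc)
    ll = 0.0
    for t in vocab:
        p = p_t_given_c[t]
        ll += math.log(p) if t in present else math.log(1.0 - p)
    return ll

# invented vocabulary and per-class term probabilities
vocab = ["goal", "team", "market"]
p_sports = {"goal": 0.8, "team": 0.7, "market": 0.1}
p_business = {"goal": 0.1, "team": 0.1, "market": 0.9}
doc = ["goal", "team"]
ll_s = bernoulli_log_likelihood(doc, vocab, p_sports)
ll_b = bernoulli_log_likelihood(doc, vocab, p_business)
print(ll_s > ll_b)  # → True: the sports model fits the document better
```

Note that, unlike the multinomial model, absent words also contribute (via the 1 − p(t|c) factors).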
Writing l_c for the (log-linear) score of class c, the posterior can be written as
    p(c|d) = e^{l_c} / Σ_{c'} e^{l_{c'}} = 1 / (1 + Σ_{c'≠c} e^{l_{c'} − l_c})
Likelihood ratio:
    LR(d) = p(c|d) / (1 − p(c|d))
Rainbow (McCallum)
http://www.cs.umass.edu/~mccallum/bow/rainbow/
Performance analysis
Graphs taken from Andrew McCallum and Kamal Nigam: A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop on "Learning for Text Categorization". http://www.cs.umass.edu/~mccallum/papers/multinomial-aaai98w.ps
In log space, the multinomial Naive Bayes score is linear in the term counts:
    Σ_{w∈d} n_{d,w} · log p(w|c) = b + d · w^{NB}
i.e., Naive Bayes acts as a linear classifier with weights w_w^{NB} = log p(w|c).
Probabilistic approach
Discriminative approach
Find w1, w2, b such that
    w1·x1 + w2·x2 + b ≥ 0 for red points
    w1·x1 + w2·x2 + b < 0 for green points
Web Mining | Text Classification | V2.0
Which Hyperplane?
Support vectors
Maximize the margin
Classifier is: f(x_i) = sign(w^T x_i + b)
Geometric Margin
Distance from an example to the separator is r = y (w^T x + b) / ||w||
Separator: w^T x + b = 0
Support vectors on the margin:
    w^T x_r + b = +1
    w^T x_s + b = −1
Assumption: w is normalized so that min_{i=1..n} |w^T x_i + b| = 1
This implies: w^T (x_r − x_s) = 2
so the margin is ρ = ||x_r − x_s||_2 = 2 / ||w||_2
Constraints: y_i (w^T x_i + b) ≥ 1
the margin is ρ = 2 / ||w||
Find w and b such that the margin ρ = 2/||w|| is maximized;
and for all {(x_i, y_i)}:
    w^T x_i + b ≥ +1 if y_i = +1
    w^T x_i + b ≤ −1 if y_i = −1

Equivalent formulation: Find w and b such that
    Φ(w) = w^T w is minimized;
and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1
This is now optimizing a quadratic function subject to linear constraints:
Find w and b such that
    Φ(w) = w^T w is minimized;
and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1

A Lagrange multiplier α_i is associated with every constraint in the primal problem.

Dual problem: Find α_1 ... α_n such that
    Q(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j is maximized
and
    (1) Σ_i α_i y_i = 0
    (2) α_i ≥ 0 for all i
The resulting decision function is
    f(x) = w^T x + b = Σ_i α_i y_i x_i^T x + b
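The support-vector expansion of the decision function can be sketched as follows (the α values and support vectors below are made up for illustration, not the result of solving the dual):

```python
def svm_decision(x, support, b):
    """f(x) = sum_i alpha_i * y_i * <x_i, x> + b, summed over support
    vectors only (alpha_i = 0 for all non-support examples)."""
    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))
    return sum(alpha * y * dot(xi, x) for alpha, y, xi in support) + b

# toy: two support vectors on either side of the separator x1 = 0
support = [(1.0, +1, (1.0, 0.0)),    # (alpha, y, x)
           (1.0, -1, (-1.0, 0.0))]
b = 0.0
print(1 if svm_decision((0.5, 0.3), support, b) >= 0 else -1)  # → 1
```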
With slack variables ξ_i, the dual becomes: Find α_1 ... α_n such that
    Q(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j is maximized
and
    (1) Σ_i α_i y_i = 0
    (2) 0 ≤ α_i ≤ C for all i
NOTE: Neither slack variables ξ_i nor their Lagrange multipliers appear in the dual problem!
    f(x) = Σ_i α_i y_i x_i^T x + b
where b is computed from the support vector k = argmax_k α_k
In 2 dims: score = w1·x1 + w2·x2 + b
In general: score = w^T x + b
    score ≥ 0: yes
    score < 0: no
Non-linear SVMs
Φ: x ↦ Φ(x)
Kernels
Common kernels
    Linear: K(x_i, x_j) = x_i^T x_j
    Polynomial: K(x_i, x_j) = (1 + x_i^T x_j)^d
    Radial basis function (RBF): K(x_i, x_j) = e^{−||x_i − x_j||² / (2σ²)}
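The three kernels as plain functions (the degree d and width σ are free parameters):

```python
import math

def linear_kernel(xi, xj):
    """K(xi, xj) = xi . xj"""
    return sum(a * b for a, b in zip(xi, xj))

def poly_kernel(xi, xj, d=2):
    """K(xi, xj) = (1 + xi . xj)^d"""
    return (1.0 + linear_kernel(xi, xj)) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))"""
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / (2.0 * sigma ** 2))

x, y = (1.0, 0.0), (0.0, 1.0)
print(linear_kernel(x, y), poly_kernel(x, y), round(rbf_kernel(x, y), 3))
# → 0.0 1.0 0.368
```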
In classification terms: virtually all document sets are separable, for almost any classification.
Performance
Different Kernels
Computational Efficiency
Multi-Class Classification
Many problems have c > 2 classes, not just two
naïve Bayes, k-NN, and Rocchio can handle multiple classes naturally
Support Vector Machines need to be extended
One-against-all
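A sketch of one-against-all prediction: train one binary scorer per class (here trivial keyword-count scorers stand in for trained binary classifiers), then predict the class whose scorer is most confident:

```python
def one_vs_all_predict(x, scorers):
    """Predict the class whose class-vs-rest scorer gives
    the highest confidence on x."""
    return max(scorers, key=lambda c: scorers[c](x))

# invented stand-in scorers (a real system would train one SVM per class)
scorers = {"sports": lambda x: x.count("goal"),
           "business": lambda x: x.count("market")}
print(one_vs_all_predict(["goal", "goal", "market"], scorers))  # → sports
```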
Pairwise Classification
c·(c−1)/2 binary problems: each class against each other class
Voting: s_i = Σ_{j=1..c, j≠i} ⟦ P(C_i ≻ C_j) > 0.5 ⟧
where ⟦x⟧ = 1 if x = true, 0 if x = false
Weighted voting: s_i = Σ_{j=1..c, j≠i} P(C_i ≻ C_j)
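The unweighted voting rule s_i = Σ_j ⟦P(C_i ≻ C_j) > 0.5⟧ as a sketch (the pairwise probabilities below are invented):

```python
def pairwise_votes(pairwise_p, classes):
    """One vote per won pairwise duel:
    s_i = sum_j [[ P(C_i beats C_j) > 0.5 ]]."""
    return {ci: sum(1 for cj in classes
                    if ci != cj and pairwise_p[(ci, cj)] > 0.5)
            for ci in classes}

classes = ["A", "B", "C"]
# hypothetical pairwise probabilities P(Ci beats Cj)
p = {("A", "B"): 0.9, ("B", "A"): 0.1,
     ("A", "C"): 0.6, ("C", "A"): 0.4,
     ("B", "C"): 0.7, ("C", "B"): 0.3}
scores = pairwise_votes(p, classes)
print(max(scores, key=scores.get))  # → A (most votes)
```

Sorting the classes by their vote counts yields the ranking used later for multi-label prediction.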
Occam's Razor
Entities should not be multiplied beyond necessity.
William of Ockham (1285 - 1349)
Overfitting
Given
inconsistent classification
measurement errors
missing values
Capacity control
Capacity Control
Note:
Validation on data
On-line Validation
Out-of-Sample Testing
overfitting!
waste of data
labelling may be expensive
Cross-Validation
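A minimal k-fold cross-validation sketch (the stand-in "learner" simply predicts the majority label of its training folds; data is invented):

```python
def cross_validation(data, k, train_and_eval):
    """Split data into k folds; each fold serves once as the test set,
    the remaining k-1 folds as the training set; return the mean score."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_eval(train, test))
    return sum(scores) / k

# stand-in learner: predict the majority label of the training set
def majority_eval(train, test):
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y == majority) / len(test)

# toy data: label "b" for every 4th example, "a" otherwise
data = [(i, "a" if i % 4 else "b") for i in range(12)]
print(cross_validation(data, k=3, train_and_eval=majority_eval))  # → 0.75
```

Every example is used for testing exactly once, so no labelled data is wasted, at the price of k training runs.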
Evaluation
In Machine Learning:
Accuracy = percentage of correctly classified examples
Confusion Matrix:

                   is +     is −
classified as +     a        b       a + b
classified as −     c        d       c + d
                  a + c    b + d       n

precision = a / (a + b)
recall = a / (a + c)
accuracy = (a + d) / n
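Computing the three measures from the confusion-matrix entries a, b, c, d (the counts below are invented toy numbers):

```python
def prec_rec_acc(a, b, c, d):
    """a: classified + and is +,  b: classified + and is -,
       c: classified - and is +,  d: classified - and is -."""
    n = a + b + c + d
    precision = a / (a + b)
    recall = a / (a + c)
    accuracy = (a + d) / n
    return precision, recall, accuracy

p, r, acc = prec_rec_acc(a=15, b=10, c=5, d=70)
print(round(p, 3), round(r, 3), round(acc, 3))  # → 0.6 0.75 0.85
```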
For c > 2 classes (here A, B, C, D), with entries n_{X,Y} and marginals n_X· (rows) and n_·X (columns):

       A        B        C        D
A    n_A,A    n_A,B    n_A,C    n_A,D    n_A·
B    n_B,A    n_B,B    n_B,C    n_B,D    n_B·
C    n_C,A    n_C,B    n_C,C    n_C,D    n_C·
D    n_D,A    n_D,B    n_D,C    n_D,D    n_D·
     n_·A     n_·B     n_·C     n_·D      n
Precision of Class X:
e.g., X = A, ¬X = {B, C, D}

                  is X       is ¬X
classified X     n_X,X      n_X,¬X      n_X·
classified ¬X    n_¬X,X     n_¬X,¬X     n_¬X·
                 n_·X       n_·¬X         n
Two strategies:
Micro-Averaging:
Macro-Averaging:
Basic difference:
Macro-Averaging

Per-class confusion matrices (rows: true class, columns: predicted class):

C1:  15   5 |  20      C2:  20  10 |  30      C3:  45   5 |  50
     10  70 |  80           12  58 |  70            5  45 |  50
     25  75 | 100           32  68 | 100           50  50 | 100

prec(C1) = 15/25 = 0.600   prec(C2) = 20/32 = 0.625   prec(C3) = 45/50 = 0.900
recl(C1) = 15/20 = 0.750   recl(C2) = 20/30 = 0.667   recl(C3) = 45/50 = 0.900

avg. prec = (0.600 + 0.625 + 0.900) / 3 = 0.708
avg. recl = (0.750 + 0.667 + 0.900) / 3 = 0.772
Micro-Averaging

The same three per-class matrices (rows: true class, columns: predicted class):

C1:  15   5 |  20      C2:  20  10 |  30      C3:  45   5 |  50
     10  70 |  80           12  58 |  70            5  45 |  50
     25  75 | 100           32  68 | 100           50  50 | 100

Summing them into one pooled matrix:

        C     ¬C
C      80     20    100
¬C     27    173    200
      107    193    300

avg. prec = 80/107 = 0.748
avg. recl = 80/100 = 0.800

Micro-averaged estimates are in this case higher because the performance on the largest class (C3) was best.
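Both averaging strategies over per-class (tp, fp, fn) counts, using the three toy tables above:

```python
def macro_micro(per_class):
    """per_class: list of (tp, fp, fn) per label.
    Macro: average the per-class precisions/recalls.
    Micro: pool the counts first, then compute once."""
    precs = [tp / (tp + fp) for tp, fp, fn in per_class]
    recs = [tp / (tp + fn) for tp, fp, fn in per_class]
    macro_p = sum(precs) / len(precs)
    macro_r = sum(recs) / len(recs)
    TP = sum(tp for tp, _, _ in per_class)
    FP = sum(fp for _, fp, _ in per_class)
    FN = sum(fn for _, _, fn in per_class)
    return macro_p, macro_r, TP / (TP + FP), TP / (TP + FN)

# counts (tp, fp, fn) read off the three 2x2 tables for C1, C2, C3
stats = [(15, 10, 5), (20, 12, 10), (45, 5, 5)]
mp, mr, micp, micr = macro_micro(stats)
print(round(mp, 3), round(mr, 3), round(micp, 3), round(micr, 3))
# → 0.708 0.772 0.748 0.8
```

Macro-averaging weights every class equally; micro-averaging weights every example equally, so large classes dominate.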
Benchmark Datasets
Publicly available Benchmark Datasets facilitate standardized
evaluation and comparisons to previous work
Reuters-21578
OHSUMED
20 newsgroups
WebKB
Industry sectors
Reuters-21578 Dataset
Trade (369,119)
Interest (347, 131)
Ship (197, 89)
Wheat (212, 71)
Corn (182, 56)
Reuters
Accuracy with different Algorithms

Category      Rocchio   NBayes   Trees   LinearSVM
earn           92.9%     95.9%   97.8%     98.2%
acq            64.7%     87.8%   89.7%     92.8%
money-fx       46.7%     56.6%   66.2%     74.0%
grain          67.5%     78.8%   85.0%     92.4%
crude          70.1%     79.5%   85.0%     88.3%
trade          65.1%     63.9%   72.5%     73.5%
interest       63.4%     64.9%   67.1%     76.3%
ship           49.2%     85.4%   74.2%     78.0%
wheat          68.9%     69.7%   92.5%     89.7%
corn           48.2%     65.3%   91.8%     91.1%
Avg Top 10     64.6%     81.5%   88.4%     91.4%
Avg All Cat    61.7%     75.2%    n/a      86.4%

Results taken from S. Dumais et al. 1998
[Figure: precision/recall curves for Linear SVM, Decision Tree, Naïve Bayes, and Find Similar]
Graph taken from S. Dumais, LOC talk, 1999.
Multiple Datasets
20 newsgroups
the Recreation sub-tree
of the Open Directory
University Web pages
from WebKB.
Multi-Label Classification
Multilabel Classification:
    there may be multiple labels λ_i associated with each example
    e.g., keyword assignments to texts
Hamming Loss:
    HamLoss(R, R̂) = (1/c) · |R Δ R̂|
    i.e., the fraction of the c labels on which the true label set R and the predicted label set R̂ disagree
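Hamming loss over label sets as a one-liner (the example label sets are invented; c = 5 labels assumed):

```python
def hamming_loss(true_labels, pred_labels, n_labels):
    """HamLoss = |symmetric difference of true and predicted sets| / c"""
    return len(set(true_labels) ^ set(pred_labels)) / n_labels

print(hamming_loss({"sports", "europe"}, {"sports", "economy"},
                   n_labels=5))  # → 0.4 (one missed + one spurious label)
```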
Disadvantage:
Each label is considered in isolation
no correlations between labels can be modeled
Disadvantage:
additional methods are needed to convert rankings into multi-label
classifications
Preference Learning
We can say that for each training example, we know that the relevant label is preferred over all other labels (λ_i ≻ λ_j).
Pairwise classification learns a binary classifier C_i,j for each pair of labels {λ_i, λ_j}, which indicates whether λ_i ≻ λ_j or λ_i ≺ λ_j.
Predicted is the most preferred label (the one that receives the most votes from the binary classifiers).
Actually, we can produce a ranking of all labels according to the number of votes.
[Figure: relevant labels {1, 3, 5} and irrelevant labels {2, 7}; each relevant label is preferred to each irrelevant label, yielding |R|·|I| preferences]
Problem:
Calibrated Multi-Label PC
Key idea:
introduce a neutral label into the preference scheme
the neutral label is
less relevant than all relevant classes
more relevant than all irrelevant classes
[Figure: the neutral label 0 is inserted between the relevant labels {1, 3, 5} and the irrelevant labels {2, 7}, adding c = |R| + |I| new preferences]
at prediction time, all labels that are ranked above
the neutral label are predicted to be relevant
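The prediction step of the calibrated approach as a sketch: given vote counts from the pairwise classifiers (invented numbers), predict every label ranked above the neutral label:

```python
def calibrated_predict(votes, neutral="_neutral_"):
    """Predict every label that receives more votes than the
    artificial neutral label."""
    return {lab for lab, v in votes.items()
            if lab != neutral and v > votes[neutral]}

# hypothetical vote counts, including the neutral label
votes = {"politics": 5, "economy": 3, "sports": 1, "_neutral_": 2}
print(sorted(calibrated_predict(votes)))  # → ['economy', 'politics']
```

The neutral label thus acts as a per-example split point between relevant and irrelevant labels, so no global threshold needs to be tuned.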
Calibrated Multi-Label PC
Effectively, Calibrated Multi-Label Pairwise Classification (CMLPC)
combines BR and PC
The binary training sets for preferences involving the neutral label are the same training sets that are used for binary relevance ranking!
Results: RCV1 Text Classification Task
On the RCV1 (Reuters Corpus Volume 1) text classification dataset:
    804,414 Reuters newswire articles (535,987 train, 268,427 test)
    103 different labels, ~3 labels per example
    base learner: perceptrons
Evaluation: multi-label and ranking loss functions
EuroVOC Classification of EC Legal Texts
Eur-Lex database:
    20,000 documents
    4,000 labels (EuroVOC descriptors)
    ~5 labels per document
Results:
    average precision of the pairwise method is almost 50%
    on average, the 5 relevant labels can be found within the first 10 labels of the ranking of all 4,000 labels
    one-against-all methods (BR and MMP) had a precision < 30%,
    but were more efficient (even though the pairwise approach used fewer arithmetic operations)