1

Text categorization
Feature selection: chi square test
2
Slides adapted from Mary Ellen Califf
Joint Probability Distribution
The joint probability distribution for a set of random variables $X_1, \ldots, X_n$
gives the probability of every combination of values:

$P(X_1, \ldots, X_n)$

          Sneeze   ¬Sneeze
Cold       0.08     0.01
¬Cold      0.01     0.9

The probability of any event can be calculated by summing
the appropriate subset of values from the joint distribution.
All conditional probabilities can therefore also be calculated, e.g.
P(Cold | Sneeze) = 0.08 / (0.08 + 0.01) ≈ 0.89
BUT it's often very hard to obtain all the probabilities for a joint
distribution
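
As a minimal sketch of the summation described above, the joint table can be stored as a dictionary and queried directly. The names `joint`, `marginal`, and `conditional` are illustrative assumptions, not part of the original slides, and the second row and column of the table are read as the negated events (no cold, no sneeze).

```python
# Joint distribution over (cold, sneeze), using the values from the table.
# Reading the second row/column as the negated events is an assumption.
joint = {
    (True, True): 0.08,    # cold, sneeze
    (True, False): 0.01,   # cold, no sneeze
    (False, True): 0.01,   # no cold, sneeze
    (False, False): 0.90,  # no cold, no sneeze
}


def marginal(joint, index, value):
    """Sum every joint entry whose variable at `index` equals `value`."""
    return sum(p for outcome, p in joint.items() if outcome[index] == value)


def conditional(joint, target, given):
    """P(target | given); each argument is an (index, value) pair."""
    t_idx, t_val = target
    g_idx, g_val = given
    both = sum(
        p for outcome, p in joint.items()
        if outcome[t_idx] == t_val and outcome[g_idx] == g_val
    )
    return both / marginal(joint, g_idx, g_val)


p_sneeze = marginal(joint, 1, True)  # 0.08 + 0.01
p_cold_given_sneeze = conditional(joint, (0, True), (1, True))
print(round(p_sneeze, 2), round(p_cold_given_sneeze, 3))
```

The table has only four entries here, but it grows exponentially with the number of variables, which is why obtaining a full joint distribution is rarely practical.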

3
Bayes Independence Example
Imagine there are diagnoses ALLERGY, COLD, and WELL and
symptoms SNEEZE, COUGH, and FEVER
Can these be correct numbers?

Prob            Well   Cold   Allergy
P(d)            0.9    0.05   0.05
P(sneeze | d)   0.1    0.9    0.9
P(cough | d)    0.1    0.8    0.7
P(fever | d)    0.01   0.7    0.4
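
One way to use a table like this is to assume the symptoms are conditionally independent given the diagnosis and apply Bayes' rule. The sketch below does that for a hypothetical patient who sneezes and coughs but has no fever. The function name `posterior` and the chosen evidence are assumptions made for illustration only.

```python
# Priors and per-diagnosis symptom probabilities from the table above.
prior = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
likelihood = {
    "sneeze": {"well": 0.1, "cold": 0.9, "allergy": 0.9},
    "cough": {"well": 0.1, "cold": 0.8, "allergy": 0.7},
    "fever": {"well": 0.01, "cold": 0.7, "allergy": 0.4},
}


def posterior(evidence):
    """Return P(d | evidence), assuming symptoms are independent given d.

    `evidence` maps each symptom name to True (present) or False (absent).
    """
    scores = {}
    for d, p_d in prior.items():
        score = p_d
        for symptom, present in evidence.items():
            p = likelihood[symptom][d]
            score *= p if present else (1.0 - p)
        scores[d] = score
    total = sum(scores.values())  # P(evidence), the normalizing constant
    return {d: s / total for d, s in scores.items()}


# Hypothetical patient: sneezing and coughing, no fever.
result = posterior({"sneeze": True, "cough": True, "fever": False})
for d, p in sorted(result.items(), key=lambda item: -item[1]):
    print(f"{d}: {p:.3f}")
```

Note that the priors P(d) must sum to 1 across diagnoses, while each row of conditional probabilities need not.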

4
KL divergence (relative entropy)

$$D(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
Basis of comparing two probability distributions
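
The definition translates almost directly into code. The following is a minimal sketch for discrete distributions given as equal-length lists; the function name `kl_divergence`, the use of natural logarithms, and the example distributions are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) * log(P(x) / Q(x)), in nats.

    `p` and `q` are sequences of probabilities over the same outcomes.
    Terms with P(x) == 0 contribute 0 by convention; Q(x) must be
    positive wherever P(x) is positive.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            raise ValueError("Q(x) is 0 where P(x) > 0; divergence is infinite")
        total += px * math.log(px / qx)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # differs from kl_divergence(q, p): not symmetric
print(kl_divergence(p, p))  # 0.0 for identical distributions
```

Because D(P || Q) and D(Q || P) generally differ, KL divergence is not a distance metric in the strict sense.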
5
Slide adapted from Paul Bennet
Text Categorization Applications
Web pages organized into category hierarchies
Journal articles indexed by subject categories (e.g., the Library
of Congress, MEDLINE, etc.)
Responses to Census Bureau occupations
Patents archived using International Patent Classification
Patient records coded using international insurance categories
E-mail message filtering
News events tracked and filtered by topics
Spam
6
Yahoo News Categories
7
Text Topic categorization
Topic categorization: classify the document into
semantic topics



The U.S. swept into the Davis
Cup final on Saturday when twins
Bob and Mike Bryan defeated
Belarus's Max Mirnyi and Vladimir
Voltchkov to give the Americans
an unsurmountable 3-0 lead in the
best-of-five semi-final tie.

One of the strangest, most
relentless hurricane seasons on
record reached new bizarre heights
yesterday as the plodding approach
of Hurricane Jeanne prompted
evacuation orders for hundreds of
thousands of Floridians and high
wind warnings that stretched 350
miles from the swamp towns south
of Miami to the historic city of St.
Augustine.

8
The Reuters collection
A gold standard
Collection of 21,578 newswire documents.
For research purposes: a standard text collection to compare
systems and algorithms
135 valid topic categories


9
Reuters

Top topics in Reuters

10
Reuters Document Example
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress
kicks off
tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states
determining industry positions on a number of issues, according to the National Pork Producers
Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues,
including the future direction of farm policy and the tax law as it applies to the agriculture sector.
The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all
areas of the industry, the NPPC added. Reuter
&#3;</BODY></TEXT></REUTERS>
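
The category labels of a document in this format sit inside the `<TOPICS>` element, one `<D>` per label. Below is a minimal sketch of pulling them out with regular expressions. Treating the SGML with regexes is a simplification, the helper names `extract_topics` and `extract_field` are hypothetical, and `SAMPLE` is an abbreviated copy of the document above.

```python
import re

# Abbreviated copy of the document above (body text shortened for the sketch).
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress
kicks off tomorrow, March 3, in Indianapolis. Reuter
</BODY></TEXT></REUTERS>"""


def extract_topics(sgml):
    """Return the topic labels listed inside the <TOPICS> element."""
    match = re.search(r"<TOPICS>(.*?)</TOPICS>", sgml, re.DOTALL)
    if match is None:
        return []
    return re.findall(r"<D>(.*?)</D>", match.group(1))


def extract_field(sgml, tag):
    """Return the stripped text of the first <tag>...</tag>, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", sgml, re.DOTALL)
    return match.group(1).strip() if match else None


print(extract_topics(SAMPLE))          # ['livestock', 'hog']
print(extract_field(SAMPLE, "TITLE"))  # AMERICAN PORK CONGRESS KICKS OFF TOMORROW
```

A single document can carry several labels, as this one does, so topic categorization on this collection is a multi-label task.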
11
Classification vs. Clustering
Classification assumes labeled data: we know how
many classes there are and we have examples for
each class (labeled data).
Classification is supervised
In clustering we don't have labeled data; we just
assume that there is a natural division in the data,
and we may not know how many divisions (clusters)
there are
Clustering is unsupervised
12
Categories (Labels, Classes)
Labeling data
2 problems:
Decide the possible classes (which ones, how many)
Domain and application dependent
Label text
Difficult, time consuming, inconsistency between
annotators
13
Reuters Example, revisited
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress
kicks off
tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states
determining industry positions on a number of issues, according to the National Pork Producers
Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues,
including the future direction of farm policy and the tax law as it applies to the agriculture sector.
The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all
areas of the industry, the NPPC added. Reuter
&#3;</BODY></TEXT></REUTERS>
Why not topic = policy?
14
Binary vs. multi-way classification
Binary classification: two classes

Multi-way classification: more than two classes

Sometimes it can be convenient to treat a multi-way
problem as a set of binary ones: one class versus all the
others, for each class
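
The one-versus-all-others idea amounts to relabeling the same data once per class. A minimal sketch, assuming labels are plain strings; the function name `one_vs_rest_labels` and the example labels are hypothetical.

```python
def one_vs_rest_labels(labels):
    """Turn multi-way labels into one binary labeling per class.

    Returns a dict mapping each class to a list of booleans:
    True where the example belongs to that class, False otherwise.
    """
    classes = sorted(set(labels))
    return {c: [label == c for label in labels] for c in classes}


# Hypothetical labels for five documents.
labels = ["sport", "weather", "sport", "politics", "weather"]
binary_tasks = one_vs_rest_labels(labels)
for c, targets in binary_tasks.items():
    print(c, targets)
```

Each entry of `binary_tasks` can then be used to train a separate binary classifier for that class.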
15
Features
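>>> from collections import namedtuple
>>> # Stand-in for the LabeledText class, which the slide uses without defining
>>> LabeledText = namedtuple("LabeledText", ["text", "label"])
>>> text = ("Seven-time Formula One champion "
...         "Michael Schumacher took on the Shanghai circuit "
...         "Saturday in qualifying for the first Chinese "
...         "Grand Prix.")
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)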



Here the classification takes as input the
whole string
What's the problem with that?
What are the features that could be useful for
this example?

16
Feature terminology
Feature: An aspect of the text that is relevant to the
task
Some typical features
Words present in text
Frequency of words
Capitalization
Are there named entities (NE)?
WordNet
Others?
17
Feature terminology
Feature: An aspect of the text that is relevant to the
task
Feature value: the realization of the feature in the text
Words present in text
Frequency of word
Are there dates? Yes/no
Are there PERSONS? Yes/no
Are there ORGANIZATIONS? Yes/no
WordNet: Holonyms (China is part of Asia),
Synonyms(China, People's Republic of China, mainland China)
18
Feature Types
Boolean (or Binary) Features
Features that generate boolean (binary) values.
Boolean features are the simplest and the most
common type of feature.

$f_1(\text{text}) = 1$ if text contains "elections", $0$ otherwise
$f_2(\text{text}) = 1$ if text contains a PERSON, $0$ otherwise
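
A boolean word feature can be sketched as a small function factory. The names `tokenize` and `contains_word` and the crude regex tokenizer are assumptions for illustration. A PERSON feature would additionally need a named-entity tagger, which is not shown here.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def contains_word(word):
    """Build a boolean feature: 1 if `word` occurs in the text, else 0."""
    def feature(text):
        return 1 if word in tokenize(text) else 0
    return feature


f1 = contains_word("elections")
print(f1("The elections were held on Sunday."))        # 1
print(f1("Schumacher took on the Shanghai circuit."))  # 0
```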

19
Feature Types
Integer Features
Features that generate integer values.
Integer features can be used to give classifiers access to
more precise information about the text.

$f_1(\text{text})$ = number of times "elections" occurs
$f_2(\text{text})$ = number of times a PERSON occurs
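
An integer word-count feature can be sketched with `collections.Counter`. The names `word_counts` and `count_of`, the regex tokenizer, and the example sentence are hypothetical; counting PERSON mentions would again require a named-entity tagger.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def count_of(word):
    """Build an integer feature: number of times `word` occurs in the text."""
    def feature(text):
        return word_counts(text)[word]
    return feature


f1 = count_of("elections")
text = "Elections matter. Local elections and national elections differ."
print(f1(text))  # 3
```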
20
χ² statistic (CHI)
χ² statistic (pronounced "kai square")
A commonly used method of comparing proportions.
Measures the lack of independence between a term and
a category
21
χ² statistic (CHI)
Is "jaguar" a good predictor for the "auto" class?

                Term = jaguar   Term ≠ jaguar
Class = auto          2              500
Class ≠ auto          3             9500

We want to compare:
the observed distribution above; and
the null hypothesis: that jaguar and auto are independent
22
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?
If independent: $P_r(j,a) = P_r(j)\,P_r(a)$
So there would be $N \cdot P_r(j,a)$, i.e. $N \cdot P_r(j)\,P_r(a)$,
co-occurrences of jaguar and auto
$P_r(j) = (2+3)/N$; $P_r(a) = (2+500)/N$; $N = 2+3+500+9500 = 10005$
$N \cdot (5/N) \cdot (502/N) = 2510/N = 2510/10005 \approx 0.25$

                Term = jaguar   Term ≠ jaguar
Class = auto          2              500
Class ≠ auto          3             9500
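
The expected co-occurrence count under independence can be checked with a few lines of arithmetic. This is a minimal sketch using the four observed counts from the table; the variable names are assumptions for illustration.

```python
# Observed counts from the table: rows are class, columns are term.
jaguar_auto = 2        # term = jaguar, class = auto
other_auto = 500       # term != jaguar, class = auto
jaguar_other = 3       # term = jaguar, class != auto
other_other = 9500     # term != jaguar, class != auto

n = jaguar_auto + other_auto + jaguar_other + other_other  # 10005

p_jaguar = (jaguar_auto + jaguar_other) / n   # Pr(j) = 5 / N
p_auto = (jaguar_auto + other_auto) / n       # Pr(a) = 502 / N

# Under independence Pr(j, a) = Pr(j) * Pr(a), so the expected count is:
expected_jaguar_auto = n * p_jaguar * p_auto
print(round(expected_jaguar_auto, 2))  # 0.25
```

The observed count (2) is about eight times the expected count (0.25), which is the kind of gap the χ² statistic quantifies.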
23
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?

                Term = jaguar   Term ≠ jaguar
Class = auto      2 (0.25)         500
Class ≠ auto      3               9500

observed: $f_o$; expected: $f_e$ (in parentheses)
24
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?

                Term = jaguar   Term ≠ jaguar
Class = auto      2 (0.25)         500 (502)
Class ≠ auto      3 (4.75)        9500 (9498)

observed: $f_o$; expected: $f_e$ (in parentheses)
25
χ² statistic (CHI)
χ² is interested in $(f_o - f_e)^2 / f_e$ summed over all table entries:

$$\chi^2(j,a) = \sum \frac{(O - E)^2}{E} = \frac{(2 - .25)^2}{.25} + \frac{(3 - 4.75)^2}{4.75} + \frac{(500 - 502)^2}{502} + \frac{(9500 - 9498)^2}{9498} = 12.9 \quad (p < .001)$$

The null hypothesis is rejected with confidence .999,
since 12.9 > 10.83 (the value for .999 confidence).

                Term = jaguar   Term ≠ jaguar
Class = auto      2 (0.25)         500 (502)
Class ≠ auto      3 (4.75)        9500 (9498)

observed: $f_o$; expected: $f_e$ (in parentheses)
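
The sum over table entries can be sketched as a short function that derives each expected count from the row and column totals. The name `chi_square` and the list-of-rows layout are assumptions for illustration. With unrounded expected counts the result is about 12.85 rather than the 12.9 obtained from the rounded values shown in the table; the conclusion is the same.

```python
def chi_square(observed):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected under independence
            statistic += (f_o - f_e) ** 2 / f_e
    return statistic


# Rows: class = auto, class != auto. Columns: term = jaguar, term != jaguar.
observed = [[2, 500], [3, 9500]]
chi2 = chi_square(observed)
print(round(chi2, 2))
print(chi2 > 10.83)  # True: exceeds the threshold quoted on the slide
```

For feature selection, the same function would be applied to every term/category pair, keeping the terms with the highest scores.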
26
χ² statistic (CHI)
There is a simpler formula for χ²:

$$\chi^2(t,c) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}$$

N = A + B + C + D
A = #(t, c)     C = #(¬t, c)
B = #(t, ¬c)    D = #(¬t, ¬c)
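
A minimal sketch of the shortcut, assuming it is the standard closed form for a 2×2 table, N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), with A the count of documents containing the term in the category, B the term outside the category, C the category without the term, and D neither. The function name `chi_square_2x2` is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut chi-square for a 2x2 term/category table.

    a = #(t, c), b = #(t, not c), c = #(not t, c), d = #(not t, not c).
    """
    n = a + b + c + d
    numerator = n * (a * d - c * b) ** 2
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return numerator / denominator


# jaguar/auto counts from the earlier table: A=2, B=3, C=500, D=9500.
print(round(chi_square_2x2(2, 3, 500, 9500), 2))
```

The shortcut needs only the four observed counts, so no expected-count table has to be built.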
