CSC 7810: Data Mining:


Algorithms and Applications
Lecture 2:
Text Mining
Mining Text Data: An Introduction

Data Mining / Knowledge Discovery
  The same information may appear as structured data, free text, hypertext
  (Web), or multimedia. For example:

  Structured data:
    HomeLoan (
      Loanee: Frank Rizzo
      Lender: MWF
      Agency: Lake View
      Amount: $200,000
      Term: 15 years
    )

  Free text:
    Frank Rizzo bought his home from Lake View Real Estate in 1992.
    He paid $200,000 under a 15-year loan from MW Financial.

  Hypertext:
    <a href>Frank Rizzo</a> bought <a href>this home</a> from
    <a href>Lake View Real Estate</a> in <b>1992</b>.
    <p>...  Loans($200K, [map], ...)
This Lecture

                            Structured Data    Unstructured Data (Text)
  Search (goal-oriented)    Data Retrieval     Information Retrieval
  Discover (opportunistic)  Data Mining        Text Mining
An Overview
  Introduction to Information Retrieval
  Text Pre-processing
  Retrieval Models
  Computing Similarity
  Latent Semantic Indexing
  The Mining Tasks
An Information Retrieval System
  Given:
    A source of textual documents
    A user query (text based)
  Find:
    A set (ranked) of documents that are relevant to the query

  [Diagram: the document source and the query (e.g., text) feed the IR
   system, which returns a ranked list of documents.]
An Ideal IR System
  Scope: able to search every document on the Internet
  Speed: runs almost in real time
  Recall: always finds every document relevant to our query
  Precision: no irrelevant documents in our result set
  Ranking: the most relevant results come first
Basic Measures for Text Retrieval
  Precision: the percentage of retrieved documents that are in fact relevant
  to the query (i.e., correct responses)
  Recall: the percentage of documents that are relevant to the query and
  were, in fact, retrieved

    precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

    recall    = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

  [Venn diagram: within the set of all documents, the Relevant and Retrieved
   sets overlap in the "Relevant & Retrieved" region.]
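As a minimal sketch of the two definitions above, the snippet below computes
precision and recall from sets of document IDs; the IDs and set contents are
invented for illustration only.

    # Precision and recall from sets of document IDs (illustrative values).
    relevant = {"d1", "d2", "d3", "d4"}          # documents relevant to the query
    retrieved = {"d2", "d3", "d5", "d6", "d7"}   # documents returned by the IR system

    hits = relevant & retrieved                  # Relevant AND Retrieved
    precision = len(hits) / len(retrieved)       # 2 / 5 = 0.40
    recall = len(hits) / len(relevant)           # 2 / 4 = 0.50
    print(precision, recall)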
F-measure
  Confusion matrix of counts:

                            PREDICTED CLASS
                            Class=Yes   Class=No
  ACTUAL CLASS  Class=Yes       a           b
                Class=No        c           d

    Precision (p) = a / (a + c)
    Recall    (r) = a / (a + b)
    F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

  F-measure is the harmonic mean between r and p
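A minimal sketch of these three formulas, using made-up counts for a, b and c
(they are not taken from any slide):

    # Precision, recall and F-measure from the confusion-matrix counts.
    a, b, c = 70, 30, 10      # a = true positives, b = false negatives, c = false positives

    p = a / (a + c)           # precision = 0.875
    r = a / (a + b)           # recall    = 0.70
    f = 2 * r * p / (r + p)   # harmonic mean = 2a / (2a + b + c) ~= 0.778
    print(round(p, 3), round(r, 3), round(f, 3))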
Evaluation: Precision and Recall
  Given a query:
    Are all retrieved documents relevant?
    Have all the relevant documents been retrieved?
  Measures for system performance:
    The first question is about the precision of the search
    The second is about the completeness (recall) of the search

Precision-Recall Curve
Compare Different Retrieval Algorithms
  Compare with multiple queries
  Compute the average precision at each recall level
  Draw precision-recall curves
  Do not forget the F-score evaluation measure
Rank Precision
  Compute the precision values at some selected rank positions.
  Mainly used in Web search evaluation.
  For a Web search engine, we can compute precisions for the top 5, 10, 15,
  20, 25 and 30 returned pages, as the user seldom looks at more than 30 pages.
  Recall is not very meaningful in Web search.
Information Retrieval (IR)
  A field developed in parallel with database systems
  Information retrieval vs. database systems
    Some DB problems are not present in IR, e.g., update, transaction
    management, complex objects
    Some IR problems are not addressed well in DBMS, e.g., unstructured
    documents, approximate search using keywords and relevance
  Problem definition: locating relevant documents based on user input, such
  as keywords or example documents
  Examples: Web search engines, online library catalogs
Information Retrieval (IR)
  Conceptually, IR is the study of finding needed information, i.e., IR helps
  users find information that matches their information needs
    Expressed as queries
  Historically, IR is about document retrieval, emphasizing the document as
  the basic unit
    Finding documents relevant to user queries
  Technically, IR studies the acquisition, organization, storage, retrieval,
  and distribution of information

IR Architecture
Semi-structured Data: Applications
  A significant proportion of information of great potential value is stored
  in documents, e.g., medical records
  Marketing: discover distinct groups of potential buyers according to a
  user text-based profile, e.g., Amazon
  Industry: identifying groups of competitors' web pages, e.g., competing
  products and their prices
  Job seeking: identify parameters in searching for jobs
  Spam filtering: identifying spam emails automatically
  Document summarization: summarize a set of documents and reduce the user's
  burden, e.g., Google News
Structuring Textual Information
  Many methods are designed to analyze structured data
  If we can represent documents with a set of attributes, then we can use
  existing data mining methods
  How to represent a document?
    Build a structured representation, then apply DM methods to find patterns
    among documents
Definition
  The non-trivial extraction of implicit, previously unknown, and potentially
  useful information from (large amounts of) textual data.
  An exploration and analysis of textual (natural-language) data by automatic
  and semi-automatic means to discover new knowledge.
Text Mining Process

Bag-of-Words Approach
  Feature extraction turns documents into token sets: a text document is
  represented by the words it contains (and their occurrences).
  For example, the excerpt

    "Four score and seven years ago our fathers brought forth on this
    continent, a new nation, conceived in Liberty, and dedicated to the
    proposition that all men are created equal. Now we are engaged in a
    great civil war, testing whether that nation, or ..."

  is reduced to word counts such as:

    nation   5
    civil    1
    war      2
    men      2
    died     4
    people   5
    Liberty  1
    God      1

  e.g., "Lord of the rings" -> {"the", "Lord", "rings", "of"}
  Highly efficient
  Makes learning far simpler and easier
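A minimal sketch of this representation in Python, using only the standard
library (the counts below cover just the quoted excerpt, so they are smaller
than the figures shown above):

    # Bag-of-words: a document becomes a dictionary of token counts.
    from collections import Counter
    import re

    doc = ("Four score and seven years ago our fathers brought forth on this "
           "continent, a new nation, conceived in Liberty, and dedicated to "
           "the proposition that all men are created equal.")

    tokens = re.findall(r"[a-z]+", doc.lower())   # lexical analysis: lowercase word tokens
    bag = Counter(tokens)                         # word -> number of occurrences
    print(bag["nation"], bag["men"])              # 1 1 (within this excerpt only)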
Text Pre-processing
  Strip unwanted characters/markup (e.g., HTML tags, punctuation, numbers, etc.)
  Break into tokens (keywords) on whitespace (lexical analysis)
  Stemming: identifies a word by its root
    e.g., flying, flew -> fly
    Reduces the dimensionality
  Stop words: the most common words are unlikely to help text mining
    e.g., "the", "a", "an", "you"

Stopword Removal
  Many of the most frequently used words in English are useless in IR and
  text mining; these words are called stop words.
    the, of, and, to, ...
    Typically about 400 to 500 such words
    For an application, an additional domain-specific stopword list may be
    constructed
  Why do we need to remove stopwords?
    Reduce indexing (or data) file size
      stopwords account for 20-30% of total word counts
    Improve efficiency and effectiveness
      stopwords are not useful for searching or text mining
      they may also confuse the retrieval system
Stemming
  Techniques used to find the root/stem of a word. E.g.,
    user, users, used, using          -> stem: use
    engineering, engineered, engineer -> stem: engineer
  Usefulness:
    Improving the effectiveness of IR and text mining
      matching similar words
      mainly improves recall
    Reducing indexing size
      combining words with the same roots may reduce indexing size by as
      much as 40-50%
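A minimal preprocessing sketch covering the steps above (markup stripping,
tokenization, stopword removal, stemming). The stop-word list and the crude
suffix-stripping "stemmer" are toy assumptions standing in for a real stop
list and a real stemmer such as Porter's:

    import re

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "you", "in"}  # tiny illustrative list
    SUFFIXES = ("ing", "ed", "ers", "er", "es", "s")                 # longest suffixes first

    def crude_stem(word):
        # Strip the first matching suffix if enough of the word remains.
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) > len(suf) + 2:
                return word[:-len(suf)]
        return word

    def preprocess(text):
        text = re.sub(r"<[^>]+>", " ", text)            # strip HTML tags
        tokens = re.findall(r"[a-z]+", text.lower())    # tokenize on non-letters
        return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

    print(preprocess("<b>Users</b> were engineering and flying to the airport"))
    # ['user', 'were', 'engineer', 'fly', 'airport']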
Example Stop List
  [Figure: a sample list of English stop words.]
Vector Space Model
  Documents and user queries are represented as m-dimensional vectors, where
  m is the total number of index terms in the document collection.
  The degree of similarity of the document d with regard to the query q is
  calculated as the correlation between the vectors that represent them,
  using measures such as the Euclidean distance or the cosine of the angle
  between these two vectors.

Vector Space Model
  Represent a doc by a term vector
    Term: basic concept, e.g., word or phrase
    Each term defines one dimension
    N terms define an N-dimensional space
    Each element of the vector corresponds to a term weight
    E.g., d = (x1, ..., xN), where xi is the importance of term i
  A new document is assigned to the most likely category based on vector
  similarity.
Salton's Vector Space Model
  Gerald Salton, 1960s-70s
  Represent each document by a high-dimensional vector in the space of words

Vector Space Model
  How to determine the important words in a document?
    Word n-grams (and phrases, idioms, ...) -> terms
  How to determine the degree of importance of a term within a document and
  within the entire collection?
  How to determine the degree of similarity between a document and the query?
Document Collection
  A collection of n documents can be represented in the vector space model
  by a term-document matrix.
  An entry in the matrix corresponds to the weight of a term in the document;
  zero means the term has no significance in the document or it simply does
  not exist in the document.

         T1    T2    ...   Tt
    D1   w11   w21   ...   wt1
    D2   w12   w22   ...   wt2
    :     :     :           :
    Dn   w1n   w2n   ...   wtn
How to Assign Weights
  Two-fold weights based on frequency
  TF (term frequency)
    More frequent within a document -> more relevant to semantics
    e.g., "query" vs. "commercial"
  IDF (inverse document frequency)
    Less frequent among documents -> more discriminative
    e.g., "algebra" vs. "science"
Term Weights: Term Frequency
  More frequent terms in a document are more important, i.e., more indicative
  of the topic
    f_ij = frequency of term i in document j
  May want to normalize term frequency (tf) across the entire corpus because
  document length may vary

Term Weights: Inverse Document Frequency
  Terms that appear in many different documents are less indicative of the
  overall topic
    df_i  = document frequency of term i
          = number of documents containing term i
    idf_i = inverse document frequency of term i
          = log2(N / df_i)        (N: total number of documents)
  An indication of a term's discriminative power
  The log is used to dampen the effect relative to tf
TF-IDF Weighting
  A typical combined term importance indicator is tf-idf weighting:
    w_ij = tf_ij * idf_i = tf_ij * log2(N / df_i)
  A term occurring frequently in the document but rarely in the rest of the
  collection is given high weight.
  Many other ways of determining term weights have been proposed.
  Experimentally, tf-idf has been found to work well.
Computing TF-IDF: An Example
  Given a document containing terms with the following frequencies:
    A(3), B(2), C(1)
  Assume the collection contains 10,000 documents and the document
  frequencies of these terms are:
    A(50), B(1300), C(250)
  Then (tf is normalized by the largest term frequency in the document, and
  the idf values here come out with the natural log):
    A: tf = 3/3; idf = log(10000/50)   = 5.3; tf-idf = 5.3
    B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
    C: tf = 1/3; idf = log(10000/250)  = 3.7; tf-idf = 1.2
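A minimal sketch that reproduces these numbers (natural log, tf normalized by
the maximum raw frequency, as in the example above):

    import math

    N = 10_000                             # documents in the collection
    tf_raw = {"A": 3, "B": 2, "C": 1}      # raw term counts in the document
    df = {"A": 50, "B": 1300, "C": 250}    # document frequencies in the collection

    max_tf = max(tf_raw.values())
    for term in tf_raw:
        tf = tf_raw[term] / max_tf         # normalized term frequency
        idf = math.log(N / df[term])
        print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
    # A 1.0  5.3 5.3
    # B 0.67 2.0 1.4   (the slide rounds idf to 2.0 before multiplying, giving 1.3)
    # C 0.33 3.7 1.2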
TF-IDF Weighting
  TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)
    Frequent within a doc -> high tf  -> high weight
    Selective among docs  -> high idf -> high weight
  Recall the VS model
    Each selected term represents one dimension
    Each doc is represented by a feature vector
    The t-th coordinate of document d is the TF-IDF weight of term t
  This is a simple yet effective scheme. Many more complex and more effective
  weighting variants exist in practice.

Query Vector
  The query vector is typically treated as a document and is also tf-idf
  weighted.
  Alternatively, the user can supply weights for the given query terms.
Similarity Measure
  A similarity measure is a function that computes the degree of similarity
  between two vectors.
  Using a similarity measure between the query and each document:
    It is possible to rank the retrieved documents in the order of presumed
    relevance.
    It is possible to enforce a certain threshold so that the size of the
    retrieved set can be controlled.

How to Measure Similarity?
  Given two documents, cosine similarity (the normalized dot product)
  measures the angle between the document vectors:
    small angle = large cosine = similar
    large angle = small cosine = dissimilar
Similarity Measure: Inner Product
  Similarity between the vector for document d_j and query q can be computed
  as the vector inner product:

    sim(d_j, q) = d_j . q = sum_{i=1..t} w_ij * w_iq

  where w_ij is the weight of term i in document j and w_iq is the weight of
  term i in the query.
  For binary vectors, the inner product is the number of matched query terms
  in the document (the size of the intersection).
  For weighted term vectors, it is the sum of the products of the weights of
  the matched terms.
  Asymmetric attributes: it measures how many terms matched, but not how many
  terms are not matched.
Inner Product: Examples
  Binary:
    D = (1, 1, 1, 0, 1, 1, 0)
    Q = (1, 0, 1, 0, 0, 1, 1)
    sim(D, Q) = 3
    Size of vector = size of vocabulary = 7
    0 means the corresponding term is not found in the document or query
  Weighted:
    D1 = 2T1 + 3T2 + 5T3
    D2 = 3T1 + 7T2 + 1T3
    Q  = 0T1 + 0T2 + 2T3
    sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
    sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
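A minimal sketch of the inner product for both examples above:

    def dot(d, q):
        return sum(di * qi for di, qi in zip(d, q))

    # Binary vectors over a 7-word vocabulary
    D = [1, 1, 1, 0, 1, 1, 0]
    Q = [1, 0, 1, 0, 0, 1, 1]
    print(dot(D, Q))                   # 3 matched query terms

    # Weighted vectors over terms (T1, T2, T3)
    D1, D2, Q2 = [2, 3, 5], [3, 7, 1], [0, 0, 2]
    print(dot(D1, Q2), dot(D2, Q2))    # 10 2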
Graphic Representation
  Example, plotted in the three-dimensional term space (T1, T2, T3):
    D1 = 2T1 + 3T2 + 5T3
    D2 = 3T1 + 7T2 + T3
    Q  = 0T1 + 0T2 + 2T3
  Is D1 or D2 more similar to Q?
  How to measure the degree of similarity? Distance? Angle? Projection?
Cosine Similarity Measure
  Cosine similarity measures the cosine of the angle between two vectors:
  the inner product normalized by the vector lengths.

    CosSim(d_j, q) = (d_j . q) / (|d_j| * |q|)
                   = sum_{i=1..t} (w_ij * w_iq)
                     / sqrt( sum_{i=1..t} w_ij^2 * sum_{i=1..t} w_iq^2 )

  For the previous example:
    D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
    D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) =  2 / sqrt((9+49+1)(0+0+4)) = 0.13
    Q  = 0T1 + 0T2 + 2T3

  D1 is 6 times better than D2 using cosine similarity, but only 5 times
  better using the inner product.
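A minimal sketch that reproduces these two cosine values:

    import math

    def cos_sim(d, q):
        dot = sum(di * qi for di, qi in zip(d, q))
        norm_d = math.sqrt(sum(x * x for x in d))
        norm_q = math.sqrt(sum(x * x for x in q))
        return dot / (norm_d * norm_q)

    D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
    print(round(cos_sim(D1, Q), 2))    # 0.81
    print(round(cos_sim(D2, Q), 2))    # 0.13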
Some Problems
  Reduce dimensionality
    DM algorithms have difficulty addressing high-dimensionality tasks
  Irrelevant features
    Not all features help!
    e.g., the existence of a noun in a news article is unlikely to help
    classify it as politics or sport
  Synonymy: many ways to refer to the same object,
    e.g., car and automobile, buy and purchase
    many words with the same meaning -> term matching misses relevant
    documents -> poor recall
  Polysemy: most words have more than one distinct meaning,
    e.g., model, python, Hilton
    words with a number of meanings -> term matching returns irrelevant
    documents -> poor precision
  Lexical matching at the term level is inaccurate, and hence we need a
  method that models concepts and not just words

Looking for What?
  Search: "Paris Hilton"
    Really interested in the Hilton Hotel in Paris?
  Search: "Tiger Woods"
    Searching for something about wildlife or the famous golf player?
  Simple word matching fails
Mapping to Concept Space
  Latent Semantic Indexing
    Concepts instead of words
    A mathematical model that relates documents and concepts
    Looks for concepts in the documents and stores them in a concept space
      related documents are connected to form a concept space
    Does not need an exact match for the query

How to Obtain a Concept Space?
  Domain expert: one possible way would be to find canonical representations
  of natural language
    a difficult task to achieve, even for an expert
  Algebraic: Latent Semantic Indexing (LSI)
    uses mathematical properties of the term-document matrix, i.e., Singular
    Value Decomposition (SVD) can determine the concepts by matrix computation
  Key idea: map documents and queries into a lower-dimensional space composed
  of higher-level concepts, which are fewer in number than the index terms
  Dimensionality reduction: retrieval (and clustering) in a reduced concept
  space might be superior to retrieval in the high-dimensional space of index
  terms

Google Uses LSI
  for ranking pages
  The ~ sign before a search term stands for semantic search
    ~phone: the first link that appears is the page for Nokia, although the
    page does not contain the word "phone"
    ~humor: retrieved pages contain its synonyms: comedy, jokes, funny
Latent Semantic Indexing
  "We would like a representation in which a set of terms, which by itself is
  incomplete and unreliable evidence of the relevance of a given document, is
  replaced by some other set of entities which are more reliable indicants.
  We take advantage of the implicit higher-order (or latent) structure in the
  association of terms and documents to reveal such relationships."
  Fundamental paper: Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas,
  G. W. and Harshman, R. A. (1990) "Indexing by latent semantic analysis."
  Journal of the American Society for Information Science, 41(6), 391-407.

Latent Semantic Analysis (LSA)
  LSA aims to discover something about the meaning behind the words: about
  the topics in the documents.
  What is the difference between topics and words?
    Words are observable
    Topics are not; they are latent.
  How to find the topics from the words in an automatic way?
    We can imagine them as a compression of words
    A combination of words
    Try to formalise this
SVD Basics
  A method for rotating the axes in n-dimensional space, so that the first
  axis runs along the direction of the largest variation among the documents,
  the second dimension runs along the direction with the second-largest
  variation, and so on.
  SVD of the term-by-document matrix X:

    X = T0 S0 D0'

  If the singular values in S0 are ordered by size, we keep only the k
  largest values and get a reduced model:

    X^ = T S D'

  X^ does not exactly match X, and it gets closer as more and more singular
  values are kept.
  It reflects the major associative patterns in the data, and ignores the
  smaller, less important influences and noise.
  It produces a k-dimensional approximation of the original matrix (in the
  least-squares sense); this is the semantic space.
  Compute similarities between entities in the semantic space (usually with
  the cosine).
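A minimal LSI sketch with NumPy: decompose a term-document matrix, keep the
k largest singular values, and rebuild a rank-k approximation. The random
matrix below is only a stand-in for real data; k = 2 mirrors the example that
follows.

    import numpy as np

    def lsi_approximation(X, k):
        # X = T * diag(s) * Dt, with singular values in s sorted in decreasing order
        T, s, Dt = np.linalg.svd(X, full_matrices=False)
        Tk, Sk, Dtk = T[:, :k], np.diag(s[:k]), Dt[:k, :]   # keep k largest singular values
        X_hat = Tk @ Sk @ Dtk                               # rank-k approximation (semantic space)
        return X_hat, Tk, Sk, Dtk

    # A random 12-terms-by-9-documents count matrix stands in for the real data.
    X = np.random.default_rng(0).integers(0, 3, size=(12, 9)).astype(float)
    X_hat, Tk, Sk, Dtk = lsi_approximation(X, k=2)
    print(X_hat.shape)    # (12, 9): same shape, but only the two strongest patterns remain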
A Simple Example: Technical Memo Titles
  c1: Human machine interface for ABC computer applications
  c2: A survey of user opinion of computer system response time
  c3: The EPS user interface management system
  c4: System and human system engineering testing of EPS
  c5: Relation of user perceived response time to error measurement
  m1: The generation of random, binary, ordered trees
  m2: The intersection graph of paths in trees
  m3: Graph minors IV: Widths of trees and well-quasi-ordering
  m4: Graph minors: A survey
  (Index terms are italicized in the original.)
c1 c2 c3 c4 c5 m1 m2 m3 m4
human 1 0 0 1 0 0 0 0 0
interface 1 0 1 0 0 0 0 0 0
computer 1 1 0 0 0 0 0 0 0
user 0 1 1 0 1 0 0 0 0
system 0 1 1 2 0 0 0 0 0
response 0 1 0 0 1 0 0 0 0
time 0 1 0 0 1 0 0 0 0
EPS 0 0 1 1 0 0 0 0 0
survey 0 1 0 0 0 0 0 0 1
trees 0 0 0 0 0 1 1 1 0
graph 0 0 0 0 0 0 1 1 1
minors 0 0 0 0 0 0 0 1 1
r(human, user) = -.38     r(human, minors) = -.29
A Simple Example
  Break down the matrix:
    {A} = {U} {S} {V}^T = {T} {S} {D}^T
    T = term matrix; S = singular values; D = document matrix
  Dimension reduction

Singular Value Decomposition
r(human, user) = .94     r(human, minors) = -.83
c1 c2 c3 c4 c5 m1 m2 m3 m4
human 0.16 0.40 0.38 0.47 0.18 -0.05 -0.12 -0.16 -0.09
interface 0.14 0.37 0.33 0.40 0.16 -0.03 -0.07 -0.10 -0.04
computer 0.15 0.51 0.36 0.41 0.24 0.02 0.06 0.09 0.12
user 0.26 0.84 0.61 0.70 0.39 0.03 0.08 0.12 0.19
system 0.45 1.23 1.05 1.27 0.56 -0.07 -0.15 -0.21 -0.05
response 0.16 0.58 0.38 0.42 0.28 0.06 0.13 0.19 0.22
time 0.16 0.58 0.38 0.42 0.28 0.06 0.13 0.19 0.22
EPS 0.22 0.55 0.51 0.63 0.24 -0.07 -0.14 -0.20 -0.11
survey 0.10 0.53 0.23 0.21 0.27 0.14 0.31 0.44 0.42
trees -0.06 0.23 -0.14 -0.27 0.14 0.24 0.55 0.77 0.66
graph -0.06 0.34 -0.15 -0.30 0.20 0.31 0.69 0.98 0.85
minors -0.04 0.25 -0.10 -0.21 0.15 0.22 0.50 0.71 0.62
Latent Semantic Indexing
  [Diagram: Singular Value Decomposition of the m x n terms-by-documents
   matrix X into T (m x r), S (r x r diagonal matrix of singular values),
   and D' (r x n).]
  [Diagram: selecting the first k singular values gives the approximation
   X^ = T (m x k) * S (k x k) * D' (k x n).]
SVD with Minor Terms Dropped
  TS defines the coordinates for documents in the latent space
Terms Graphed in Two Dimensions
  [Scatter plot of the first two SVD dimensions for the terms: human,
   interface, computer, user, system, response, time, EPS, survey, trees,
   graph, minors.]
Documents and Terms
  [The same two-dimensional plot, now showing the documents c1-c5 and m1-m4
   together with the terms.]
Change in Text Correlation
Correlations between texts in the raw data
c1 c2 c3 c4 c5 m1 m2 m3 m4
c1 1.000
c2 -0.192 1.000
c3 0.000 0.000 1.000
c4 0.000 0.000 0.472 1.000
c5 -0.333 0.577 0.000 -0.309 1.000
m1 -0.174 -0.302 -0.213 -0.161 -0.174 1.000
m2 -0.258 -0.447 -0.316 -0.239 -0.258 0.674 1.000
m3 -0.333 -0.577 -0.408 -0.309 -0.333 0.522 0.775 1.000
m4 -0.333 -0.192 -0.408 -0.309 -0.333 -0.174 0.258 0.556 1.000
Correlations in two-dimensional space
c1 c2 c3 c4 c5 m1 m2 m3 m4
c1 1.000
c2 0.910 1.000
c3 1.000 0.912 1.000
c4 0.998 0.884 0.998 1.000
c5 0.842 0.990 0.844 0.809 1.000
m1 -0.858 -0.568 -0.856 -0.887 -0.445 1.000
m2 -0.853 -0.562 -0.851 -0.883 -0.438 1.000 1.000
m3 -0.852 -0.559 -0.850 -0.881 -0.435 1.000 1.000 1.000
m4 -0.811 -0.497 -0.809 -0.845 -0.368 0.996 0.997 0.997 1.000
Summary of LSI
  Some issues
    Finding the optimal dimension for the semantic space
      precision-recall improves as the dimension is increased until it hits
      the optimum, then slowly decreases until it hits the standard vector
      model
      run SVD once with a big dimension, say k = 1000, then test dimensions
      <= k
      in many tasks 150-350 works well; still room for research
Types of Text Data Mining
  Keyword-based association analysis
  Automatic document classification
  Similarity detection
    Cluster documents by a common author
    Cluster documents containing information from a common source
  Literature mining: finding new things from research papers
  Anomaly detection: find information that violates usual patterns
  Hypertext analysis
Text Categorization
  Pre-given categories and labeled document examples (categories may form a
  hierarchy)
  Classify new documents
  A standard classification (supervised learning) problem; may also be
  multi-label classification
  [Diagram: documents flow into a categorization system and come out labeled
   with categories such as Sports, Business, Education, Science.]
Document Clustering
  Motivation
    Automatically group related documents based on their contents
    No predetermined training sets or taxonomies
    Generate a taxonomy at runtime
  Clustering process
    Data preprocessing: remove stop words, stem, feature extraction, lexical
    analysis, etc.
    Hierarchical clustering: compute similarities and apply clustering
    algorithms
    Model-based clustering
Min-Apriori (Han et al.)
  Document-term matrix:

    TID  W1  W2  W3  W4  W5
    D1    2   2   0   0   1
    D2    0   0   1   2   2
    D3    2   3   0   0   0
    D4    0   0   1   0   1
    D5    1   1   1   0   2

  Example:
    W1 and W2 tend to appear together in the same document
    (e.g., W1 = "data" and W2 = "mining")
Min-Apriori
  The data contains only continuous attributes of the same type
    e.g., frequency of words in a document
  Potential solution:
    Convert into a 0/1 matrix and then apply existing algorithms
      but this loses the word frequency information

    TID  W1  W2  W3  W4  W5
    D1    2   2   0   0   1
    D2    0   0   1   2   2
    D3    2   3   0   0   0
    D4    0   0   1   0   1
    D5    1   1   1   0   2
Min-Apriori
  How to determine the support of a word?
    If we simply sum up its frequency, the support count will be greater
    than the total number of documents!
    Normalize the word vectors, e.g., using the L1 norm (divide each word
    column by its column sum)
    Each word then has a support equal to 1.0

  Raw counts:
    TID  W1  W2  W3  W4  W5
    D1    2   2   0   0   1
    D2    0   0   1   2   2
    D3    2   3   0   0   0
    D4    0   0   1   0   1
    D5    1   1   1   0   2

  Normalized:
    TID   W1    W2    W3    W4    W5
    D1   0.40  0.33  0.00  0.00  0.17
    D2   0.00  0.00  0.33  1.00  0.33
    D3   0.40  0.50  0.00  0.00  0.00
    D4   0.00  0.00  0.33  0.00  0.17
    D5   0.20  0.17  0.33  0.00  0.33
Min-Apriori
  New definition of support:

    sup(C) = Σ_{i ∈ T} min_{j ∈ C} D(i, j)
    (sum over documents i of the smallest normalized weight among the
     itemset's words j)

  Example:
    sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17

    TID   W1    W2    W3    W4    W5
    D1   0.40  0.33  0.00  0.00  0.17
    D2   0.00  0.00  0.33  1.00  0.33
    D3   0.40  0.50  0.00  0.00  0.00
    D4   0.00  0.00  0.33  0.00  0.17
    D5   0.20  0.17  0.33  0.00  0.33
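A minimal sketch of this support computation with NumPy, using the raw
document-term matrix above (column labels W1..W5 map to 0-based indices):

    import numpy as np

    # Rows D1..D5, columns W1..W5 (raw counts)
    D = np.array([[2, 2, 0, 0, 1],
                  [0, 0, 1, 2, 2],
                  [2, 3, 0, 0, 0],
                  [0, 0, 1, 0, 1],
                  [1, 1, 1, 0, 2]], dtype=float)

    Dn = D / D.sum(axis=0)                 # L1-normalize: each column now sums to 1.0

    def support(itemset):                  # itemset = list of 0-based column indices
        return Dn[:, itemset].min(axis=1).sum()

    print(round(support([0]), 2))          # sup(W1)         = 1.0
    print(round(support([0, 1]), 2))       # sup(W1, W2)     = 0.9
    print(round(support([0, 1, 2]), 2))    # sup(W1, W2, W3) = 0.17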
Anti-monotone Property of Support
  Example:
    sup(W1)         = 0.4 + 0 + 0.4 + 0 + 0.2   = 1.0
    sup(W1, W2)     = 0.33 + 0 + 0.4 + 0 + 0.17 = 0.9
    sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17      = 0.17

    TID   W1    W2    W3    W4    W5
    D1   0.40  0.33  0.00  0.00  0.17
    D2   0.00  0.00  0.33  1.00  0.33
    D3   0.40  0.50  0.00  0.00  0.00
    D4   0.00  0.00  0.33  0.00  0.17
    D5   0.20  0.17  0.33  0.00  0.33

  Adding a word to an itemset can only keep each per-document minimum the
  same or lower it, so support never increases.
Text Classification
  Motivation
    Automatic classification of the large number of online text documents
    (Web pages, e-mails, corporate intranets, etc.)
  Classification process
    Data preprocessing
    Definition of training and test sets
    Creation of the classification model using the selected classification
    algorithm
    Classification model validation
    Classification of new/unknown text documents
An Overview
  Introduction to Information Retrieval
  Text Pre-processing
  Retrieval Models
  Computing Similarity
  Latent Semantic Indexing
  The Mining Tasks
