
Advanced Analytical Theory and Methods: Text Analysis

Learning Objectives
• Describe text mining and understand the need for 
text mining
• Differentiate between text mining and data mining
• Understand the different application areas for text 
mining
• Know the process of carrying out a text mining 
project
• Understand the different methods to introduce 
structure to text‐based data


Motivation


Text Mining Concepts
• 85‐90 percent of all corporate data is in some kind 
of unstructured form (e.g., text) 
• Unstructured corporate data is doubling in size 
every 18 months
• Tapping into these information sources is not optional, but a necessity for staying competitive
• What is text mining?
• A semi‐automated process of extracting knowledge 
from unstructured data sources
• a.k.a. text data mining or knowledge discovery in 
textual databases

Introduction 
• Also called Text Analytics 
• Text Mining forms a core step in text analysis
• Examples of text data:
News articles
Literature
E-mail
Web pages
Server logs
Social network API firehoses
Call center transcripts


Data sources with formats 

Data Mining versus Text Mining
• Both seek novel and useful patterns
• Both are semi‐automated processes
• Difference is the nature of the data: 
• Structured versus unstructured data
• Structured data: in databases
• Unstructured data: Word documents, PDF files, text 
excerpts, XML files, and so on
• Text mining: first impose structure on the data, then mine the structured data


Text Mining Concepts
• Benefits of text mining are obvious especially in 
text‐rich data environments
• e.g., law (court orders), academic research (research 
articles), finance (quarterly reports), medicine (discharge 
summaries), biology (molecular interactions), technology 
(patent files), marketing (customer comments), etc.  
• Electronic communication records (e.g., email)
• Spam filtering
• Email prioritization and categorization
• Automatic response generation

Text Mining
• Typically falls into one of two categories
• Analysis of text:  I have a bunch of text I am interested in, tell me something 
about it
• E.g. sentiment analysis, “buzz” searches
• Retrieval:  There is a large corpus of text documents, and I want the one 
closest to a specified query
• E.g. web search, library catalogs, legal and medical precedent studies


Text Mining: Analysis
• Which words occur most frequently
• Which words are most surprising
• Which words help define the document
• What are the interesting text phrases?

Text Mining:  Retrieval
• Find k objects in the corpus of documents which are most similar to 
my query. 
• Can be viewed as “interactive” data mining ‐ query not specified a 
priori.  
• Main problems of text retrieval:
• What does “similar” mean?
• How do I know if I have the right documents? 
• How can I incorporate user feedback?


Text Mining Process
Context diagram for the text mining process: unstructured data (text) and structured data (databases) feed into the activity "Extract knowledge from available data sources" (A0), which produces context-specific knowledge. Its constraints are software/hardware limitations, privacy issues and linguistic limitations; its mechanisms are domain expertise and tools and techniques.

Text Mining Process

The three-step text mining process: (1) establish the corpus, (2) create the term-by-document matrix, (3) extract patterns/knowledge


Text Mining Process
• Step 1: Establish the corpus
• Collect all relevant unstructured data (e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings…)
• Digitize and standardize the collection (e.g., all in ASCII text files)
• Place the collection in a common place (e.g., in a flat file, or in a directory as separate files); a small R sketch follows below
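A minimal R sketch of Step 1 using the tm package, assuming a directory of plain-text files (the directory name "corpus_docs" is hypothetical):

library(tm)                                     # text mining framework
# read every .txt file in the (hypothetical) folder corpus_docs into a corpus
corpus_1 <- Corpus(DirSource("corpus_docs", pattern = "\\.txt$", encoding = "UTF-8"))
inspect(corpus_1[1])                            # look at the first document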

Text Mining Process
• Step 2: Create the Term–by–Document Matrix


Text Mining Process
• Step 2: Create the Term–by–Document Matrix 
(TDM), cont.
• Should all terms be included?
• Stop words, include words
• Synonyms, homonyms
• Stemming
• What is the best representation of the indices (values in 
cells)? 
• Row counts; binary frequencies; log frequencies;
• Inverse document frequency
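The cell representations listed above correspond to weighting functions in the tm package; a hedged sketch (corpus_1 is the corpus built in Step 1):

library(tm)
tdm_counts <- TermDocumentMatrix(corpus_1)                     # raw term counts (default tf weighting)
tdm_binary <- TermDocumentMatrix(corpus_1,
                 control = list(weighting = weightBin))        # binary frequencies
tdm_tfidf  <- TermDocumentMatrix(corpus_1,
                 control = list(weighting = weightTfIdf))      # term frequency x inverse document frequency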

Text Mining Process
• Step 2: Create the Term–by–Document Matrix 
(TDM), cont.
• TDM is a sparse matrix. How can we reduce the 
dimensionality of the TDM?
• Manual ‐ a domain expert goes through it
• Eliminate terms with very few occurrences in very few 
documents (?)
• Transform the matrix using singular value decomposition (SVD) 
• SVD is closely related to principal component analysis
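A hedged sketch of the last two reduction options: dropping rare terms with tm's removeSparseTerms, then an SVD-based (latent semantic) projection with base R; the cut-offs are illustrative only:

library(tm)
tdm_small <- removeSparseTerms(tdm_counts, sparse = 0.95)   # drop terms absent from 95%+ of documents

m   <- as.matrix(tdm_small)                                  # terms x documents
dec <- svd(m)                                                # singular value decomposition
k   <- 10                                                    # number of dimensions to keep (illustrative)
docs_reduced <- diag(dec$d[1:k]) %*% t(dec$v[, 1:k])         # documents projected into k dimensions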


Text Mining Process
• Step 3: Extract patterns/knowledge
• Classification (text categorization)
• Clustering (natural groupings of text)
• Improve search recall
• Improve search precision
• Scatter/gather
• Query‐specific clustering
• Association
• Trend Analysis (…)

Text Retrieval: Challenges
• Calculating similarity is not obvious ‐ what is the distance 
between two sentences or queries? 
• Evaluating retrieval is hard:  what is the “right” answer ? (no 
ground truth)
• User can query things you have not seen before e.g. misspelled, 
foreign, new terms.
• The goal (score function) is different from classification/regression: we are not trying to model all of the data, just to return the best results for a given user.
• Words can hide semantic content
• Synonymy: A keyword T does not appear anywhere in the document, 
even though the document is closely related to T, e.g., data mining
• Polysemy: The same keyword may mean different things in different 
contexts, e.g., mining


Basic Measures for Text Retrieval

• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

Precision vs. Recall
                          Truth: Relevant   Truth: Not Relevant
Algorithm: Relevant             TP                 FP
Algorithm: Not Relevant         FN                 TN

• We've been here before!
• Precision = TP/(TP+FP)
• Recall = TP/(TP+FN)
• Trade-off:
• If the algorithm is 'picky': precision high, recall low
• If the algorithm is 'relaxed': precision low, recall high
• BUT: recall is often hard, if not impossible, to calculate
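A small R sketch of both measures on toy relevance labels (the data are made up for illustration):

truth     <- c(1, 1, 0, 1, 0, 0, 1, 0)    # 1 = relevant, 0 = not relevant
predicted <- c(1, 0, 0, 1, 1, 0, 1, 0)    # retrieval decision made by the algorithm

TP <- sum(predicted == 1 & truth == 1)
FP <- sum(predicted == 1 & truth == 0)
FN <- sum(predicted == 0 & truth == 1)

precision <- TP / (TP + FP)               # 3/4 = 0.75
recall    <- TP / (TP + FN)               # 3/4 = 0.75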


Precision Recall Curves
• If we have a labelled training set, we can calculate recall.
• For any given number of returned documents,  we can plot a point for 
precision vs. recall.  (similar to thresholds in ROC curves)
• Different retrieval algorithms might have very different curves ‐ hard 
to tell which is “best”

Term / document matrix
• Most common form of representation in text mining is the term ‐
document matrix
• Term: typically a single word, but could be a word phrase like “data mining”
• Document:  a generic term meaning a collection of text to be retrieved
• Can be large ‐ terms are often 50k or larger, documents can be in the billions 
(www). 
• Can be binary, or use counts


Term document matrix
Example: 10 documents, 6 terms

Database SQL Index Regression Likelihood linear


D1 24 21 9 0 0 3
D2 32 10 5 0 3 0
D3 12 16 5 0 0 0
D4 6 7 2 0 0 0
D5 43 31 20 0 3 0
D6 2 0 0 18 7 6
D7 0 0 1 32 12 0
D8 3 0 0 22 4 4
D9 1 0 0 34 27 25
D10 6 0 0 17 4 23

• Each document is now just a vector of terms (sometimes Boolean)

Term document matrix
• We have lost all semantic content

• Be careful constructing your term list!
• Not all words are created equal!
• Words that are the same should be treated the same!

• Stop Words
• Stemming


Stop words
• Many of the most frequently used words in English are worthless in 
retrieval and text mining – these words are called stop words.
• the, of, and, to, ….
• Typically about  400 to 500 such words
• For an application, an additional domain specific stop words list may be 
constructed
• Why do we need to remove stop words?
• Reduce indexing (or data) file size
• stop words account for 20-30% of total word counts
• Improve efficiency
• stop words are not useful for searching or text mining
• stop words always have a large number of hits
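A brief tm illustration of the built-in English stop word list and a domain-specific extension (the two extra words are hypothetical):

library(tm)
head(stopwords("english"))                     # e.g. "i" "me" "my" "myself" "we" "our"
my_stopwords <- c(stopwords("english"), "patient", "diagnosis")   # add domain-specific stop words
corpus_1 <- tm_map(corpus_1, removeWords, my_stopwords)           # remove them from the corpus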

Stemming
• Techniques used to find the root/stem of a word
• E.g., user, users, used, using → stem: use
• E.g., engineering, engineered, engineer → stem: engineer
Usefulness
• improving effectiveness of retrieval and text mining
• matching similar words
• reducing indexing size
• combining words with the same root may reduce indexing size by as much as 40-50%


Basic stemming methods

• remove ending
• if a word ends with a consonant other than s,
followed by an s, then delete s.
• if a word ends in es, drop the s.
• if a word ends in ing, delete the ing unless the remaining word consists only 
of one letter or of th.
• If a word ends with ed, preceded by a consonant, delete the ed unless this 
leaves only a single letter.
• …...
• transform words
• if a word ends with “ies” but not “eies” or “aies” then “ies ‐‐> y.”
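A minimal R sketch of a few of the suffix rules above (illustrative only; a real system would use a full stemmer such as SnowballC::wordStem):

simple_stem <- function(word) {
  if (grepl("ies$", word) && !grepl("(eies|aies)$", word))
    return(sub("ies$", "y", word))                      # "ies" -> "y"
  if (grepl("ing$", word)) {
    stem <- sub("ing$", "", word)
    if (nchar(stem) > 1 && stem != "th") return(stem)   # drop "ing" unless too little remains
  }
  if (grepl("[bcdfghjklmnpqrtvwxz]ed$", word)) {
    stem <- sub("ed$", "", word)
    if (nchar(stem) > 1) return(stem)                   # drop "ed" preceded by a consonant
  }
  if (grepl("es$", word))
    return(sub("s$", "", word))                         # ends in "es": drop the s
  if (grepl("[bcdfghjklmnpqrtvwxz]s$", word))
    return(sub("s$", "", word))                         # consonant other than s, then s: drop the s
  word
}

sapply(c("ponies", "using", "engineered", "engines", "users"), simple_stem)
# "pony"  "us"  "engineer"  "engine"  "user"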

Feature Selection
• Performance of text classification algorithms can be optimized 
by selecting only a subset of the discriminative terms
• Even after stemming and stopword removal. 

• Greedy search
• Start from full set and delete one at a time
• Find the least important variable
• Can use Gini index for this if a classification problem

• Often performance does not degrade even with orders of 
magnitude reductions
• Chakrabarti, Chapter 5: patent data, 9,600 patents in communication, electricity and electronics.
• Only 140 out of 20,000 terms needed for classification!


Distances in TD matrices
• Given a term-document matrix representation, we can now define distances between documents (or terms!)
• Elements of the matrix can be 0/1 or term frequencies (sometimes normalized)
• Can use Euclidean or cosine distance
• Cosine distance is based on the angle between the two document vectors
• Not intuitive, but has been proven to work well

• If two documents are identical, dc = 1; if they have nothing in common, dc = 0

• We can calculate cosine and Euclidean distance for this 
matrix
• What would you want the distances to look like?

Database SQL Index Regression Likelihood linear


D1 24 21 9 0 0 3
D2 32 10 5 0 3 0
D3 12 16 5 0 0 0
D4 6 7 2 0 0 0
D5 43 31 20 0 3 0
D6 2 0 0 18 7 6
D7 0 0 1 32 12 0
D8 3 0 0 22 4 4
D9 1 0 0 34 27 25
D10 6 0 0 17 4 23
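A hedged R sketch computing Euclidean and cosine measures for the matrix above (documents as rows):

td <- matrix(c(24,21, 9, 0, 0, 3,   32,10, 5, 0, 3, 0,   12,16, 5, 0, 0, 0,
                6, 7, 2, 0, 0, 0,   43,31,20, 0, 3, 0,    2, 0, 0,18, 7, 6,
                0, 0, 1,32,12, 0,    3, 0, 0,22, 4, 4,    1, 0, 0,34,27,25,
                6, 0, 0,17, 4,23),
             nrow = 10, byrow = TRUE,
             dimnames = list(paste0("D", 1:10),
                             c("Database","SQL","Index","Regression","Likelihood","Linear")))

dist(td)                                            # Euclidean distances between documents

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine_sim(td["D1", ], td["D2", ])                  # ~0.90: both are "database" documents
cosine_sim(td["D1", ], td["D9", ])                  # ~0.06: different topics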


Weighting in TD space
• Not all terms are of equal importance
• E.g. "David" is less important than "Beckham"
• If a term occurs frequently in many documents it has less discriminatory power
• One way to correct for this is inverse document frequency (IDF)

• Term importance = Term Frequency (TF) x IDF, where IDF = log(N/Nj)
• Nj = # of docs containing the term
• N = total # of docs
• A term is "important" if it has a high TF and/or a high IDF
• TF x IDF is a common measure of term importance

Raw term frequencies (TF):

        Database  SQL   Index  Regression  Likelihood  Linear
D1         24      21      9        0           0         3
D2         32      10      5        0           3         0
D3         12      16      5        0           0         0
D4          6       7      2        0           0         0
D5         43      31     20        0           3         0
D6          2       0      0       18           7         6
D7          0       0      1       32          12         0
D8          3       0      0       22           4         4
D9          1       0      0       34          27        25
D10         6       0      0       17           4        23

TF x IDF:

        Database  SQL   Index  Regression  Likelihood  Linear
D1        2.53    14.6    4.6       0           0        2.1
D2        3.3      6.7    2.6       0           1.0      0
D3        1.3     11.1    2.6       0           0        0
D4        0.7      4.9    1.0       0           0        0
D5        4.5     21.5   10.2       0           1.0      0
D6        0.2      0      0        12.5         2.5     11.1
D7        0        0      0.5      22.2         4.3      0
D8        0.3      0      0        15.2         1.4      1.4
D9        0.1      0      0        23.56        9.6     17.3
D10       0.6      0      0        11.8         1.4     16.0
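A brief sketch reproducing the TF x IDF values from the raw counts (reusing the matrix td built earlier; a natural logarithm is assumed, which matches the numbers shown):

N     <- nrow(td)                     # total number of documents (10)
Nj    <- colSums(td > 0)              # number of documents containing each term
idf   <- log(N / Nj)                  # inverse document frequency per term
tfidf <- sweep(td, 2, idf, "*")       # multiply each count column by its IDF
round(tfidf["D1", ], 2)               # Database 2.53, SQL 14.56, Index 4.60, ...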


Queries
• A query is a representation of the user’s information needs
• Normally a list of words. 
• Once we have a TD matrix, queries can be represented as a vector 
in the same space
• “Database Index”  = (1,0,1,0,0,0)
• Query can be a simple question in natural language

• Calculate cosine distance between query and the TF x IDF version 
of the TD matrix
• Returns a ranked vector of documents
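A hedged sketch ranking the example documents against the query "Database Index", using the TF x IDF matrix and the cosine_sim helper defined earlier:

query  <- c(1, 0, 1, 0, 0, 0)                       # "Database Index" in the same 6-term space
scores <- apply(tfidf, 1, cosine_sim, b = query)    # cosine of each document against the query
sort(scores, decreasing = TRUE)                     # ranked result list (database documents first)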

Document Clustering
• Can also do clustering, or unsupervised learning of docs.
• Automatically group related documents based on their 
content.
• Require no training sets or predetermined taxonomies.
• Major steps
• Preprocessing
• Remove stop words, stem, feature extraction, lexical analysis, …
• Hierarchical clustering
• Compute similarities applying clustering algorithms, …
• Slicing
• Fan out controls, flatten the tree to desired number of levels.

• Like all clustering examples, success is relative
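A minimal sketch of hierarchical document clustering on the TF x IDF matrix from before (two clusters is an illustrative choice):

d  <- dist(tfidf)                      # pairwise distances between documents
hc <- hclust(d, method = "ward.D2")    # agglomerative hierarchical clustering
plot(hc)                               # dendrogram of the documents
cutree(hc, k = 2)                      # slice the tree into 2 groups (database vs. regression documents)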


Document Clustering
• To Cluster:
• Can use LSI
• Another model: Latent Dirichlet Allocation (LDA)
• LDA is a generative probabilistic model of a corpus. Documents are represented as random 
mixtures over latent topics, where a topic is characterized by a distribution over words.
• LDA:
• Three concepts: words, topics, and documents
• Documents are a collection of words and have a probability distribution over 
topics
• Topics have a probability distribution over words
• Fully Bayesian Model
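A hedged sketch of fitting LDA with the topicmodels package, assuming a document-term matrix built from the corpus; k = 2 topics is an arbitrary illustrative choice:

library(tm)
library(topicmodels)
dtm     <- DocumentTermMatrix(corpus_1)                    # documents as rows, terms as columns
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))    # fit a two-topic model
terms(lda_fit, 5)                                          # top 5 words per topic
topics(lda_fit)                                            # most likely topic for each document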

Text Mining: Helpful Data
WordNet

Courtesy: Luca Lanzi


Corpus 

Text Mining ‐ Other Topics
• Part of Speech Tagging
• Assign grammatical tags to words (verb, noun, etc.)
• Helps in understanding documents; typically uses Hidden Markov Models

• Named Entity Classification
• Classification task: can we automatically detect proper nouns and tag them?
• "Mr. Jones" is a person; "Madison" is a town.
• Helps with disambiguation, e.g., "spears"


Text Mining ‐ Other Topics
• Sentiment Analysis
• Automatically determine tone in text: positive, negative or neutral
• Typically uses collections of good and bad words
• “While the traditional media is slowly starting to take John McCain’s straight talking image 
with increasingly large grains of salt, his base isn’t quite ready to give up on their favorite son. 
Jonathan Alter’s bizarre defense of McCain after he was caught telling an outright lie, 
perfectly captures that reluctance[.]”
• Often fit using Naïve Bayes

• There are sentiment word lists out there: 
• See http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis
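A toy word-list sketch of the idea (these tiny lists are hypothetical stand-ins for a real sentiment lexicon):

positive_words <- c("good", "great", "excellent", "love")
negative_words <- c("bad", "bizarre", "lie", "reluctance")

score_sentiment <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))      # crude tokenization
  sum(words %in% positive_words) - sum(words %in% negative_words)
}
score_sentiment("Jonathan Alter's bizarre defense of McCain")   # -1, i.e. negative tone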

Text Mining ‐ Other Topics
• Summarizing text: word clouds
• Take text as input, find the most frequent or interesting words, and display them graphically
• Blogs do this
• Wordle.net
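In R, the wordcloud package (assumed here) can draw one directly from the term frequencies in a term-document matrix:

library(wordcloud)
library(RColorBrewer)
freq <- sort(rowSums(as.matrix(tdm_counts)), decreasing = TRUE)   # term frequencies
wordcloud(names(freq), freq, min.freq = 1,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))  # draw the cloud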


Text Mining Applications
• Marketing applications
• Enables better CRM
• Security applications
• ECHELON, OASIS
• Deception detection (…)
• Medicine and biology
• Literature‐based gene identification (…)
• Academic applications
• Research stream analysis

A bag of words

IT WAS the best of times, it was the worst of times,


it was the age of wisdom, it was the age of
foolishness, it was the epoch of belief, it was the
epoch of incredulity, it was the season of Light, it
was the season of Darkness, it was the spring of
hope, it was the winter of despair, we had
everything before us, we had nothing before us, we
were all going direct to Heaven, we were all going
direct the other way- in short, the period was so far
like the present period, that some of its noisiest
authorities insisted on its being received, for good or
for evil, in the superlative degree of comparison
only.


A matrix
Documents x words matrix, with x_ij = the number of times word j occurs in document i:

           word_1  word_2  ...  word_p
doc_1       x_11    x_12   ...   x_1p
doc_2       x_21    x_22   ...   x_2p
...
doc_n       x_n1    x_n2   ...   x_np
# Example matrix syntax: three ways to store a (sparse) document-term matrix in R
A <- matrix(c(1, rep(0, 6), 2), nrow = 4)                   # dense 4 x 2 matrix; entries (1,1)=1 and (4,2)=2
library(slam)                                               # sparse triplet format used internally by tm
S <- simple_triplet_matrix(i = c(1, 4), j = c(1, 2), v = c(1, 2))
library(Matrix)                                             # general sparse matrix class
M <- sparseMatrix(i = c(1, 4), j = c(1, 2), x = c(1, 2))

tm package

library(tm)                                   # load the tm package
corpus_1 <- Corpus(VectorSource(txt))         # creates a 'corpus' from a character vector

corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)

it was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it was
the epoch of incredulity, it was the season of light, it was the season of
darkness, it was the spring of hope, it was the winter of despair, we
had everything before us, we had nothing before us, we were all going
direct to heaven, we were all going direct the other way- in short, the
period was so far like the present period, that some of its noisiest
authorities insisted on its being received, for good or for evil, in the
superlative degree of comparison only.


Stopwords
library(tm)
corpus_1 <- Corpus(VectorSource(txt))

corpus_1 <- tm_map(corpus_1, content_transformer(tolower))


corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)

it was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it
was the epoch of incredulity, it was the season of light, it was the
season of darkness, it was the spring of hope, it was the winter of
despair, we had everything before us, we had nothing before us, we
were all going direct to heaven, we were all going direct the other
way- in short, the period was so far like the present period, that
some of its noisiest authorities insisted on its being received, for
good or for evil, in the superlative degree of comparison only.

Stopwords
library(tm)
corpus_1 <- Corpus(VectorSource(txt))

corpus_1 <- tm_map(corpus_1, content_transformer(tolower))


corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)

best times, worst times, age wisdom, age foolishness, epoch


belief, epoch incredulity, season light, season darkness,
spring hope, winter despair, everything us, nothing us, going
direct heaven, going direct way- short, period far like present
period, noisiest authorities insisted received, good evil,
superlative degree comparison .


Punctuation
library(tm)
corpus_1 <- Corpus(VectorSource(txt))

corpus_1 <- tm_map(corpus_1, content_transformer(tolower))


corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)

best times worst times age wisdom age foolishness epoch


belief epoch incredulity season light season darkness spring
hope winter despair everything us nothing us going direct
heaven going direct way short period far like present period
noisiest authorities insisted received good evil superlative degree
comparison

Stemming
library(tm)
corpus_1 <- Corpus(VectorSource(txt))

corpus_1 <- tm_map(corpus_1, content_transformer(tolower))


corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)

best time worst time age wisdom age foolish epoch


belief epoch incredul season light season dark spring hope
winter despair everyth us noth us go direct heaven go direct
way short period far like present period noisiest author insist
receiv good evil superl degre comparison


Cleanup
library(tm)
corpus_1 <- Corpus(VectorSource(txt))

corpus_1 <- tm_map(corpus_1, content_transformer(tolower))


corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)

best time worst time age wisdom age foolish epoch belief epoch
incredul season light season dark spring hope winter despair everyth
us noth us go direct heaven go direct way short period far like present
period noisiest author insist receiv good evil superl degre comparison

Term Document Matrix

tdm <- TermDocumentMatrix(corpus_1)

<<TermDocumentMatrix (terms: 35, documents: 1)>>
Non-/sparse entries: 35/0
Sparsity           : 0%
Maximal term length: 10
Weighting          : term frequency (tf)

class(tdm)
[1] "TermDocumentMatrix"    "simple_triplet_matrix"

dim(tdm)
[1] 35  1

age 2 epoch 2 insist 1 short 1


author 1 everyth 1 light 1 spring 1
belief 1 evil 1 like 1 superl 1
best 1 far 1 noisiest 1 time 2
comparison 1 foolish 1 noth 1 way 1
dark 1 good 1 period 2 winter 1
degre 1 heaven 1 present 1 wisdom 1
despair 1 hope 1 receiv 1 worst 1
direct 2 incredul 1 season 2


Ngrams
library(RWeka)                                   # Weka-based tokenizers
four_gram_tokeniser <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 4))   # all 1- to 4-grams
}

tdm_4gram <- TermDocumentMatrix(corpus_1,
                                control = list(tokenize = four_gram_tokeniser))

dim(tdm_4gram)
[1] 163 1

age 2 author insist receiv good 1 dark 1


age foolish 1 belief 1 dark spring 1
age foolish epoch 1 belief epoch 1 dark spring hope 1
age foolish epoch belief 1 belief epoch incredul 1 dark spring hope winter 1
age wisdom 1 belief epoch incredul season 1 degre 1
age wisdom age 1 best 1 degre comparison 1
age wisdom age foolish 1 best time 1 despair 1
author 1 best time worst 1 despair everyth 1
author insist 1 best time worst time 1 despair everyth us 1
author insist receiv 1 comparison 1 despair everyth us noth 1

Text Mining Applications‐ Mining for Lies 
• Deception detection
• A difficult problem
• If detection is limited to only text, then the problem is even more difficult
• The study
• analyzed text-based testimonies of persons of interest at military bases
• used only text-based features (cues)


Text Mining Applications
Mining for Lies

Text Mining Applications
Mining for Lies
Category Example Cues
Quantity Verb count, noun-phrase count, ...
Complexity Avg. no of clauses, sentence length, …
Uncertainty Modifiers, modal verbs, ...
Nonimmediacy Passive voice, objectification, ...
Expressivity Emotiveness
Diversity Lexical diversity, redundancy, ...
Informality Typographical error ratio
Specificity Spatiotemporal, perceptual information …
Affect Positive affect, negative affect, etc.


Text Mining Applications
Mining for Lies
• 371 usable statements are generated
• 31 features are used
• Different feature selection methods used
• 10‐fold cross validation is used
• Results (overall % accuracy)
• Logistic regression 67.28
• Decision trees 71.60
• Neural networks 73.46

Text Mining Applications
(gene/protein interaction identification)
Example sentence shown with its annotation layers: token IDs, ontology concept identifiers (e.g., MeSH descriptors D016923, D001773), part-of-speech tags (NN, IN, VBZ, JJ, CC) and shallow-parse phrase chunks (NP, PP):

...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53.


Text Mining Tools
• Commercial Software Tools
• SPSS PASW Text Miner
• SAS Enterprise Miner
• Statistica Data Miner
• ClearForest, …
• Free Software Tools
• RapidMiner
• GATE
• Spy‐EM, … 
