
Business Intelligence and Data Mining

By Dr. Atanu Rakshit Email: atanu.rakshit@iimrohtak.ac.in atanu.raks@gmail.com

Business Intelligence and Data Mining (BI & DM)


Text Book:
Business Intelligence: A Managerial Approach by Efraim Turban, Ramesh Sharda, Dursun Delen and David King, 2/e, Pearson, 2012

Reference Material:
Decision Support and Business Intelligence Systems by Efraim Turban, Ramesh Sharda and Dursun Delen, 9/e, Pearson, 2012

Business Intelligence Strategy: A Practical Guide for Achieving BI Excellence by John Boyer, Bill Frank, Brian Green and Tracy Harris, MC Press, 2010
Business Analytics for Managers by Gert H. N. Laursen and Jesper Thorlund, Wiley, 2010

Sessions Plan

Introduction to Business Intelligence
Decision Support Systems Concepts, Methodologies and Technologies
Data Warehousing
Business Performance Management
Data Mining for Business Intelligence
Text and Web Mining
Business Intelligence: Implementation and Emerging Trends


Introduction to Text and Web Mining

Learning Objectives
Describe text mining and understand the need for text mining
Differentiate between text mining, Web mining and data mining
Understand the different application areas for text mining
Know the process of carrying out a text mining project
Understand the different methods to introduce structure to text-based data
Describe Web mining, its objectives, and its benefits
Understand the three different branches of Web mining
  Web content mining
  Web structure mining
  Web usage mining
Understand the applications of these three mining paradigms

Opening Vignette
Mining Text For Security And Counterterrorism
What is MITRE?
Problem description
Proposed solution
Results
Answer & discuss the case questions

Opening Vignette: Mining Text For Security


Cluster 1: (L) Kampala, (L) Uganda, (P) Yoweri Museveni, (L) Sudan, (L) Khartoum, (L) Southern Sudan
Cluster 2: (P) Timothy McVeigh, (L) Oklahoma City, (P) Terry Nichols
Cluster 3: (E) election, (P) Norodom Ranariddh, (P) Norodom Sihanouk, (L) Bangkok, (L) Cambodia, (L) Phnom Penh, (L) Thailand, (P) Hun Sen, (O) Khmer Rouge, (P) Pol Pot

Text Mining Concepts


85-90 percent of all corporate data is in some kind of unstructured form (e.g., text).
Unstructured corporate data is doubling in size every 18 months.
Tapping into these information sources is not an option but a necessity for staying competitive.
Answer: text mining
  A semi-automated process of extracting knowledge from unstructured data sources
  Also known as text data mining or knowledge discovery in textual databases

Data Mining versus Text Mining


Both seek novel and useful patterns
Both are semi-automated processes
The difference is the nature of the data:
  Structured versus unstructured data
  Structured data: databases
  Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on

Text mining: first impose structure on the data, then mine the structured data

Text Mining Concepts


Benefits of text mining are obvious especially in text-rich data environments
e.g., law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), marketing (customer comments), etc.

Electronic communication records (e.g., Email)


Spam filtering
Email prioritization and categorization
Automatic response generation

Challenges
Information is in an unstructured textual form
Large textual databases
  Almost all publications are also in electronic form
Very high number of possible dimensions
  All possible word and phrase types in the language
Complex and subtle relationships between concepts in text
  "AOL merges with Time-Warner" versus "Time-Warner is bought by AOL"
Word ambiguity and context sensitivity
  Apple (the computer) or apple (the fruit)
Noisy data
  Example: spelling mistakes

What is Text-Mining?
finding interesting regularities in large textual datasets (adapted from Usama Fayyad)
where interesting means: non-trivial, hidden, previously unknown and potentially useful

finding semantic and abstract information from the surface form of textual data

Why is dealing with text tough? (M. Hearst 97)

Abstract concepts are difficult to represent
Countless combinations of subtle, abstract relationships among concepts
Many ways to represent similar concepts
  E.g. space ship, flying saucer, UFO
Concepts are difficult to visualize
High dimensionality
  Tens or hundreds of thousands of features

Why is dealing with text easy? (M. Hearst 97)

Highly redundant data
  Most of the methods count on this property
Just about any simple algorithm can get "good" results for simple tasks:
  Pull out "important" phrases
  Find "meaningfully" related words
  Create some sort of summary from documents

Semi-Structured Data
Text databases are, in general, semi-structured
Example:
  Title, Author, Publication_Date, Length, Category: structured attribute/value pairs
  Abstract, Content: unstructured

Text Mining Process


Text preprocessing
Syntactic/Semantic text analysis

Features Generation
Bag of words

Features Selection
Simple counting Statistics

Text/Data Mining
Classification Clustering Associations

Analyzing results

Who is in the text analysis arena?


Knowledge Rep. & Reasoning / Tagging

Search & DB

Computational Linguistics

Data Analysis

What dimensions are in text analytics?


Three major dimensions of text analytics:
Representations
from character-level to first-order theories

Techniques
from manual work, over learning to reasoning

Tasks
from search, over (un-, semi-) supervised learning, to visualization, summarization, translation

Text-Mining

How do we represent text?

Levels of text representations


Character (character n-grams and sequences)
Words (stop-words, stemming, lemmatization)
Phrases (word n-grams, proximity features)
Part-of-speech tags
Taxonomies / thesauri
Vector-space model
Language models
Full-parsing
Cross-modality
Collaborative tagging / Web2.0
Templates / Frames
Ontologies / First order theories


Character level
A character-level representation of a text consists of sequences of characters
  A document is represented by a frequency distribution of sequences
  Usually we deal with contiguous strings
  Each character sequence of length 1, 2, 3, ... represents a feature with its frequency
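
The character-level representation described above can be sketched in a few lines. This is a minimal illustration only; the function name `char_ngrams`, the `max_n` parameter and the sample string are assumptions, not part of the original slides.

```python
from collections import Counter


def char_ngrams(text, max_n=3):
    """Count every contiguous character sequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


# Each distinct sequence is one feature; its count is the feature value.
features = char_ngrams("banana")
print(features.most_common(5))
```

Because the features are plain substrings, no tokenizer or language-specific morphology is needed, which is the robustness the slide refers to.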

Good and bad sides


The representation has several important strengths:
  It is very robust since it avoids language morphology (useful for e.g. language identification)
  It captures simple patterns on the character level (useful for e.g. spam detection, copy detection)
  Because of the redundancy in text data it can be used for many analytic tasks (learning, clustering, search)
  It is used as a basis for string kernels in combination with SVM for capturing complex character sequence patterns

For deeper semantic tasks, the representation is too weak


Word level
The most common representation of text, used for many techniques
  There are many tokenization software packages which split text into words

Important to know:
  The word is a well-defined unit in Western languages
  Other languages, e.g. Chinese, have a different notion of semantic unit

Words Properties
Relations among word surface forms and their senses:
  Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution)
  Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)
  Synonymy: different form, same meaning (e.g. singer, vocalist)
  Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)

Word frequencies in texts follow a power-law distribution:
  a small number of very frequent words
  a large number of low-frequency words

Document Representation
Stop Word Removal: Many words are not informative and thus irrelevant for document representation
  the, and, a, an, is, of, that, ...
Stemming: Reducing words to their root form
  A document may contain several occurrences of words like fish, fishes, fisher, fishers, ...
  but would not be retrieved by a query with the keyword fishing
  Different words share the same word stem and should be represented by the stem (fish) instead of the actual word
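
A rough sketch of the two preprocessing steps above. The stop-word set is the one listed on the slide; the suffix rules in `toy_stem` are a deliberately crude assumption for illustration and are not a real stemming algorithm such as Porter's.

```python
import re

STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}
SUFFIXES = ("ing", "ers", "es", "er", "s")  # toy rule set, tried in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The fisher is fishing and the fishers like fishes"))
```

After this step a query for "fishing" and a document containing "fishers" both reduce to the stem "fish", so they can match.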


Phrase level
Instead of having just single words we can deal with phrases
We use two types of phrases:
  Phrases as frequent contiguous word sequences
  Phrases as frequent non-contiguous word sequences
  Both types of phrases can be identified by a simple dynamic programming algorithm

The main effect of using phrases is to identify sense more precisely


Part-of-Speech level
By introducing part-of-speech tags we introduce word types, enabling us to differentiate the functions of words
  For text analysis, part-of-speech information is used mainly for information extraction, where we are interested in e.g. named entities, which are noun phrases
  Another possible use is reduction of the vocabulary (features)
    it is known that nouns carry most of the information in text documents

Part-of-speech taggers are usually learned with a hidden Markov model (HMM) algorithm on manually tagged data

Part-of-Speech Table

http://www.englishclub.com/grammar/parts-of-speech_1.htm

Part-of-Speech examples

http://www.englishclub.com/grammar/parts-of-speech_2.htm


Taxonomies/thesaurus level
The main function of a thesaurus is to connect different surface word forms with the same meaning into one sense (synonyms)
  Additionally, we often use the hypernym relation to relate general-to-specific word senses
  By using synonyms and the hypernym relation we compact the feature vectors

The most commonly used general thesaurus is WordNet, which also exists in many other languages (e.g. EuroWordNet)
http://www.illc.uva.nl/EuroWordNet/

WordNet database of lexical relations


WordNet is the most well-developed and widely used lexical database for English
  It consists of 4 databases (nouns, verbs, adjectives, and adverbs)

Category    Unique Forms   Number of Senses
Noun        94474          116317
Verb        10319          22066
Adjective   20170          29881
Adverb      4546           5677

Each database consists of sense entries; each sense consists of a set of synonyms, e.g.:
  musician, instrumentalist, player
  person, individual, someone
  life form, organism, being

WordNet excerpt from the graph


[Figure: excerpt from the graph around the sense "bird". Nodes are senses (e.g. chicken, duck, goose, hen, hawk, turtle, poultry, animal, creature, feather, wing, beak, claw, leg) and edges are relations (e.g. Is_a, Part, Typ_obj, Typ_subj, Means, Purpose, Caused_by, Location). The full graph has 26 relations and 116k senses.]

WordNet relations
Each WordNet entry is connected with other entries in the graph through relations
Relations in the database of nouns:

Relation     Definition                       Example
Hypernym     From lower to higher concepts    breakfast -> meal
Hyponym      From concepts to subordinates    meal -> lunch
Has-Member   From groups to their members     faculty -> professor
Member-Of    From members to their groups     copilot -> crew
Has-Part     From wholes to parts             table -> leg
Part-Of      From parts to wholes             course -> meal
Antonym      Opposites                        leader -> follower

Document Representation
A document representation aims to capture what the document is about
One possible approach:
  Each entry describes a document
  Attributes describe whether or not a term appears in the document

Document     Camera   Digital   Memory   Pixel   ...
Document 1   1        1         0        1
Document 2   1        1         0        0

Document Representation
Another approach:
  Each entry describes a document
  Attributes represent the frequency with which a term appears in the document

Example: Term frequency table

Document     Camera   Digital   Memory   Pixel   ...
Document 1   3        2         0        1
Document 2   0        4         0        3
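
Both document representations can be derived from raw text in the same pass. The sketch below is illustrative: the two toy documents are made up so that the counts reproduce the term frequency table above, and the function names are assumptions.

```python
from collections import Counter

VOCABULARY = ["camera", "digital", "memory", "pixel"]

# Hypothetical documents chosen to match the frequency table on the slide.
DOCS = [
    "camera camera camera digital digital pixel",
    "digital digital digital digital pixel pixel pixel",
]


def term_frequency_table(docs, vocabulary):
    """One row per document, one column per vocabulary term (raw counts)."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[term] for term in vocabulary])
    return rows


def binary_table(frequency_rows):
    """Collapse counts to 1 (term present) or 0 (term absent)."""
    return [[1 if count > 0 else 0 for count in row] for row in frequency_rows]


frequencies = term_frequency_table(DOCS, VOCABULARY)
print(frequencies)
print(binary_table(frequencies))
```

The binary table is simply the frequency table with every non-zero count replaced by 1, so the frequency form carries strictly more information.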


Vector-space model level


The most common way to deal with documents is first to transform them into sparse numeric vectors and then handle them with linear algebra operations
  By doing this, we forget everything about the linguistic structure within the text
  This is sometimes called the "structural curse", although forgetting about the structure doesn't harm the efficiency of solving many relevant problems
  This representation is also referred to as Bag-of-Words or the Vector-Space Model
  Typical tasks on the vector-space model are classification, clustering, visualization, etc.

Bag-of-words document representation

Word weighting
In the bag-of-words representation each word is represented as a separate variable having a numeric weight (importance)
The most popular weighting scheme is normalized word frequency, TFIDF:

$$\mathrm{tfidf}(w) = \mathrm{tf}(w) \cdot \log\frac{N}{\mathrm{df}(w)}$$

tf(w): term frequency (number of occurrences of the word in a document)
df(w): document frequency (number of documents containing the word)
N: number of all documents
tfidf(w): relative importance of the word in the document

The word is more important if it appears several times in a target document

The word is more important if it appears in fewer documents
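
A minimal sketch of the TFIDF weighting defined above, assuming the natural logarithm (the slide does not state the base) and documents that are already tokenized. The function name and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(docs)
    document_frequency = Counter()
    for doc in docs:
        document_frequency.update(set(doc))  # count each word once per document
    weights = []
    for doc in docs:
        term_frequency = Counter(doc)
        weights.append({
            word: term_frequency[word] * math.log(n_docs / document_frequency[word])
            for word in term_frequency
        })
    return weights


corpus = [
    ["camera", "digital", "camera"],
    ["digital", "pixel"],
    ["digital", "memory"],
]
weights = tfidf(corpus)
print(weights[0])
```

In this toy corpus "digital" occurs in every document, so its weight is 0 everywhere, while "camera" occurs twice in one document only and gets the largest weight there.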

Distance Based Matching


In order to retrieve documents similar to a given document, one needs a measure of similarity
Euclidean distance
  The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as

$$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
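
As a small illustration of the Euclidean distance between two equal-length document vectors (the function name is an assumption; the vectors in the usage line are made up):

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences of matching components."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance((3, 2, 0, 1), (0, 4, 0, 3)))
```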

Similarity between document vectors


Each document is represented as a vector of weights D = <x>
Cosine similarity (normalized dot product) is the most widely used similarity measure between two document vectors
  calculates the cosine of the angle between document vectors
  efficient to calculate (sum of products of intersecting words)
  similarity value between 0 (different) and 1 (the same)

$$\mathrm{Sim}(D_1, D_2) = \frac{\sum_i x_{1i}\, x_{2i}}{\sqrt{\sum_j x_{1j}^2}\,\sqrt{\sum_k x_{2k}^2}}$$
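
The "sum of products of intersecting words" can be seen directly when documents are stored as sparse dictionaries. A minimal sketch, assuming non-negative weights; the function name and sample vectors are illustrative.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(weight * d2[word] for word, weight in d1.items() if word in d2)
    norm1 = math.sqrt(sum(weight * weight for weight in d1.values()))
    norm2 = math.sqrt(sum(weight * weight for weight in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm1 * norm2)


doc1 = {"camera": 3, "digital": 2, "pixel": 1}
doc2 = {"digital": 4, "pixel": 3}
print(cosine_similarity(doc1, doc2))
```

Unlike Euclidean distance, the result does not depend on document length: scaling all weights of one document by a constant leaves the similarity unchanged.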

Performance Measure
The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure
The quality of the retrieved set can be assessed by the following two measures:

$$\mathrm{Precision} = \frac{|\mathrm{Relevant} \cap \mathrm{Retrieved}|}{|\mathrm{Retrieved}|}$$

$$\mathrm{Recall} = \frac{|\mathrm{Relevant} \cap \mathrm{Retrieved}|}{|\mathrm{Relevant}|}$$

[Figure: Venn diagram of Relevant Documents and Retrieved Documents; their overlap is Relevant & Retrieved]
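
Precision and recall follow directly from set intersection. A short sketch; the document IDs are made up and the function name is an assumption.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document IDs."""
    hits = len(relevant & retrieved)  # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant_docs = {1, 2, 3, 4}
retrieved_docs = {3, 4, 5}
print(precision_recall(relevant_docs, retrieved_docs))
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were retrieved (recall 1/2).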

Classification techniques
Decision Tree Classification
Bayesian Classifiers
Neural Networks
Statistical Analysis
Genetic Algorithms
Rough Set Approach
k-nearest neighbor classifiers

Cluster Analysis for Data Mining


Analysis methods
Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
Fuzzy logic (e.g., fuzzy c-means algorithm)
Genetic algorithms

Divisive versus Agglomerative methods

Text Mining for Patent Analysis (see Applications Case 7.2)


What is a patent?
exclusive rights granted by a country to an inventor for a limited period of time in exchange for a disclosure of an invention

How do we do patent analysis (PA)? Why do we need to do PA?


What are the benefits? What are the challenges?

How does text mining help in PA?

Natural Language Processing (NLP)


Structuring a collection of text
  Old approach: bag-of-words
  New approach: natural language processing

NLP is
  a very important concept in text mining
  a subfield of artificial intelligence and computational linguistics
  the study of "understanding" the natural human language

Syntax versus semantics based text mining

Natural Language Processing (NLP)


What is Understanding ?
  Humans understand; what about computers?
  Natural language is vague and context driven
  True understanding requires extensive knowledge of a topic
  Can/will computers ever understand natural language as accurately as we do?

Natural Language Processing (NLP)


Challenges in NLP
  Part-of-speech tagging
  Text segmentation
  Word sense disambiguation
  Syntax ambiguity
  Imperfect or irregular input
  Speech acts

Dream of AI community
to have algorithms that are capable of automatically reading and obtaining knowledge from text

Natural Language Processing (NLP)


WordNet
  A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets
  A major resource for NLP
  Needs automation to be completed

Sentiment Analysis
  A technique used to detect favorable and unfavorable opinions toward specific products and services
  See Application Case 7.3 for a CRM application

NLP Task Categories


Information retrieval
Information extraction
Named-entity recognition
Question answering
Automatic summarization
Natural language generation & understanding
Machine translation
Foreign language reading & writing
Speech recognition
Text proofing
Optical character recognition

Text Mining Applications


Marketing applications
  Enables better CRM

Security applications
  ECHELON, OASIS
  Deception detection
    example coming up

Medicine and biology
  Literature-based gene identification
    example coming up

Academic applications
  Research stream analysis
    example coming up

Text Mining Applications


(gene/protein interaction identification)

[Figure: the sentence fragment "...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53..." annotated in layers: gene/protein identifiers, ontology identifiers, word identifiers, part-of-speech (POS) tags such as NN, IN, VBZ, JJ and CC, and shallow-parse chunks such as NP and PP.]

Text Mining Process


Context diagram for the text mining process
  Activity (A0): Extract knowledge from available data sources
  Inputs: unstructured data (text), structured data (databases)
  Output: context-specific knowledge
  Constraints: software/hardware limitations, privacy issues, linguistic limitations
  Mechanisms: domain expertise, tools and techniques

Text Mining Process


The three-step text mining process

Task 1: Establish the Corpus: Collect & Organize the Domain-Specific Unstructured Data
Task 2: Create the Term-Document Matrix: Introduce Structure to the Corpus
Task 3: Extract Knowledge: Discover Novel Patterns from the T-D Matrix
Feedback: the results of Task 2 and Task 3 feed back into the earlier tasks

The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.

The output of Task 1 is a collection of documents in some digitized format for computer processing

The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies

The output of Task 3 is a number of problem-specific classification, association and clustering models and visualizations

Text Mining Process


Step 1: Establish the corpus
  Collect all relevant unstructured data (e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings)
  Digitize, standardize the collection (e.g., all in ASCII text files)
  Place the collection in a common place (e.g., in a flat file, or in a directory as separate files)

Text Mining Process


Step 2: Create the Term-by-Document Matrix

[Figure: example term-by-document matrix. Rows are Document 1 to Document 6, ...; columns are terms such as investment risk, project management, software engineering, development, SAP, ...; most cells are empty and the rest hold small term counts (1, 2, 3).]

Text Mining Process


Step 2: Create the Term-by-Document Matrix (TDM)
Should all terms be included?
  Stop words, include words
  Synonyms, homonyms
  Stemming

What is the best representation of the indices (values in cells)?
  Raw counts; binary frequencies; log frequencies; inverse document frequency

Text Mining Process


Step 2: Create the Term-by-Document Matrix (TDM)
TDM is a sparse matrix. How can we reduce the dimensionality of the TDM?
  Manual: a domain expert goes through it
  Eliminate terms with very few occurrences in very few documents (?)
  Transform the matrix using singular value decomposition (SVD)
  SVD is similar to principal component analysis

Text Mining Process


Step 3: Extract patterns/knowledge
  Classification (text categorization)
  Clustering (natural groupings of text)
    Improve search recall
    Improve search precision
    Scatter/gather
    Query-specific clustering
  Association
  Trend analysis

Web Mining
The term was coined by Oren Etzioni (1996)
Application of data mining techniques to automatically discover and extract information from Web data

What is Web Mining?

Discovering useful information from the World-Wide Web and its usage patterns

Web Mining v. Data Mining


Structure (or lack of it)
Textual information and linkage structure

Scale
Data generated per day is comparable to largest conventional data warehouses

Speed
Often need to react to evolving usage patterns in real-time (e.g., merchandising)

Web Mining topics


Web graph analysis
Power Laws and The Long Tail
Structured data extraction
Web advertising
Systems Issues


Size of the Web


Number of pages
  Technically, infinite
  Much duplication (30-40%)
  Best estimate of unique static HTML pages comes from search engine claims
    Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
    Google recently announced that their index contains 1 trillion pages
    How to explain the discrepancy?

The web as a graph


Pages = nodes, hyperlinks = edges
  Ignore content
  Directed graph

High linkage
  10-20 links/page on average
  Power-law degree distribution
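
A minimal sketch of the web-as-graph view: pages are nodes, hyperlinks are directed edges, and in-degree and out-degree are simple counts over the edge list. The four-page link list is a made-up toy example and the function names are assumptions.

```python
from collections import Counter

# Hypothetical hyperlinks as (source page, target page) pairs.
LINKS = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]


def degree_counts(links):
    """Return (in_degree, out_degree) counters keyed by page."""
    out_degree = Counter(source for source, _ in links)
    in_degree = Counter(target for _, target in links)
    return in_degree, out_degree


def degree_distribution(degrees, pages):
    """Map each degree value to the number of pages that have it."""
    return Counter(degrees[page] for page in pages)


pages = sorted({page for link in LINKS for page in link})
in_degree, out_degree = degree_counts(LINKS)
print(dict(in_degree), dict(out_degree))
print(dict(degree_distribution(in_degree, pages)))
```

On a real crawl, plotting the degree distribution on log-log axes is how the power-law shape mentioned above is usually inspected.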

Structure of Web graph


Let's take a closer look at the structure
  Broder et al (2000) studied a crawl of 200M pages and other smaller crawls
  Bow-tie structure
  Not a "small world"

Bow-tie Structure

Source: Broder et al, 2000

What can the graph tell us?


Distinguish important pages from unimportant ones
Page rank

Discover communities of related pages


Hubs and Authorities

Detect web spam


Trust rank
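
To make the idea of ranking pages by link structure concrete, here is a sketch of PageRank in its commonly described power-iteration form. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the toy link list are all assumptions for illustration; the slides do not specify them.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a list of (source, target) links."""
    nodes = sorted({page for link in links for page in link})
    out_links = {page: [] for page in nodes}
    for source, target in links:
        out_links[source].append(target)

    rank = {page: 1.0 / len(nodes) for page in nodes}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {page: (1.0 - damping) / len(nodes) for page in nodes}
        for source in nodes:
            # A page with no out-links spreads its rank over all pages.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank([("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")])
print(sorted(ranks.items(), key=lambda item: item[1], reverse=True))
```

In this toy graph page "c" is linked from every other page and ends up with the highest score, which matches the intuition that in-links act as endorsements.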


Power-law degree distribution

Source: Broder et al, 2000

Power-laws galore
Structure
  In-degrees
  Out-degrees
  Number of pages per site

Usage patterns
  Number of visitors
  Popularity, e.g., products, movies, music

The Long Tail

Source: Chris Anderson (2004)


Extracting Structured Data

http://www.simplyhired.com

Extracting structured data

http://www.fatlens.com



Ads vs. search results


Search advertising is the revenue model
  Multi-billion-dollar industry
  Advertisers pay for clicks on their ads

Interesting problems
  What ads to show for a search?
  If I'm an advertiser, which search terms should I bid on and how much should I bid?

Two Approaches to Analyzing Data


Machine Learning approach
  Emphasizes sophisticated algorithms, e.g., Support Vector Machines
  Data sets tend to be small, fit in memory

Data Mining approach
  Emphasizes big data sets (e.g., in the terabytes)
  Data cannot even fit on a single disk!
  Necessarily leads to simpler algorithms


Systems architecture

[Figure: three architectures. Machine Learning, Statistics: a single CPU working in memory. Classical Data Mining: a single CPU with memory and disk. Very Large-Scale Data Mining: a cluster of commodity nodes, each with its own CPU, memory and disk.]

Systems Issues
Web data sets can be very large
Tens to hundreds of terabytes

Cannot mine on a single server!


Need large farms of servers

How to organize hardware/software to mine multi-terabyte data sets


Without breaking the bank!

Project
Lots of interesting project ideas
  If you can't think of one, please come discuss with us

Infrastructure
  Aster Data cluster on Amazon EC2
  Supports both MapReduce and SQL

Data
  Netflix
  ShareThis
  Google
  WebBase
  TREC

Data Mining vs. Web Mining


Traditional data mining
  Data is structured and relational
  Well-defined tables, columns, rows, keys, and constraints

Web data
  Semi-structured and unstructured
  Readily available data
  Rich in features and patterns

Web Data

Web Structure
  Hyperlinks, e.g. the tag behind "Click here to Shop Online"

Web Usage
  Application server logs
  HTTP logs

Web Content

Web Mining Categories


Web Content Mining
Discovering useful information from web contents/data/documents.

Web Structure Mining


Discovering the model underlying link structures (topology) on the Web. E.g. discovering authorities and hubs

Web Usage Mining


Make sense of data generated by surfers
Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.

Web Content Data Structure


Unstructured: free text
Semi-structured: HTML
More structured: table or database-generated HTML pages
Multimedia data receive less attention than text or hypertext

Web Content Mining


Process of information or resource discovery from content of millions of sources across the World Wide Web
E.g. Web data contents: text, image, audio, video, metadata and hyperlinks

Goes beyond key word extraction, or some simple statistics of words and phrases in documents.

Web Content Mining


Pre-processing data before web content mining: feature selection (Piramuthu 2003)
Post-processing data can reduce ambiguous search results (Sigletos & Paliouras 2003)
Web Page Content Mining
  Mines the contents of documents directly

Search Engine Mining
  Improves on the content search of other tools like search engines

Web Content Mining


Web content mining is related to data mining and text mining. [Bing Liu. 2005]
It is related to data mining because many data mining techniques can be applied in Web content mining.
It is related to text mining because much of the web content is text.
Web data are mainly semi-structured and/or unstructured, while data mining deals with structured data and text mining with unstructured text.

Web Content Mining: IR View


Unstructured Documents
Bag of words, or phrase-based feature representation
Features can be boolean or frequency based
Features can be reduced using different feature selection techniques
Word stemming, combining morphological variations into one feature

Web Content Mining: IR View


Semi-Structured Documents
Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
Uses common data mining methods (whereas unstructured might use more text mining methods)

Web Content Mining: DB View


Tries to infer the structure of a Web site or transform a Web site to become a database
Better information management Better querying on the Web

Can be achieved by:


Finding the schema of Web documents
Building a Web warehouse
Building a Web knowledge base
Building a virtual database

Web-Structure Mining
Generate a structural summary about the Web site and Web page
  Depending on the hyperlinks, categorizing the Web pages and the related information at the inter-domain level
  Discovering the Web page structure
  Discovering the nature of the hierarchy of hyperlinks in the website and its structure

Web-Structure Mining (cont.)

Finding information about web pages
  Retrieving information about the relevance and the quality of the web page
  Finding the authoritative pages on the topic and content

Inference on hyperlinks
  A web page contains not only information but also hyperlinks, which carry a huge amount of annotation
  A hyperlink identifies the author's endorsement of the other web page

Web-Structure Mining (cont.)

More Information on Web Structure Mining
  Web page categorization (Chakrabarti 1998)
  Finding micro-communities on the web, e.g. Google (Brin and Page, 1998)
  Schema discovery in semi-structured environments

Web Usage Mining


Tries to predict user behavior from interaction with the Web
Wide range of data (logs)
  Web client data
  Proxy server data
  Web server data

Two common approaches
  Map usage data into relational tables before using adapted data mining techniques
  Use log data directly by utilizing special pre-processing techniques

Web Usage Mining


Typical problems: distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
Often Usage Mining uses some background or domain knowledge
  E.g. site topology, Web content, etc.

Web Usage Mining

Two main categories:

Learning a user profile (personalized)
  Web users would be interested in techniques that learn their needs and preferences automatically
Learning user navigation patterns (impersonalized)
  Information providers would be interested in techniques that improve the effectiveness of their Web site or bias the users towards the goals of the site

Web-Usage Mining (cont.)

Analysis: Data Mining Techniques: Navigation Patterns

Example:
  70% of users who accessed /company/product2 did so by starting at /company and proceeding through /company/new, /company/products and /company/product1
  80% of users who accessed the site started from /company/products
  65% of users left the site after four or fewer page references
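
Statistics of the kind quoted in the navigation pattern example can be computed once log entries have been grouped into sessions. A minimal sketch; the four sessions are made-up data (they do not reproduce the percentages on the slide) and the function names are assumptions.

```python
# Hypothetical sessions: each is the ordered list of pages one user visited.
SESSIONS = [
    ["/company", "/company/new", "/company/products"],
    ["/company/products", "/company/product1"],
    ["/company/products", "/company/product1", "/company/product2",
     "/company", "/company/new"],
    ["/company/products"],
]


def share_starting_at(sessions, page):
    """Fraction of sessions whose first page reference is `page`."""
    return sum(1 for s in sessions if s and s[0] == page) / len(sessions)


def share_with_at_most(sessions, max_pages):
    """Fraction of sessions that ended after `max_pages` or fewer references."""
    return sum(1 for s in sessions if len(s) <= max_pages) / len(sessions)


print(share_starting_at(SESSIONS, "/company/products"))
print(share_with_at_most(SESSIONS, 4))
```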

Web-Usage Mining (cont.)

Data Mining Techniques: Sequential Patterns

Example: Supermarket

Customer   Transaction Time    Purchased Items
John       6/21/05 5:30 pm     Beer
John       6/22/05 10:20 pm    Brandy
Frank      6/20/05 10:15 am    Juice, Coke
Frank      6/20/05 11:50 am    Beer
Frank      6/20/05 12:50 pm    Wine, Cider
Mary       6/20/05 2:30 pm     Beer
Mary       6/21/05 6:17 pm     Wine, Cider
Mary       6/22/05 5:05 pm     Brandy

Web-Usage Mining (cont.)

Data Mining Techniques: Sequential Patterns

Example: Supermarket (cont.)

Customer   Customer Sequence
John       (Beer) (Brandy)
Frank      (Juice, Coke) (Beer) (Wine, Cider)
Mary       (Beer) (Wine, Cider) (Brandy)

Mining Result

Sequential Patterns with Support >= 40%   Supporting Customers
(Beer) (Brandy)                           John, Mary
(Beer) (Wine, Cider)                      Frank, Mary
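
The mining result above can be checked with a short support computation over the three customer sequences. This sketch only tests given candidate patterns; it is not a full sequential pattern mining algorithm, and the function names are assumptions.

```python
CUSTOMER_SEQUENCES = {
    "John": [{"Beer"}, {"Brandy"}],
    "Frank": [{"Juice", "Coke"}, {"Beer"}, {"Wine", "Cider"}],
    "Mary": [{"Beer"}, {"Wine", "Cider"}, {"Brandy"}],
}


def contains_pattern(sequence, pattern):
    """True if each pattern itemset is a subset of a later transaction, in order."""
    position = 0
    for itemset in pattern:
        # Advance to the first remaining transaction that contains the itemset.
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def pattern_support(sequences, pattern):
    """Return (support fraction, supporting customers) for one pattern."""
    supporters = [name for name, seq in sequences.items()
                  if contains_pattern(seq, pattern)]
    return len(supporters) / len(sequences), supporters


print(pattern_support(CUSTOMER_SEQUENCES, [{"Beer"}, {"Brandy"}]))
print(pattern_support(CUSTOMER_SEQUENCES, [{"Beer"}, {"Wine", "Cider"}]))
```

Both patterns are supported by two of the three customers (about 67%), which clears the 40% threshold.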

Web-Usage Mining (cont.)

Data Mining Techniques: Sequential Patterns

Web usage examples
  In Google search, within the past week 30% of users who visited /company/product/ had "camera" as text.
  60% of users who placed an online order in /company/product1 also placed an order in /company/product4 within 15 days

Tech for Web Content Mining

Classification
Clustering
Association

Document Classification
Supervised Learning
Supervised learning is a machine learning technique for creating a function from training data.
  Documents are categorized
  The output can predict a class label of the input object (called classification).

Techniques used are
  Nearest Neighbor Classifier
  Feature Selection
  Decision Tree

Feature Selection
Removes terms in the training documents which are statistically uncorrelated with the class labels
Simple heuristics
  Stop words like "a", "an", "the", etc.
  Empirically chosen thresholds for ignoring too frequent or too rare terms
  Discard too frequent and too rare terms

Document Clustering
Unsupervised Learning: a data set of input objects is gathered
Goal: evolve measures of similarity to cluster a collection of documents/terms into groups such that similarity within a cluster is larger than across clusters.
Hypothesis: given a "suitable" clustering of a collection, if the user is interested in document/term d/t, he or she is likely to be interested in other members of the cluster to which d/t belongs.
Hierarchical
  Bottom-Up
  Top-Down
Partitional

Semi-Supervised Learning
A collection of documents is available
A subset of the collection has known labels
Goal: to label the rest of the collection.
Approach
  Train a supervised learner using the labeled subset.
  Apply the trained learner on the remaining documents.
Idea
  Harness information in the labeled subset to enable better learning.
  Also, check the collection for emergence of new topics

Association
Example: Supermarket

Transaction ID   Items Purchased
1                butter, bread, milk
2                bread, milk, beer, egg
3                diaper

An association rule can be
  "If a customer buys milk, in 50% of cases he/she also buys beer. This happens in 33% of all transactions."
  50%: confidence
  33%: support
Can also be integrated with hyperlinks
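
The support and confidence figures in the supermarket example can be reproduced from the three transactions. A minimal sketch; the function name `rule_metrics` is an assumption.

```python
TRANSACTIONS = [
    {"butter", "bread", "milk"},
    {"bread", "milk", "beer", "egg"},
    {"diaper"},
]


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


support, confidence = rule_metrics(TRANSACTIONS, {"milk"}, {"beer"})
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

Milk appears in two transactions and beer accompanies it in one of them, giving 50% confidence; that single transaction is one of three overall, giving 33% support.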

Q&A