Algorithmics Research On Knowledge Discovery and Data Mining

Algorithmics Research on Knowledge Discovery and Data Mining
Vladimir Estivill-Castro
School of Computing and Information Technology
Outline
u u u
Motivation What is Data Mining Extent of The Field

u Web
Mining, Text Mining, Software Engineering and Data Mining
u Example
of algorithm
Oriented Generalization
Vladimir Estivill -Castro 2
u Attribute
Motivation
Motivation
u
The problem of data overload looms ominously ahead.

u
Our technology to analyze data and understand massive datasets lags far behind our technology to gather and store data.
Vladimir Estivill -Castro
Interest for Knowledge Discovery

u
Emerged in Databases
u
Large logs of transactions

u u
Credit card transactions Supermarket transactions
Data Mining (1960s)

u
Massaging the data so statistics would reflect my preconceived hypothesis Data => Hypothesis vs Hypothesis validated by data
Deductive vs Inductive Science

u
Knowledge Discovery
The nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patters in large data sets.
Data: The geo-referenced layers. u Information: The average population per administrative region u Knowledge: The patterns of growth of population densities and valid explanations for them.
u
Vladimir Estivill -Castro 7 6
Aspects of Knowledge Discovery

u
Nontrivial
u
u u
Potentially useful
u
it goes beyond computing closed- from quantities or evaluating models. the discovered patterns are true with some degree of certainty of unseen data
for the user simple, descriptive, informative
Understandable
u
Valid
u
What is Data Mining
What is Data Mining?

u
Different answers
Using computers to make sense of large volumes of data

u using
u u
a set of fundamental data manipulation tasks
classification association rules / basket analysis
what makes it different from Statistics?
A task in the process of knowledge discovery A multidisciplinary field

u what
makes it different from Statistics?

Mining Different Kinds of Knowledge

u Description u Estimation. u Prediction. u Affinity
Grouping .
u Classification. u Clustering.
Mining Different Kinds of Knowledge (by J. Han)

u
Characterization: Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions. Association: Rules like inside(x, city) near(x, highway). Classification: Classify data based on the values in a classifying attribute, e.g., classify countries based on climate. Clustering: Cluster data to form new classes, e.g., cluster houses to find distribution patterns. Trend and deviation analysis : Find and characterize evolution trend, sequential patterns, similar sequences, and deviation data, e.g., housing market analysis. Pattern-directed analysis : Find and characterize user-specified patterns in large databases, e.g., volcanoes on Mars.
u u
Example: Retail
u
Bar-code technology makes possible to collect and store massive amounts of sales data (the basked data). u Information driven marketing process demands mining association rules over basket data.
u
``98% of customers that purchase tires and auto accessories also get service done
Application of Rules
u
Cross-marketing and attached mailing applications.

u
customized and designed junk-mail
catalog design, add-on sales. u store layout u customer segmentation based on buying patterns.
Illustration
Product A Customer 1 Customer 2 Customer 3
Product B
Product C
Product D
X X
X X X X X X
``90% of purchases that have bread and butter also include milk. u It is a rule of the form A B (90%). u A is the antecedent. u B is the consequent. u There is a confidence value associated to the rule.
u
14
Usage of mining for rules

u Find all rules that have Diet Coke as the consequent. u Help plan what should be done to boast the sales of Diet Coke. u Find all the rules that have bagels in the antecedent. u Determine what products may be impacted if the store discontinues selling bagels.
15
Usage of mining rules

u
Find all rules that have sausage in the antecedent and mustard in the consequent u what items should be sold with sausages to make highly likely that mustard will also be sold. Find all rules relating items located in shelves A and B u understand if the distance affects the sales of items from both shelves.
16
The Knowledge Discovery Process

Arrangement Preprocessing Selection Data Mining Interpretation
Data
Target Data
Preprocessed Transformed Data Data
Patterns
Knowledge
17
A formal process (and a standard)
18
Multi-disciplinary
u
Finding useful patterns in data is known by different names among different communities:
u u u u u
data mining (statistics, databases) knowledge discovery information discovery, information harvesting data archeology pattern processing
A multi-disciplinary field
Data Bases
Statistic
Visualization Machine Learning Logic Programming

Knowledge Discovery
u
Discovery of useful knowledge from data.

u u u u u u u u u
Databases Machine Learning Pattern Recognition Artificial Intelligence Knowledge Acquisition Scientific Discovery High-Performance Computing Algorithms (Analysis and Design) Statistics
Differences with Statistics
22
A significant amount of overlap

u Statistics u KDDM
probabilistic models u descriptive, nonparametric, exploratory u mathematically sound (advanced) u informative and predictive
u
concern for computability and scalability u interpret data u understandable u data on electronic media (and structured)
u
Exploratory Analysis of Massive Data Sets

Extent of the Field
24
Large number of conferences and specialized workshops

u
KDD - ACM but now IEEE conference on KDD SIAM conference on KDD PAKDD (2001 - 5th Pacific Rim Conf. on KDD) PKDD (2000 - 4th European Conf. on Principles and Practice of KDD) Data Base conferences u SIGMOD
u u
u u u
VLDB Artificial Intelligence and Machine Learning u ICML u ECML

Successful Example
u Recent
u u u
analysis of Bank of America loan database
250 fields per customer back to 1914! over nine million records
u A clustering tools was used to automatically
segment customers into groups with many similar categorical attributes.

u 14 groups identified, only one could be explained
u
The interesting cluster had 2 properties

u u
39% of customers had business and personal accounts with the bank this cluster accounted for 27% of the 11% of customers that had been classified by a decision tree as likely respondents to a home equity loan offer
Text Mining
u u
With / without Natural Language Processing Different from Information Retrieval/Extraction
u Results:
u There
exists a text (or a sets of texts) such as speak-much-of Miss Lewinsky & speak-littleof Mrs. Clinton u There exists another text such as speak-a-lot-of Mrs. Clinton & do-not-speak-of Miss Lewinsky
Spatial Data Mining

u
Data Mining system generates hypothesis search (and visualization) in abstract space inductive generalizations exceeding content of database
GIS user generates hypothesis visualization in geographical space shows whats inside the data
hard to visualize multivariate dependencies on a map
28
Spatial Data Mining

Thematic Map Decision Tree
29
Spatial Associations
FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf Course" FROM Washington_Golf_courses, Washington WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km") AND Washington.CFCC <> "D81" IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC SET SUPPORT THRESHOLD 0.5
30
Spatial Associations & Hierarchy of Spatial Relationships

u
Spatial association: Association relationship containing spatial predicates, e.g., close_to, intersect, contains, etc.
u
Topological relations: u intersects, overlaps, disjoint, etc. Spatial orientations: u left_of, west_of, under, etc. Distance information:
u close_to, within_distance, etc. Hierarchy of spatial relationship:
u u
g_close_to: near_by, touch, intersect, contain, etc. First search for rough relationship and then refine it.
Example: Spatial Association Rule Mining

u What
kinds of spatial objects are close to each other in
B.C.?
u
Kinds of objects: cities, water, forests, usa_boundary, mines, etc.
u Rules
u u
mined:
is_a(x, large_town) ^ intersect(x, highway) adjacent_to(x, water). [7%, 85%] is_a(x, large_town) ^adjacent_to(x, georgia_strait) close_to(x, u.s.a.). [1%, 78%]
u Mining method:
Apriori + multi-level association + geospatial algorithms (from rough to high precision).

Spatial Classification
u Generalization-
based induction u Interactive classification

Visualization of Predicted Distribution
Spatial Prediction and Trend Analysis

u Spatial
u u u u
trend predictive modeling (Ester et al97):
Discover centers: local maximal of some non-spatial attribute. Determine the (theoretical) trend of some non-spatial attribute, when moving away from the centers. Discover deviations (from the theoretical trend). Explain the deviations.
u Example:
Trend of unemployment rate change according to the distance to Munich. u Similar modeling can be used to study trend of temperature with the altitude, degree of pollution in relevance to the regions of population density, etc.
Spatial Clustering
u How
can we cluster points? u What are the distinct features of the clusters?
There are more customers with university degrees in clusters located in the West. Thus, we can use different marketing strategies!
Mining in Image & Raster Databases

u Magellan
u u u u
project (Fayyad et al.96, JPL).
identify volcanos on Venus surface over 30,000 high resolution images Resolution accuracy: 80% 3 steps: data focusing, feature extraction, and classification learning
u POSSII
u u u
project (Palomar Obervatory Sky Survey II, )
2x stellar objects (galaxies, stars, etc.) classified Resolution:one magnitude better than in previous studies Classification accuracy: no normalization 75%, with normalization 94%, and compared with neural networks.
109
u QuakeFinder
(Stolorz et al96): Find earth quakes from
space.
u
using statistics, massive parallelism, and global optimization

Basket Analysis / Link Analysis

u Amazon.com
u u
Who bought what together What are related books
u Finding
u
who has fax at home
Get phone data for business with a fax number u Get usage records of lines to find who dials a business fax number from home for larger than 20 seconds
Crime detection
u crime
investigation (e.g., the Okalahoma City bombing) u fraud detection [Italy KDD-99 San Diego] but also Australian Taxation Office and HIC [PAKDD-99]
u
Characterization of Doctor Shoppers
39
Software Engineering and Data Mining
40
KDD techniques used in Software Engineering

Clustering
u (Mancoridis):
u Graph
Clustering for graph partitioning towards high-coupling and low-cohesion.

partitioning algorithms using hill-climbing
u (Ouyang)
Clustering towards improving the reusability in the design phase. u (Anquetil) Re-modularization of software
Decomposition into subsystemsNon OO case

u
u
Input from ISA

u
Software S made of a set of programs P, and set of files F.

p1p2p3p4p5 f1 X X X X f2 X X f3 X X X f4 X X X X
c p3= 3/4, cp4= 3/3
Representation of database of the system

A set alpha : A={P, F} : P P , F F
Creation of a grouping table

Programason rows, Files in columnss T = t1 , t2 , ... , t|F| ti={ p P | p uses fi}
u
KDDM
Associaiton rules c% of the programs that use file X also use files Y and Z. Results in a group of programs that uses a similar set of files.
p4 p3 (100%)
Colect and summirize results

u
Association rules are used to guide clustering

u
A set of programs and files makes a subsystem=> SUBSISTEMA

Decomposition into Subsystems - OO

u u
Using metrics, OO, KDDM and clustering to split with low coupling. Approach: ( Create list of related entity pairs (classes, methods and objects) ( Use OO metrics to create sets of metrics CBOSet, RFCSet, and DACSet (DAC: Data Abstraction Coupling) CBO_d, CBO_d, and DAC_t ( Generate matrices of classes that interact (matrices de interaccion) ( Apply KDDM algorithms (association rules)
CBOSet Interaction Matrix
Use hierarchical clustering on the coefficients of the association rules to produce a hierarchical decomposition.
Case Study - OO Sistema II

Mozilla - Netscape Communicator
u
HTML Editor
MOZILLA
HTML Composer/Editor - Mozilla SUMMARY Symbol Table Statistics of 30 projects ============================================== Files: 111 Includes: 697 Macros: 147 Functions: 60 Types: 0 Variables: 79 Enums: 4 Userdef: 0 Classes: 90 Inst Vars: 415 Methods: 1320 Friends: 25 Localdefs: 46 SUMMARY File Type Statistics of 30 projects ============================================== File Type: Number of Files: HTML Header IDL Interface Image Implementation Project Description 4 65 4 57 42 38
SUMMARY Symbol Table Statistics of 1223 projects ======================================== Files: 6713 Includes: 36492 Macros: 27024 Functions: 15898 Types: 3176 Variables: 11151 Enums: 715 Userdef: 0 Classes: 5933 Instance Variables: 23757 Methods: 41015 Friends: 273 Localdefs: 290 SUMMARY File Type Statistics of 1223 projects ======================================== File Type Number of files HTML Header Implementation Make Project Description 763 3331 3382 117 1309
Mozilla Statistics
HTML Editor Statistics

Graph Mining and WEB Mining

u Step
beyond link analysis

sub-graphs that are similar
molecules CAD/CAM parts (designs)
u Finding
u
u Chemical
Patterns of use u Analysis of WEB data u Text Mining/ Multimedia Mining

Time Series Mining

u
Predicting stock market Monitoring condition of equipment, weather, pilot behavior during long flights.
46
An example of generalization
u
For attribute-oriented data
47
Automated Attribute Oriented Induction

u
Illustration of basic strategies

based on a hierarchy of concepts, u where discovery initiated by a query for a rule with necessary conditions for a class.
u
48
The data set

NAME Anderson Bach Carlton Fraser STATUS M.A. junior junior M.S. PhD sophomore senior PhD MAJOR History Math Lib. Arts Physics Math Chemistry Computing Biology BIRTH PLACE Vancouver Calgary Edmonton Ottawa Bombay Richmond Victoria Shanghai GPA 3.5 3.7 2.6 3.9 3.3 2.7 3.5 3.4
STUDENT
Gupta Hart Jackson Liu
Wang Wise
M.S. freshman
Statistics Literature
Nanjing Toronto
3.2 3.9 49
The concept hierarchy

For attribute STATUS undergraduate ANY
graduate
freshman sophomore
junior
senior
M.A.
M.S.
PhD
50
The discovery query

u
In relation STUDENT, learn characteristic rule for STATUS=graduate in relevance to NAME, BIRTH PLACE, GPA u (threshold value of 3)
51
The induction
u 1)
Select graduate students.

STATUS M.A. M.S. PhD PhD PhD MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 1 1
NAME Anderson Frazer Gupta Liu Monk
Wang
M.S.
Statistics
Nanjing
3.2
52
The induction
u 1)
Select graduate students.

STATUS M.A. M.S. PhD PhD PhD MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 1 1
NAME Anderson Frazer Gupta Liu Monk
Wang
M.S.
Statistics
Nanjing
3.2
53
The induction
u 2)
Eliminate attribute STATUS because it is graduate for all students.

NAME Anderson Frazer Gupta Liu Monk MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 Wang Statistics Nanjing 3.2 1 54
The induction
u 3)
Generalize on the smallest decomposable components.

u
Not illustrated here, because we do not have composite attributes.
55
The induction
u
4) If there is a large set of distinct values for an attribute but there is no higher level concept provided for the attribute, the attribute should be removed.
u
Attribute NAME satisfies this.
56
The induction
u 2)
Eliminate attribute NAME because it has to many values.

NAME Anderson Frazer Gupta Liu Monk MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 Wang Statistics Nanjing 3.2 1 57
A generalization
MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 2 3 1 1 2
Statistics
Nanjing
3.2
The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.
Concept tree ascension

u
4) If there is a higher level concept in the concept tree for an attribute, the substitution of the higher level concept generalizes the tuple.
Science Arts
Statistics
Computing
Biology
Math
Physics
History
59
A generalization of each attribute

MAJOR Art Science Science Science Science BIRTH PLACE British Columbia Ontario British Columbia India China GPA excellent excellent excellent good good VOTE 35 10 30 10 15
Threshold Control
u
If the number of distinct values of an attribute is larger than the generalization threshold value, further generalization on this attribute should be performed.
u u
Not the case for Major It is the case for Birth Place
61
Further generalization
MAJOR Art Science Science
BIRTH PLACE Canada Canada foreign
GPA excellent excellent good
VOTE 35 40 25
Transformation to rules
MAJOR {Art, Science} Science BIRTH PLACE Canada GPA excellent VOTE 37
foreign
good
25
A graduate student is either (with 75% probability) a Canadian with excellent GPA or (with 25% probability) a foreign student, majoring in science with a good GPA.
63
Questions?
64

Algorithmics Research On Knowledge Discovery and Data Mining

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Algorithmics Research On Knowledge Discovery and Data Mining

Încărcat de

Drepturi de autor:

Formate disponibile

Algorithmics Research on Knowledge Discovery and Data Mining

Motivation What is Data Mining Extent of The Field

Mining, Text Mining, Software Engineering and Data Mining

The problem of data overload looms ominously ahead.

Vladimir Estivill -Castro

Interest for Knowledge Discovery

Large logs of transactions

Credit card transactions Supermarket transactions

Data Mining (1960s)

Deductive vs Inductive Science

Aspects of Knowledge Discovery

for the user simple, descriptive, informative

Vladimir Estivill -Castro

What is Data Mining

What is Data Mining?

Using computers to make sense of large volumes of data

a set of fundamental data manipulation tasks

classification association rules / basket analysis

what makes it different from Statistics?

A task in the process of knowledge discovery A multidisciplinary field

makes it different from Statistics?

Mining Different Kinds of Knowledge

Vladimir Estivill -Castro

Mining Different Kinds of Knowledge (by J. Han)

Cross-marketing and attached mailing applications.

customized and designed junk-mail

Product A Customer 1 Customer 2 Customer 3

Vladimir Estivill -Castro

Usage of mining for rules

Vladimir Estivill -Castro

Usage of mining rules

Vladimir Estivill -Castro

The Knowledge Discovery Process

Preprocessed Transformed Data Data

Vladimir Estivill -Castro

A formal process (and a standard)

Vladimir Estivill -Castro

Visualization Machine Learning Logic Programming

Discovery of useful knowledge from data.

Differences with Statistics

Vladimir Estivill -Castro

A significant amount of overlap

Exploratory Analysis of Massive Data Sets

Extent of the Field

Large number of conferences and specialized workshops

VLDB Artificial Intelligence and Machine Learning u ICML u ECML

analysis of Bank of America loan database

u A clustering tools was used to automatically

segment customers into groups with many similar categorical attributes.

The interesting cluster had 2 properties

With / without Natural Language Processing Different from Information Retrieval/Extraction

Spatial Data Mining

hard to visualize multivariate dependencies on a map

Vladimir Estivill -Castro

Spatial Data Mining

Vladimir Estivill -Castro

Vladimir Estivill -Castro

Spatial Associations & Hierarchy of Spatial Relationships

u close_to, within_distance, etc. Hierarchy of spatial relationship:

Vladimir Estivill -Castro

Example: Spatial Association Rule Mining

kinds of spatial objects are close to each other in

Kinds of objects: cities, water, forests, usa_boundary, mines, etc.