Sunteți pe pagina 1din 32

Algorithmics Research on Knowledge Discovery and Data Mining

Vladimir Estivill-Castro
School of Computing and Information Technology

Outline
u u u

Motivation What is Data Mining Extent of The Field


u Web

Mining, Text Mining, Software Engineering and Data Mining

u Example

of algorithm
Oriented Generalization
Vladimir Estivill -Castro 2

u Attribute

Motivation

Motivation
u

The problem of data overload looms ominously ahead.


u

Our technology to analyze data and understand massive datasets lags far behind our technology to gather and store data.

Vladimir Estivill -Castro

Interest for Knowledge Discovery


u

Emerged in Databases
u

Large logs of transactions


u u

Credit card transactions Supermarket transactions

Data Mining (1960s)


u

Massaging the data so statistics would reflect my preconceived hypothesis Data => Hypothesis vs Hypothesis validated by data
Vladimir Estivill -Castro 5

Deductive vs Inductive Science


u

Knowledge Discovery
The nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patters in large data sets.
Data: The geo-referenced layers. u Information: The average population per administrative region u Knowledge: The patterns of growth of population densities and valid explanations for them.
u
Vladimir Estivill -Castro 7 6

Aspects of Knowledge Discovery


u

Nontrivial
u

u u

Potentially useful
u

it goes beyond computing closed- from quantities or evaluating models. the discovered patterns are true with some degree of certainty of unseen data

for the user simple, descriptive, informative

Understandable
u

Valid
u

Vladimir Estivill -Castro

What is Data Mining

What is Data Mining?


u

Different answers

Using computers to make sense of large volumes of data


u using
u u

a set of fundamental data manipulation tasks

classification association rules / basket analysis

what makes it different from Statistics?

A task in the process of knowledge discovery A multidisciplinary field


u what

makes it different from Statistics?


Vladimir Estivill -Castro 9

Mining Different Kinds of Knowledge


u Description u Estimation. u Prediction. u Affinity

Grouping .

u Classification. u Clustering.

Vladimir Estivill -Castro

Mining Different Kinds of Knowledge (by J. Han)


u

Characterization: Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions. Association: Rules like inside(x, city) near(x, highway). Classification: Classify data based on the values in a classifying attribute, e.g., classify countries based on climate. Clustering: Cluster data to form new classes, e.g., cluster houses to find distribution patterns. Trend and deviation analysis : Find and characterize evolution trend, sequential patterns, similar sequences, and deviation data, e.g., housing market analysis. Pattern-directed analysis : Find and characterize user-specified patterns in large databases, e.g., volcanoes on Mars.
Vladimir Estivill -Castro

u u

Example: Retail
u

Bar-code technology makes possible to collect and store massive amounts of sales data (the basked data). u Information driven marketing process demands mining association rules over basket data.
u

``98% of customers that purchase tires and auto accessories also get service done
Vladimir Estivill -Castro 12

Application of Rules
u

Cross-marketing and attached mailing applications.


u

customized and designed junk-mail

catalog design, add-on sales. u store layout u customer segmentation based on buying patterns.
Vladimir Estivill -Castro 13

Illustration

Product A Customer 1 Customer 2 Customer 3

Product B

Product C

Product D

X X

X X X X X X

``90% of purchases that have bread and butter also include milk. u It is a rule of the form A B (90%). u A is the antecedent. u B is the consequent. u There is a confidence value associated to the rule.
u

Vladimir Estivill -Castro

14

Usage of mining for rules


u Find all rules that have Diet Coke as the consequent. u Help plan what should be done to boast the sales of Diet Coke. u Find all the rules that have bagels in the antecedent. u Determine what products may be impacted if the store discontinues selling bagels.

Vladimir Estivill -Castro

15

Usage of mining rules


u

Find all rules that have sausage in the antecedent and mustard in the consequent u what items should be sold with sausages to make highly likely that mustard will also be sold. Find all rules relating items located in shelves A and B u understand if the distance affects the sales of items from both shelves.

Vladimir Estivill -Castro

16

The Knowledge Discovery Process


Arrangement Preprocessing Selection Data Mining Interpretation

Data

Target Data

Preprocessed Transformed Data Data

Patterns

Knowledge

Vladimir Estivill -Castro

17

A formal process (and a standard)

Vladimir Estivill -Castro

18

Multi-disciplinary
u

Finding useful patterns in data is known by different names among different communities:
u u u u u

data mining (statistics, databases) knowledge discovery information discovery, information harvesting data archeology pattern processing
Vladimir Estivill -Castro 19

A multi-disciplinary field

Data Bases

Statistic

Visualization Machine Learning Logic Programming


Vladimir Estivill -Castro 20

Knowledge Discovery
u

Discovery of useful knowledge from data.


u u u u u u u u u

Databases Machine Learning Pattern Recognition Artificial Intelligence Knowledge Acquisition Scientific Discovery High-Performance Computing Algorithms (Analysis and Design) Statistics
Vladimir Estivill -Castro 21

Differences with Statistics

Vladimir Estivill -Castro

22

A significant amount of overlap


u Statistics u KDDM

probabilistic models u descriptive, nonparametric, exploratory u mathematically sound (advanced) u informative and predictive
u

concern for computability and scalability u interpret data u understandable u data on electronic media (and structured)
u

Exploratory Analysis of Massive Data Sets


Vladimir Estivill -Castro 23

Extent of the Field

24

Large number of conferences and specialized workshops


u

KDD - ACM but now IEEE conference on KDD SIAM conference on KDD PAKDD (2001 - 5th Pacific Rim Conf. on KDD) PKDD (2000 - 4th European Conf. on Principles and Practice of KDD) Data Base conferences u SIGMOD
u u

u u u

VLDB Artificial Intelligence and Machine Learning u ICML u ECML


Vladimir Estivill -Castro 25

Successful Example
u Recent
u u u

analysis of Bank of America loan database

250 fields per customer back to 1914! over nine million records

u A clustering tools was used to automatically

segment customers into groups with many similar categorical attributes.


u 14 groups identified, only one could be explained
u

The interesting cluster had 2 properties


u u

39% of customers had business and personal accounts with the bank this cluster accounted for 27% of the 11% of customers that had been classified by a decision tree as likely respondents to a home equity loan offer
Vladimir Estivill -Castro 26

Text Mining
u u

With / without Natural Language Processing Different from Information Retrieval/Extraction

u Results:
u There

exists a text (or a sets of texts) such as speak-much-of Miss Lewinsky & speak-littleof Mrs. Clinton u There exists another text such as speak-a-lot-of Mrs. Clinton & do-not-speak-of Miss Lewinsky
Vladimir Estivill -Castro 27

Spatial Data Mining


u

Data Mining system generates hypothesis search (and visualization) in abstract space inductive generalizations exceeding content of database

GIS user generates hypothesis visualization in geographical space shows whats inside the data

hard to visualize multivariate dependencies on a map

Vladimir Estivill -Castro

28

Spatial Data Mining


Thematic Map Decision Tree

Vladimir Estivill -Castro

29

Spatial Associations
FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf Course" FROM Washington_Golf_courses, Washington WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km") AND Washington.CFCC <> "D81" IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC SET SUPPORT THRESHOLD 0.5

Vladimir Estivill -Castro

30

Spatial Associations & Hierarchy of Spatial Relationships


u

Spatial association: Association relationship containing spatial predicates, e.g., close_to, intersect, contains, etc.
u

Topological relations: u intersects, overlaps, disjoint, etc. Spatial orientations: u left_of, west_of, under, etc. Distance information:

u close_to, within_distance, etc. Hierarchy of spatial relationship:

u u

g_close_to: near_by, touch, intersect, contain, etc. First search for rough relationship and then refine it.

Vladimir Estivill -Castro

Example: Spatial Association Rule Mining


u What

kinds of spatial objects are close to each other in

B.C.?
u

Kinds of objects: cities, water, forests, usa_boundary, mines, etc.

u Rules
u u

mined:

is_a(x, large_town) ^ intersect(x, highway) adjacent_to(x, water). [7%, 85%] is_a(x, large_town) ^adjacent_to(x, georgia_strait) close_to(x, u.s.a.). [1%, 78%]

u Mining method:

Apriori + multi-level association + geospatial algorithms (from rough to high precision).


Vladimir Estivill -Castro

Spatial Classification

u Generalization-

based induction u Interactive classification


Vladimir Estivill -Castro 33

Visualization of Predicted Distribution

Vladimir Estivill -Castro

Spatial Prediction and Trend Analysis


u Spatial
u u u u

trend predictive modeling (Ester et al97):

Discover centers: local maximal of some non-spatial attribute. Determine the (theoretical) trend of some non-spatial attribute, when moving away from the centers. Discover deviations (from the theoretical trend). Explain the deviations.

u Example:

Trend of unemployment rate change according to the distance to Munich. u Similar modeling can be used to study trend of temperature with the altitude, degree of pollution in relevance to the regions of population density, etc.
Vladimir Estivill -Castro

Spatial Clustering
u How

can we cluster points? u What are the distinct features of the clusters?

There are more customers with university degrees in clusters located in the West. Thus, we can use different marketing strategies!
Vladimir Estivill -Castro 36

Mining in Image & Raster Databases


u Magellan
u u u u

project (Fayyad et al.96, JPL).

identify volcanos on Venus surface over 30,000 high resolution images Resolution accuracy: 80% 3 steps: data focusing, feature extraction, and classification learning

u POSSII
u u u

project (Palomar Obervatory Sky Survey II, )

2x stellar objects (galaxies, stars, etc.) classified Resolution:one magnitude better than in previous studies Classification accuracy: no normalization 75%, with normalization 94%, and compared with neural networks.

109

u QuakeFinder

(Stolorz et al96): Find earth quakes from

space.
u

using statistics, massive parallelism, and global optimization


Vladimir Estivill -Castro

Basket Analysis / Link Analysis


u Amazon.com
u u

Who bought what together What are related books

u Finding
u

who has fax at home

Get phone data for business with a fax number u Get usage records of lines to find who dials a business fax number from home for larger than 20 seconds
Vladimir Estivill -Castro 38

Crime detection
u crime

investigation (e.g., the Okalahoma City bombing) u fraud detection [Italy KDD-99 San Diego] but also Australian Taxation Office and HIC [PAKDD-99]
u

Characterization of Doctor Shoppers

Vladimir Estivill -Castro

39

Software Engineering and Data Mining

Vladimir Estivill -Castro

40

KDD techniques used in Software Engineering


Clustering
u (Mancoridis):
u Graph

Clustering for graph partitioning towards high-coupling and low-cohesion.


partitioning algorithms using hill-climbing

u (Ouyang)

Clustering towards improving the reusability in the design phase. u (Anquetil) Re-modularization of software
Vladimir Estivill -Castro 41

Decomposition into subsystemsNon OO case


u
u

Input from ISA


u

Software S made of a set of programs P, and set of files F.


p1p2p3p4p5 f1 X X X X f2 X X f3 X X X f4 X X X X
c p3= 3/4, cp4= 3/3

Representation of database of the system


A set alpha : A={P, F} : P P , F F

Creation of a grouping table


Programason rows, Files in columnss T = t1 , t2 , ... , t|F| ti={ p P | p uses fi}
u

KDDM
Associaiton rules c% of the programs that use file X also use files Y and Z. Results in a group of programs that uses a similar set of files.

p4 p3 (100%)

Colect and summirize results


u

Association rules are used to guide clustering


u

A set of programs and files makes a subsystem=> SUBSISTEMA


Vladimir Estivill -Castro 42

Decomposition into Subsystems - OO


u u

Using metrics, OO, KDDM and clustering to split with low coupling. Approach: ( Create list of related entity pairs (classes, methods and objects) ( Use OO metrics to create sets of metrics CBOSet, RFCSet, and DACSet (DAC: Data Abstraction Coupling) CBO_d, CBO_d, and DAC_t ( Generate matrices of classes that interact (matrices de interaccion) ( Apply KDDM algorithms (association rules)

CBOSet Interaction Matrix

Use hierarchical clustering on the coefficients of the association rules to produce a hierarchical decomposition.
Vladimir Estivill -Castro 43

Case Study - OO Sistema II


Mozilla - Netscape Communicator
u

HTML Editor
MOZILLA
HTML Composer/Editor - Mozilla SUMMARY Symbol Table Statistics of 30 projects ============================================== Files: 111 Includes: 697 Macros: 147 Functions: 60 Types: 0 Variables: 79 Enums: 4 Userdef: 0 Classes: 90 Inst Vars: 415 Methods: 1320 Friends: 25 Localdefs: 46 SUMMARY File Type Statistics of 30 projects ============================================== File Type: Number of Files: HTML Header IDL Interface Image Implementation Project Description 4 65 4 57 42 38

SUMMARY Symbol Table Statistics of 1223 projects ======================================== Files: 6713 Includes: 36492 Macros: 27024 Functions: 15898 Types: 3176 Variables: 11151 Enums: 715 Userdef: 0 Classes: 5933 Instance Variables: 23757 Methods: 41015 Friends: 273 Localdefs: 290 SUMMARY File Type Statistics of 1223 projects ======================================== File Type Number of files HTML Header Implementation Make Project Description 763 3331 3382 117 1309

Mozilla Statistics

HTML Editor Statistics


Vladimir Estivill -Castro 44

Graph Mining and WEB Mining


u Step

beyond link analysis


sub-graphs that are similar
molecules CAD/CAM parts (designs)

u Finding
u

u Chemical

Patterns of use u Analysis of WEB data u Text Mining/ Multimedia Mining


Vladimir Estivill -Castro 45

Time Series Mining


u

Predicting stock market Monitoring condition of equipment, weather, pilot behavior during long flights.

Vladimir Estivill -Castro

46

An example of generalization
u

For attribute-oriented data

Vladimir Estivill -Castro

47

Automated Attribute Oriented Induction


u

Illustration of basic strategies


based on a hierarchy of concepts, u where discovery initiated by a query for a rule with necessary conditions for a class.
u

Vladimir Estivill -Castro

48

The data set


NAME Anderson Bach Carlton Fraser STATUS M.A. junior junior M.S. PhD sophomore senior PhD MAJOR History Math Lib. Arts Physics Math Chemistry Computing Biology BIRTH PLACE Vancouver Calgary Edmonton Ottawa Bombay Richmond Victoria Shanghai GPA 3.5 3.7 2.6 3.9 3.3 2.7 3.5 3.4

STUDENT

Gupta Hart Jackson Liu

Wang Wise

M.S. freshman

Statistics Literature

Nanjing Toronto

3.2 3.9 49

Vladimir Estivill -Castro

The concept hierarchy


For attribute STATUS undergraduate ANY

graduate

freshman sophomore

junior

senior

M.A.

M.S.

PhD

Vladimir Estivill -Castro

50

The discovery query


u

In relation STUDENT, learn characteristic rule for STATUS=graduate in relevance to NAME, BIRTH PLACE, GPA u (threshold value of 3)

Vladimir Estivill -Castro

51

The induction
u 1)

Select graduate students.


STATUS M.A. M.S. PhD PhD PhD MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 1 1

NAME Anderson Frazer Gupta Liu Monk

Wang

M.S.

Statistics

Nanjing

3.2

Vladimir Estivill -Castro

52

The induction
u 1)

Select graduate students.


STATUS M.A. M.S. PhD PhD PhD MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 1 1

NAME Anderson Frazer Gupta Liu Monk

Wang

M.S.

Statistics

Nanjing

3.2

Vladimir Estivill -Castro

53

The induction
u 2)

Eliminate attribute STATUS because it is graduate for all students.


NAME Anderson Frazer Gupta Liu Monk MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 Wang Statistics Nanjing 3.2 1 54

Vladimir Estivill -Castro

The induction
u 3)

Generalize on the smallest decomposable components.


u

Not illustrated here, because we do not have composite attributes.

Vladimir Estivill -Castro

55

The induction
u

4) If there is a large set of distinct values for an attribute but there is no higher level concept provided for the attribute, the attribute should be removed.
u

Attribute NAME satisfies this.

Vladimir Estivill -Castro

56

The induction
u 2)

Eliminate attribute NAME because it has to many values.


NAME Anderson Frazer Gupta Liu Monk MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 Wang Statistics Nanjing 3.2 1 57

Vladimir Estivill -Castro

A generalization
MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 2 3 1 1 2

Statistics

Nanjing

3.2

The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.
Vladimir Estivill -Castro 58

Concept tree ascension


u

4) If there is a higher level concept in the concept tree for an attribute, the substitution of the higher level concept generalizes the tuple.
Science Arts

Statistics

Computing

Biology

Math

Physics

History

Vladimir Estivill -Castro

59

A generalization of each attribute


MAJOR Art Science Science Science Science BIRTH PLACE British Columbia Ontario British Columbia India China GPA excellent excellent excellent good good VOTE 35 10 30 10 15

The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.
Vladimir Estivill -Castro 60

Threshold Control
u

If the number of distinct values of an attribute is larger than the generalization threshold value, further generalization on this attribute should be performed.
u u

Not the case for Major It is the case for Birth Place

Vladimir Estivill -Castro

61

Further generalization

MAJOR Art Science Science

BIRTH PLACE Canada Canada foreign

GPA excellent excellent good

VOTE 35 40 25

The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.
Vladimir Estivill -Castro 62

Transformation to rules
MAJOR {Art, Science} Science BIRTH PLACE Canada GPA excellent VOTE 37

foreign

good

25

A graduate student is either (with 75% probability) a Canadian with excellent GPA or (with 25% probability) a foreign student, majoring in science with a good GPA.

Vladimir Estivill -Castro

63

Questions?

64

S-ar putea să vă placă și