Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory
and Advanced Topics, Prentice Hall, 2002.
© Prentice Hall 1
Data Mining Outline
• PART I
– Introduction
– Related Concepts
– Data Mining Techniques
• PART II
– Classification
– Clustering
– Association Rules
• PART III
– Web Mining
– Spatial Mining
– Temporal Mining
Introduction Outline
Introduction
Data Mining Definition
Data Mining Algorithm
Database Processing vs. Data Mining Processing
• Query
– Database: well defined; SQL
– Data mining: poorly defined; no precise query language
• Data
– Database: operational data
– Data mining: not operational data
• Output
– Database: precise; subset of database
– Data mining: fuzzy; not a subset of database
Query Examples
• Database
– Find all credit applicants with the last name Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit risks. (Classification)
– Identify customers with similar buying habits. (Clustering)
– Find all items that are frequently purchased with milk. (Association rules)
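The contrast between the two query styles can be sketched in a few lines. This is only an illustration: the transactions and the `min_support` threshold are hypothetical, not taken from the slides. The first query is an exact filter with a well-defined answer; the second asks which items co-occur with milk often enough, so its answer depends on a chosen threshold.

```python
from collections import Counter

# Hypothetical market-basket data, invented for illustration.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"milk", "cereal", "bread"},
    {"bread", "jam"},
]

# Database-style query: an exact, well-defined filter.
milk_baskets = [t for t in transactions if "milk" in t]

# Mining-style query: which items appear together with milk often enough?
co_counts = Counter(item for t in milk_baskets for item in t if item != "milk")
min_support = 0.5  # assumed threshold, as a fraction of all transactions
frequent_with_milk = sorted(
    item for item, n in co_counts.items() if n / len(transactions) >= min_support
)

print(len(milk_baskets), frequent_with_milk)
```

Changing `min_support` changes the mining answer, while the database answer stays fixed.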
Data Mining Models and Tasks
Basic Data Mining Tasks
Basic Data Mining Tasks (cont’d)
Ex: Time Series Analysis
Data Mining vs. KDD
KDD Process
Data Mining Development
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms
KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
KDD Issues (cont’d)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application
Social Implications of DM
• Privacy
• Profiling
• Unauthorized use
Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time
Database Perspective on Data Mining
• Scalability
• Real World Data
• Updates
• Ease of Use
Visualization Techniques
• Graphical
• Geometric
• Icon-based
• Pixel-based
• Hierarchical
• Hybrid
Related Concepts Outline
DB & OLTP Systems
• Schema
– (ID,Name,Address,Salary,JobNo)
• Data Model
– ER
– Relational
• Transaction
• Query:
SELECT Name
FROM T
WHERE Salary > 100000
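A minimal sketch of running this query, using Python's built-in `sqlite3` module and an in-memory table `T` with the schema above. The column types and the three rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumed; the slide gives only the attribute names.
conn.execute(
    "CREATE TABLE T (ID INTEGER PRIMARY KEY, Name TEXT, Address TEXT, "
    "Salary REAL, JobNo INTEGER)"
)
# Hypothetical rows, invented for illustration.
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ana", "1 Elm St", 120000, 10),
        (2, "Ben", "2 Oak Ave", 85000, 20),
        (3, "Cy", "3 Pine Rd", 100000, 10),
    ],
)
# The query from the slide: a precise condition with a precise answer.
names = [row[0] for row in conn.execute("SELECT Name FROM T WHERE Salary > 100000")]
conn.close()

print(names)
```

The result is a subset of the stored data, which is the kind of output the database-processing side of the earlier comparison describes.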
Fuzzy Sets and Logic
Fuzzy Sets
Classification/Prediction is Fuzzy
[Figure: two panels, Simple and Fuzzy, each marking an Accept region]
Information Retrieval
Information Retrieval (cont’d)
IR Classification
Dimensional Modeling
• Dimensional modeling is a different way to view and
interrogate data in a database.
• Views data hierarchically, much as business
executives might.
• Useful in decision support systems and mining.
• Dimension: collection of logically related attributes;
an axis for modeling data. Helps to describe dimensional
values.
• Facts: the specific data stored.
• Fact table: consists of facts and foreign keys to
dimension tables.
• Ex: Dimensions – products, locations, date
Facts – quantity, unit price
Relational View of Data
ProdID  LocID       Date    Quantity  UnitPrice
123     Dallas      022900  5         25
123     Houston     020100  10        20
150     Dallas      031500  1         100
150     Dallas      031500  5         95
150     Fort Worth  021000  5         80
150     Chicago     012000  20        75
200     Seattle     030100  5         50
300     Rochester   021500  200       5
500     Bradenton   022000  15        20
500     Chicago     012000  10        25
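As a rough sketch of how facts are summarized along a dimension, the rows of this table can be aggregated by product or by location. The helper name `total_quantity_by` is hypothetical; the data is copied from the table.

```python
from collections import defaultdict

# Fact rows copied from the table: (ProdID, LocID, Date, Quantity, UnitPrice).
facts = [
    (123, "Dallas", "022900", 5, 25),
    (123, "Houston", "020100", 10, 20),
    (150, "Dallas", "031500", 1, 100),
    (150, "Dallas", "031500", 5, 95),
    (150, "Fort Worth", "021000", 5, 80),
    (150, "Chicago", "012000", 20, 75),
    (200, "Seattle", "030100", 5, 50),
    (300, "Rochester", "021500", 200, 5),
    (500, "Bradenton", "022000", 15, 20),
    (500, "Chicago", "012000", 10, 25),
]

def total_quantity_by(rows, key_index):
    """Sum the Quantity fact over all rows sharing the same dimension value."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[3]
    return dict(totals)

qty_by_product = total_quantity_by(facts, 0)   # group by ProdID
qty_by_location = total_quantity_by(facts, 1)  # group by LocID

print(qty_by_product)
print(qty_by_location)
```

Here the dimension columns act as grouping keys and the fact column is what gets summed, mirroring the dimension/fact split described earlier.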
Dimensional Modeling Queries
Cube view of Data
Aggregation Hierarchies
Star Schema
Data Warehousing
OLAP
OLAP Operations
Roll Up
Drill Down
Statistics
Machine Learning
Pattern Matching (Recognition)
DM vs. Related Topics
Data Mining Techniques Outline
Estimation Error
• Why square?
• Root Mean Square Error (RMSE)
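One common answer to "why square?" is that squaring keeps positive and negative errors from cancelling and penalizes large errors more heavily; taking the root returns the result to the original units. A minimal sketch, with hypothetical estimates and actual values:

```python
import math

def mean_squared_error(estimates, actual):
    """Average of the squared differences between estimates and actual values."""
    return sum((e - a) ** 2 for e, a in zip(estimates, actual)) / len(actual)

def root_mean_squared_error(estimates, actual):
    """Square root of the MSE, expressed in the same units as the data."""
    return math.sqrt(mean_squared_error(estimates, actual))

# Hypothetical values, invented for illustration.
estimates = [2.0, 4.0, 6.0]
actual = [1.0, 4.0, 8.0]

# The raw errors (+1, 0, -2) partly cancel when averaged; the squared ones do not.
mean_raw_error = sum(e - a for e, a in zip(estimates, actual)) / len(actual)
mse = mean_squared_error(estimates, actual)
rmse = root_mean_squared_error(estimates, actual)

print(mean_raw_error, mse, rmse)
```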
Jackknife Estimate
Maximum Likelihood Estimate (MLE)
• Maximize L.
MLE Example
MLE Example (cont’d)
EM Example
EM Algorithm
Models Based on Summarization
Scatter Diagram
Bayes Theorem
Bayes Theorem Example
Bayes Example (cont’d)
• Training Data:
ID  Income  Credit     Class  xi
1   4       Excellent  h1     x4
2   3       Good       h1     x7
3   2       Excellent  h1     x2
4   3       Good       h1     x7
5   4       Good       h1     x8
6   2       Excellent  h1     x2
7   3       Bad        h2     x11
8   2       Bad        h2     x10
9   3       Bad        h3     x11
10  1       Bad        h4     x9
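A sketch of how Bayes' theorem could be applied to this training data, estimating every probability by simple counting. The function name `posterior` is hypothetical, and the sketch only handles xi values that actually occur in the table.

```python
from collections import Counter
from fractions import Fraction

# (class, xi) pairs copied from the training data table.
training = [
    ("h1", "x4"), ("h1", "x7"), ("h1", "x2"), ("h1", "x7"), ("h1", "x8"),
    ("h1", "x2"), ("h2", "x11"), ("h2", "x10"), ("h3", "x11"), ("h4", "x9"),
]

n = len(training)
class_counts = Counter(h for h, _ in training)
pair_counts = Counter(training)
x_counts = Counter(x for _, x in training)

# Prior P(h): fraction of training tuples in each class.
prior = {h: Fraction(c, n) for h, c in class_counts.items()}

def posterior(h, x):
    """P(h | x) = P(x | h) * P(h) / P(x), for an x seen in the training data."""
    likelihood = Fraction(pair_counts[(h, x)], class_counts[h])
    evidence = Fraction(x_counts[x], n)
    return likelihood * prior[h] / evidence

print(prior)
print(posterior("h1", "x4"), posterior("h2", "x11"), posterior("h3", "x11"))
```

Under these counts, x4 appears only with h1, while x11 is split evenly between h2 and h3.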
Bayes Example(cont’d)
Hypothesis Testing
Chi Squared Statistic
• O – observed value
• E – Expected value based on hypothesis.
• Ex:
– O={50,93,67,78,87}
– E=75
– χ² = 15.55 and therefore significant
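The example value can be checked with the usual form of the statistic, the sum of (O − E)²/E over all observations. The significance comparison below is an assumption not stated on the slide: it takes 4 degrees of freedom and a 5% level, for which the critical value is about 9.488.

```python
observed = [50, 93, 67, 78, 87]
expected = 75  # the same expected value for every observation

# Chi-squared statistic: sum of squared deviations scaled by the expected value.
chi_squared = sum((o - expected) ** 2 / expected for o in observed)

# Assumed for illustration: 4 degrees of freedom, 5% significance level.
critical_value = 9.488
significant = chi_squared > critical_value

print(round(chi_squared, 2), significant)
```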
Regression
Linear Regression
Correlation
Similarity Measures
Distance Measures
Twenty Questions Game
Decision Trees
Decision Tree Algorithm
DT Advantages/Disadvantages
• Advantages:
– Easy to understand.
– Easy to generate rules.
• Disadvantages:
– May suffer from overfitting.
– Classifies by rectangular partitioning.
– Does not easily handle nonnumeric data.
– Can be quite large – pruning is necessary.
Neural Networks
• Based on observed functioning of the human brain.
• Also called artificial neural networks (ANN).
• Our view of neural networks is very simplistic.
• We view a neural network (NN) from a graphical
viewpoint.
• Alternatively, an NN may be viewed from the
perspective of matrices.
• Used in pattern recognition, speech recognition,
computer vision, and classification.
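The matrix perspective can be sketched as follows: each layer is a weight matrix applied to an input vector, followed by an activation function. The weights, biases, layer sizes, and the choice of a sigmoid activation are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Sigmoid activation, squashing any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(weights, biases, inputs):
    """One layer as a matrix-vector product: activation(W x + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output node.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]  # one row of weights per hidden node
hidden_b = [0.0, -1.0]
output_w = [[1.0, -1.0]]
output_b = [0.0]

x = [1.0, 1.0]
hidden = layer_forward(hidden_w, hidden_b, x)
output = layer_forward(output_w, output_b, hidden)

print(hidden, output)
```

In the graphical view, each row of a weight matrix corresponds to the arcs entering one node.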
Neural Networks
Neural Network Example
NN Node
NN Activation Functions
NN Learning
Neural Networks
NN Advantages
• Learning
• Can continue learning even after training set
has been applied.
• Easy parallelization
• Solves many problems
NN Disadvantages
• Difficult to understand
• May suffer from overfitting
• Structure of graph must be determined a
priori.
• Input values must be numeric.
• Verification difficult.
Genetic Algorithms
Crossover Examples
Genetic Algorithm
GA Advantages/Disadvantages
• Advantages
– Easily parallelized
• Disadvantages
– Difficult to understand and explain to end users.
– Abstraction of the problem and method to
represent individuals is quite difficult.
– Determining fitness function is difficult.
– Determining how to perform crossover and
mutation is difficult.
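To make the design choices named above concrete, here is one possible set of them, assuming individuals are represented as bit strings. The representation, the single-point crossover, the bit-flip mutation, and the count-the-ones fitness function are illustrative assumptions, not the method from the text.

```python
import random

def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length bit strings after position `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in individual
    )

def fitness(individual):
    """Hypothetical fitness function: the number of 1 bits."""
    return individual.count("1")

child_a, child_b = single_point_crossover("000000", "111111", 2)
rng = random.Random(0)  # seeded so repeated runs give the same mutation
mutant = mutate(child_a, 0.1, rng)

print(child_a, child_b, mutant, fitness(child_a))
```

Each of the three functions stands for one of the decisions listed as difficult; a real problem would need its own representation and fitness measure.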