Delivered by:
Nur Fatih
Outline
Most popular DM functionalities
Frequent Pattern Mining
Association
Classification and Prediction
Clustering
Learning Algorithms
Most Popular DM Functionalities
Frequent pattern and association
Classification and prediction (supervised learning)
Cluster analysis (unsupervised learning)
Outlier analysis
Mining for Frequent Item-sets
The Apriori Algorithm:
General Procedure:
Example:
Algorithm
Training Data -> used to construct -> Classifier (Model)
Classifier (Model) -> applied to -> Unseen Data

Training Data:
Name     Rank             Years   Tenured
Tom      Assistant Prof   2       No
Lisa     Associate Prof   7       No
George   Professor        5       Yes
Joseph   Assistant Prof   7       Yes

Unseen Data: (Jeff, Professor, 4)
Tenured? YES!
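To make the "construct a model, then apply it to unseen data" flow concrete, here is a minimal sketch in Python. The slides do not say which learning algorithm produced the classifier, so this sketch swaps in a simple 1-nearest-neighbour rule purely for illustration. The names `distance`, `predict_tenured`, and `RANK_MISMATCH_PENALTY` (and its value) are assumptions, not part of the original material.

```python
# Labeled rows copied from the slide's table: (name, rank, years, tenured).
TRAINING = [
    ("Tom", "Assistant Prof", 2, "No"),
    ("Lisa", "Associate Prof", 7, "No"),
    ("George", "Professor", 5, "Yes"),
    ("Joseph", "Assistant Prof", 7, "Yes"),
]

# Hypothetical weight for a rank mismatch, chosen only for illustration.
RANK_MISMATCH_PENALTY = 5


def distance(rank_a, years_a, rank_b, years_b):
    """Assumed dissimilarity: a penalty if ranks differ, plus the gap in years."""
    penalty = 0 if rank_a == rank_b else RANK_MISMATCH_PENALTY
    return penalty + abs(years_a - years_b)


def predict_tenured(rank, years, rows=TRAINING):
    """Return the label of the closest labeled row (1-nearest-neighbour)."""
    nearest = min(rows, key=lambda r: distance(rank, years, r[1], r[2]))
    return nearest[3]


# Unseen data from the slide: (Jeff, Professor, 4).
jeff_label = predict_tenured("Professor", 4)
print(jeff_label)
```

Under these assumptions the closest labeled row to Jeff is George (same rank, one year apart), so the sketch agrees with the slide's answer. A different distance or a different learning algorithm could of course behave differently on other inputs.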
Issues: Evaluating Classification Methods
Accuracy
Classifier accuracy: predicting class label
Predictor accuracy: guessing value of predicted attributes
Speed
Time to construct the model (training time)
Time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
Understanding and insight provided by the model
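The two accuracy notions in the list above can be sketched as follows. Classifier accuracy is taken here as the fraction of correctly predicted class labels, and predictor accuracy is illustrated with mean absolute error, which is one common choice among several. The function names and the sample predictions are hypothetical and serve only as an illustration.

```python
def classifier_accuracy(true_labels, predicted_labels):
    """Fraction of class labels that were predicted correctly."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def mean_absolute_error(true_values, predicted_values):
    """Average absolute gap between predicted and actual numeric values."""
    if not true_values or len(true_values) != len(predicted_values):
        raise ValueError("need two non-empty sequences of equal length")
    gaps = [abs(t - p) for t, p in zip(true_values, predicted_values)]
    return sum(gaps) / len(gaps)


# Hypothetical predictions, made up for this example.
accuracy = classifier_accuracy(
    ["No", "No", "Yes", "Yes"],   # actual class labels
    ["No", "Yes", "Yes", "Yes"],  # labels guessed by some classifier
)
error = mean_absolute_error([2, 7, 5, 7], [3, 6, 5, 9])
print(accuracy, error)
```

Speed, by contrast, would usually be measured by timing the training step and the prediction step separately, for example with `time.perf_counter()` around each call.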
Cluster Analysis
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
Unsupervised learning: no predefined classes
Typical applications
Stand-alone tool to get insight into data distribution
Preprocessing step for other algorithms
Clustering: Multidisciplinary Efforts
Pattern recognition
Spatial Data Analysis
Thematic maps in GIS by clustering feature spaces
Image processing
Economic science (market research)
WWW
Document classification
Examples of Clustering Application
Marketing: help marketers discover distinct groups
in their customer bases, and then use this
knowledge to develop targeted marketing
programs
Land use: identification of areas of similar land use
in an earth observation database
Insurance: identifying groups of bike insurance
policy holders with high average claim cost
…
Clustering Quality
Good clustering produces clusters with
High intra-class similarity
Low inter-class similarity
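One rough way to check the two quality criteria above is to compare average distances: small distances inside clusters suggest high intra-class similarity, and large distances between clusters suggest low inter-class similarity. The sketch below assumes Euclidean distance and uses hypothetical function names. It is one possible measure, not the definition used by the slides.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    gaps = [dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    if not gaps:
        raise ValueError("no cluster contains two or more points")
    return sum(gaps) / len(gaps)


def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    gaps = [
        dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    if not gaps:
        raise ValueError("need at least two non-empty clusters")
    return sum(gaps) / len(gaps)


# Made-up example: two tight groups that are far apart.
example_clusters = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
intra = mean_intra_cluster_distance(example_clusters)
inter = mean_inter_cluster_distance(example_clusters)
print(intra, inter)
```

In this example the intra-cluster average is far smaller than the inter-cluster average, which is the pattern a good clustering is expected to show.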
Given n elements x1, x2, … xn, and k clusters, each with a center.
1. Assign each element to its closest cluster center
2. After all assignments have been made, compute the cluster centroids for
each cluster
3. Repeat the above two steps with the new centroids until the algorithm
converges
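The three steps above describe the procedure commonly known as k-means. A minimal sketch follows, assuming Euclidean distance, caller-supplied initial centers, and convergence defined as "the centroids stop changing". The function name `k_means`, the `max_iterations` cap, and the choice to keep the old center when a cluster ends up empty are assumptions made for this illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def k_means(points, centers, max_iterations=100):
    """Alternate assignment and centroid update until centers stop moving."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        # Step 1: assign each element to its closest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)

        # Step 2: recompute each centroid as the mean of its members.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # assumed handling of an empty cluster

        # Step 3: stop once the centroids no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Made-up data: two groups of points and two rough initial centers.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
final_centers, final_clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centers)
```

The result depends on the initial centers, so in practice the procedure is often run several times from different starting points and the best clustering is kept.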
Classification vs. Clustering
A Different Perspective!!!