
DATA MINING FUNCTIONALITIES

Delivered by:
Nur Fatih
Outline
 Most popular DM functionalities
 Frequent Pattern Mining
 Association
 Classification and Prediction
 Clustering
 Learning Algorithms
Most Popular DM Functionalities
 Frequent pattern and association
 Classification and prediction (supervised learning)
 Cluster analysis (unsupervised learning)
 Outlier analysis
Mining for Frequent Item-sets
The Apriori Algorithm:

Given a minimum required support s as the interestingness criterion:

1. Search for all individual elements (1-element item-sets) that have a
minimum support of s
2. Repeat
   a. From the i-element item-sets found in the previous pass, search
for all (i+1)-element item-sets that have a minimum support of s
   b. This becomes the set of all interesting (i+1)-element item-sets
3. Until no candidate (i+1)-element item-set reaches the minimum
support
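A minimal Python sketch of this level-wise search, using the toy
school-supplies transactions from the example that follows (all names
here are illustrative, not from any particular library; candidate
generation is done without subset pruning, which is fine for a sketch):

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, minsup):
    """Level-wise search for all item-sets with support >= minsup."""
    items = {i for t in transactions for i in t}
    # 1. Individual elements (1-element item-sets) with support >= minsup
    level = {frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= minsup}
    frequent = []
    # 2. Repeat: grow i-element item-sets into (i+1)-element item-sets
    while level:
        frequent.append(level)
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = {c for c in candidates
                 if support(c, transactions) >= minsup}
    # 3. Until no candidate item-set reaches the minimum support
    return frequent

transactions = [{"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
                {"Bag", "Uniform", "Pencil"}, {"Bag", "Pencil", "Books"},
                {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
                {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
                {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"}]
for level in apriori(transactions, minsup=0.3):
    print(sorted(map(sorted, level)))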
Mining for Frequent Item-sets
The Apriori Algorithm: (Example)
Let minimum support = 0.3

Transactions:

  Bag, Uniform, Crayons
  Books, Bag, Uniform
  Bag, Uniform, Pencil
  Bag, Pencil, Books
  Uniform, Crayons, Bag
  Bag, Pencil, Books
  Crayons, Uniform, Bag
  Books, Crayons, Bag
  Uniform, Crayons, Pencil
  Pencil, Uniform, Books

Interesting 1-element item-sets:
  {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}

Interesting 2-element item-sets:
  {Bag, Uniform}, {Bag, Crayons}, {Bag, Pencil}, {Bag, Books},
  {Uniform, Crayons}, {Uniform, Pencil}, {Pencil, Books}
Mining for Frequent Item-sets
The Apriori Algorithm: (Example)
Let minimum support = 0.3

Interesting 3-element item-sets:
  {Bag, Uniform, Crayons}

(Exactly three of the ten transactions above contain Bag, Uniform, and
Crayons together: 3/10 = 0.3, so this is the only 3-element item-set
that reaches the minimum support.)
Association Concepts
Associations and Item-sets:

 An association is a rule of the form: if X then Y.
 It is denoted as X → Y

Example:
If Indonesia wins in the World Cup, sales of jerseys go up.

 For a rule X → Y, if the reverse rule Y → X also holds, then X and Y
together are called an "interesting item-set".

Example:
People buying school uniforms in June also buy school bags
(people buying school bags in June also buy school uniforms).
Association Concepts (cont.)
Support and Confidence:

 The support for a rule X → Y is the fraction of all transactions in
which X and Y occur together.

 The confidence of a rule X → Y is the fraction of the transactions
containing X that also contain Y, i.e. support(X ∪ Y) / support(X).
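In code, both measures reduce to counting transactions. A small sketch
(the transactions list of item-sets is assumed to be defined as in the
apriori sketch above):

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# e.g. for the rule Bag -> Uniform on the toy data:
# support({"Bag", "Uniform"}, transactions)      -> 0.5
# confidence({"Bag"}, {"Uniform"}, transactions) -> 0.625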
Association Concepts (cont.)
Support and Confidence:
Support for Bag → Uniform:
Bag and Uniform appear together in 5 of the 10 transactions, so
support = 5/10 = 0.5

Confidence for Bag → Uniform:
Bag appears in 8 transactions, of which 5 also contain Uniform, so
confidence = 5/8 = 0.625
Mining for Association Rules

 Association rules are of the form A → B

 Such rules are directional: A → B and B → A are, in general,
different rules with different confidence values.

 Association rule mining requires two thresholds: minsup (minimum
support) and minconf (minimum confidence)
Mining for Association Rules
Mining association rules using apriori

General Procedure:

1. Use apriori to generate frequent itemsets of different sizes
2. At each iteration, divide each frequent itemset X into two parts,
LHS and RHS; this represents a rule of the form LHS → RHS
3. The confidence of such a rule is support(X) / support(LHS)
4. Discard all rules whose confidence is less than minconf (see the
sketch below)
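A minimal sketch of steps 2 through 4, assuming the support helper and
transactions list defined in the earlier sketches:

from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    """Split a frequent itemset into every LHS -> RHS rule meeting minconf."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):            # every non-trivial split
        for lhs in map(frozenset, combinations(itemset, size)):
            conf = support(itemset, transactions) / support(lhs, transactions)
            if conf >= minconf:                    # step 4: keep confident rules
                rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

# e.g. rules_from_itemset({"Bag", "Uniform", "Crayons"}, transactions, 0.7)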
Mining for Association Rules
Mining association rules using apriori

Example:

The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3.

It can be divided into the following rules:
  {Bag} → {Uniform, Crayons}
  {Bag, Uniform} → {Crayons}
  {Bag, Crayons} → {Uniform}
  {Uniform} → {Bag, Crayons}
  {Uniform, Crayons} → {Bag}
  {Crayons} → {Bag, Uniform}
Mining for Association Rules
Mining association rules using apriori

Confidence for these rules is as follows:

  {Bag} → {Uniform, Crayons}       0.3/0.8 = 0.375
  {Bag, Uniform} → {Crayons}       0.3/0.5 = 0.6
  {Bag, Crayons} → {Uniform}       0.3/0.4 = 0.75
  {Uniform} → {Bag, Crayons}       0.3/0.7 ≈ 0.429
  {Uniform, Crayons} → {Bag}       0.3/0.4 = 0.75
  {Crayons} → {Bag, Uniform}       0.3/0.5 = 0.6

If minconf is 0.7, then we have discovered the following rules…
Mining for Association Rules
Mining association rules using apriori

With minconf = 0.7, two rules are discovered:

People who buy a school bag and a set of crayons are likely to buy
school uniform.

People who buy school uniform and a set of crayons are likely to buy a
school bag.

(The rule {Crayons} → {Bag, Uniform}, at confidence 0.6, falls below
minconf and is discarded.)
Generalized Association Rules
Since customers can buy any number of items in one transaction,
the transaction relation would be in the form of a list of individual
purchases.

  Bill No.   Date         Item
  15563      23.10.2003   Books
  15563      23.10.2003   Crayons
  15564      23.10.2003   Uniform
  15564      23.10.2003   Crayons
Generalized Association Rules
A transaction for the purposes of data mining is obtained by
performing a GROUP BY of the table over various fields.

Generalized Association Rules
A GROUP BY over Bill No. would show frequent buying patterns
across different customers.
A GROUP BY over Date would show frequent buying patterns
across different days.

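Outside SQL, the same grouping can be done in a few lines of Python; a
sketch over the purchase rows shown in the table above:

from collections import defaultdict

rows = [  # (bill_no, date, item) tuples from the purchase table
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]

baskets = defaultdict(set)
for bill_no, date, item in rows:
    baskets[bill_no].add(item)   # GROUP BY Bill No.; key on date instead
                                 # to find patterns across different days

transactions = list(baskets.values())
# -> [{'Books', 'Crayons'}, {'Uniform', 'Crayons'}]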
Classification & Prediction
 Classification and Prediction
 Predict categorical class labels (discrete or nominal)
 Classifies data based on the training set and the values (class
labels) in a classifying attribute, and uses the model to classify
unseen data
 Typical applications
 Credit approval
 Target marketing
 Medical diagnosis
 Fraud detection
Classification-a Two Steps Process
 Model construction: describing a set of predetermined
classes
 Each tuple/sample is assumed to belong to a predefined class
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
▪ The known label of each test sample is compared with the model's
classification result
▪ The accuracy rate is the percentage of test-set samples that are
correctly classified by the model (see the sketch after this list)
▪ Test set is independent of training set, otherwise overfitting will occur
 If the accuracy is acceptable, use the model to classify the
unseen tuple/sample
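The accuracy estimate in step two is a single counting loop. A minimal
sketch, assuming the model is a callable and the test set is a list of
(sample, label) pairs:

def accuracy(model, test_set):
    """Percentage of test samples whose predicted label matches the known label."""
    correct = sum(model(sample) == label for sample, label in test_set)
    return 100.0 * correct / len(test_set)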
Process (1): Model Construction

Training data is fed to an algorithm that constructs the classifier
(model):

  Name    Rank             Years   Tenured
  Mike    Assistant Prof   3       No
  Mary    Assistant Prof   7       Yes
  Bill    Professor        2       Yes
  Jim     Associate Prof   7       Yes
  Dave    Assistant Prof   6       No
  Anne    Associate Prof   3       No

Classifier (Model):
  IF rank = 'Professor' OR years > 6
  THEN tenured = 'yes'
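The constructed model is just a rule over two attributes; expressed as
a Python function (attribute names follow the training table above):

def tenured(rank, years):
    """Classifier learned from the training data."""
    return "yes" if rank == "Professor" or years > 6 else "no"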
Process (2): Using the Model in Prediction

The classifier (model) is applied first to test data with known labels,
then to unseen data:

  Name     Rank             Years   Tenured
  Tom      Assistant Prof   2       No
  Lisa     Associate Prof   7       No
  George   Professor        5       Yes
  Joseph   Assistant Prof   7       Yes

Unseen data: (Jeff, Professor, 4)
Tenured?  YES!
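Applying the rule function sketched above reproduces this prediction,
and the accuracy helper from earlier scores it on the test samples
(labels lower-cased here to match the function's output):

test_set = [(("Assistant Prof", 2), "no"), (("Associate Prof", 7), "no"),
            (("Professor", 5), "yes"), (("Assistant Prof", 7), "yes")]
print(accuracy(lambda s: tenured(*s), test_set))  # -> 75.0 (Lisa is misclassified)
print(tenured("Professor", 4))                    # -> 'yes' for (Jeff, Professor, 4)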
Issues: Evaluating Classification Methods
 Accuracy
 Classifier accuracy: predicting class label
 Predictor accuracy: guessing value of predicted attributes
 Speed
 Time to construct the model (training time)
 Time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency for disk-resident databases
 Interpretability
 Understanding and insight provided by the model
Cluster Analysis
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Clustering: Multidisciplinary Efforts
 Pattern recognition
 Spatial Data Analysis
 Thematic maps in GIS by clustering feature spaces
 Image processing
 Economic science (market research)
 WWW
 Document classification
Examples of Clustering Application
 Marketing: help marketers discover distinct groups
in their customer bases, and then use this
knowledge to develop targeted marketing
programs
 Land use: identification of areas of similar land use
in an earth observation database
 Insurance: identifying groups of bike insurance
policy holders with high average claim cost
 …
Clustering Quality
 Good clustering produces clusters with
 High intra-class similarity
 Low inter-class similarity

 Quality depends on the similarity measure used by the
clustering algorithm and on the algorithm itself
Clustering Algorithms
 There are many approaches to developing clustering
algorithms
 Well known approaches
 Partitioning approach (instance based learning)
▪ Construct various partitions and evaluate them by some criterion,
e.g. k-means.
 Hierarchical clustering
▪ Create a hierarchical decomposition of the set of data using
some criterion
▪ Two main groups
▪ Agglomerative (bottom-up)
▪ Divisive (top-down)
Example

Iterative partitional clustering (e.g. k-means):

Given n elements x1, x2, …, xn and k clusters, each with a center:
1. Assign each element to its closest cluster center
2. After all assignments have been made, recompute the centroid of
each cluster
3. Repeat the two steps above with the new centroids until the
algorithm converges (sketched below)
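A compact sketch of this loop for points given as coordinate tuples
(plain k-means; the initial centers are assumed to be supplied):

def kmeans(points, centers, max_iters=100):
    """Iterative partitional clustering: assign, recompute centroids, repeat."""
    for _ in range(max_iters):
        # 1. Assign each element to its closest cluster center
        clusters = [[] for _ in centers]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # 2. Recompute the centroid of each cluster (keep old center if empty)
        new_centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else c
                       for cl, c in zip(clusters, centers)]
        # 3. Repeat with the new centroids until the algorithm converges
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers

# e.g. kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], centers=[(0, 0), (10, 10)])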
Classification vs. Clustering

A Different Perspective!

Given a set of data elements:

 Classification maps each data element to one of a set of
pre-determined classes, based on the differences among data elements
belonging to different classes

 Clustering groups data elements into different groups based on the
similarity between elements within a single group
Learning Algorithms
 Two major streams
 Supervised Learning → Classification
 The training data is accompanied by labels indicating the
class of the observations
 New data is classified based on the training set

 Unsupervised Learning → Clustering
 The class labels of the training data are unknown
 Given a set of training data, find the existence of classes or
clusters in the data
Example of Learning Algorithms
 Supervised Learning
 Decision Tree
▪ ID3, C4.5 → information gain
▪ CART → Gini index
 Probabilistic and Statistical Method
▪ Bayesian Classifier
▪ Naïve Bayesian
▪ Bayesian Belief Network
 Neural Network
 …
 Unsupervised Learning
 k-means
 Unsupervised Bayesian Learning
 Competitive learning
 …
