
DISCRETIZATION AND CONCEPT HIERARCHY GENERATION
Discretization:
Types of attributes:
Nominal: values from an unordered set, e.g., color, profession
Ordinal: values from an ordered set, e.g., military or academic rank
Continuous: numeric values, e.g., integer or real numbers
Discretization:
Divide the range of a continuous attribute into
intervals
Reduce data size by discretization
Discretization and Concept Hierarchy:
Discretization
Reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals
Interval labels can then be used to replace actual
data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an
attribute
Concept hierarchy:
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)


Detail lost
More meaningful
Easier to interpret
Mining becomes easier
Several concept hierarchies can be defined for the
same attribute
Can be specified manually or be implicit in the database schema
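
As a minimal sketch of replacing low-level values with higher-level concepts, the function below maps numeric ages to the labels young, middle-aged, and senior. The cut-off ages (30 and 60), the function name, and the sample ages are assumptions chosen for illustration; the notes do not specify them.

```python
def age_concept(age, young_max=30, middle_max=60):
    """Map a numeric age to a higher-level concept label.

    young_max and middle_max are illustrative cut-offs, not values
    taken from the notes.
    """
    if age < young_max:
        return "young"
    if age < middle_max:
        return "middle-aged"
    return "senior"


ages = [23, 35, 47, 61, 18, 72]
labels = [age_concept(a) for a in ages]
print(labels)
```

Six distinct ages collapse to three labels: detail is lost, but the result is easier to interpret.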

Discretization and Concept Hierarchy Generation for Numeric Data:
Typical methods:
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
χ² merging
Segmentation by natural partitioning
All the methods can be applied recursively
Techniques:
Binning
Distribute values into bins
Replace by bin mean / median
Recursive application leads to concept
hierarchies
Unsupervised technique
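
A minimal sketch of binning, assuming equal-depth (equal-frequency) bins and smoothing by bin mean or bin median. The helper names and the price list are illustrative assumptions, not part of the notes.

```python
from statistics import mean, median


def equal_depth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth`
    items each (the last bin may be shorter)."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth(bins, summary=mean):
    """Replace every value in a bin by the bin's summary (mean or median)."""
    return [[summary(b)] * len(b) for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, depth=3)
by_mean = smooth(bins, mean)
by_median = smooth(bins, median)
print(bins)
print(by_mean)
print(by_median)
```

Applying the same functions again to the bin summaries would give the next, coarser level of a concept hierarchy.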
Histogram Analysis
Partition the data distribution into buckets
Equi-width, e.g., (0, 100], (100, 200], ...
Equi-depth
Can be applied recursively, down to a minimum interval size

Unsupervised
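
A minimal sketch of an equi-width histogram, assuming half-open intervals of the form (lo, hi] with a width of 100. The function name and the sample values are assumptions for illustration.

```python
import math
from collections import Counter


def equiwidth_interval(value, width=100):
    """Return (lo, hi) for the half-open interval (lo, hi] of the given
    width that contains `value`."""
    hi = math.ceil(value / width) * width
    return (hi - width, hi)


values = [5, 40, 100, 101, 150, 199, 200, 260]
histogram = Counter(equiwidth_interval(v) for v in values)
for (lo, hi), count in sorted(histogram.items()):
    print(f"({lo}, {hi}]: {count}")
```

Because the intervals are closed on the right, the value 100 falls in (0, 100] and 200 falls in (100, 200].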
Cluster Analysis
Clusters form nodes of concept hierarchy
Clusters can be decomposed into sub-clusters (lower level of the hierarchy)
Clusters can be combined into larger clusters (higher level of the hierarchy)
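
A minimal sketch of how clusters can form two levels of a hierarchy. The notes do not name a clustering algorithm, so this assumes a simple one-dimensional approach: sort the values and start a new cluster wherever the gap to the previous value exceeds a threshold. The function name, the thresholds, and the data are illustrative assumptions.

```python
def gap_clusters(values, max_gap):
    """Group sorted values into clusters; a new cluster starts whenever
    the gap to the previous value exceeds max_gap.
    Assumes `values` is non-empty."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > max_gap:
            clusters.append([cur])
        else:
            clusters[-1].append(cur)
    return clusters


values = [1, 2, 3, 10, 11, 12, 50, 52, 54]
higher_level = gap_clusters(values, max_gap=20)  # coarse clusters
lower_level = gap_clusters(values, max_gap=5)    # finer sub-clusters
print(higher_level)
print(lower_level)
```

The coarse clusters act as higher-level nodes, and the finer clusters obtained with the smaller threshold are their children.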

Entropy-Based Discretization:
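Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information requirement after partitioning is

$$I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

$$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

where p_i is the probability of class i in S1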


The boundary that minimizes the expected
information requirement over all possible boundaries
is selected as a binary discretization
The process is recursively applied to partitions
obtained until some stopping criterion is met
Reduces data size
Class information is considered (supervised)
Can improve classification accuracy
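
A minimal sketch of one binary split step. The expected information of a boundary T is taken as the size-weighted sum of the class entropies of the two resulting intervals, and the boundary with the smallest value is chosen. Using midpoints between consecutive distinct values as candidate boundaries is an assumption, as are the function names and the sample data.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(samples):
    """samples: list of (value, class_label) pairs.
    Return (T, expected_info) for the boundary with minimal expected
    information, or None if all values are identical."""
    ordered = sorted(samples)
    n = len(ordered)
    best = None
    for i in range(1, n):
        left_v, right_v = ordered[i - 1][0], ordered[i][0]
        if left_v == right_v:
            continue  # no boundary between equal values
        t = (left_v + right_v) / 2
        s1 = [c for _, c in ordered[:i]]
        s2 = [c for _, c in ordered[i:]]
        info = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if best is None or info < best[1]:
            best = (t, info)
    return best


samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
boundary, info = best_split(samples)
print(boundary, info)
```

Calling `best_split` again on the samples on each side of the boundary gives the recursive version; a stopping criterion, such as a minimum information gain, would have to be added.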
Interval Merging by χ² Analysis:
ChiMerge
Bottom-up approach: find the best neighbouring intervals and merge them to form larger intervals
Supervised
If two adjacent intervals have a similar distribution of classes, they can be merged


Initially each distinct value is in a separate interval
χ² tests are performed for every pair of adjacent intervals
The adjacent pair with the lowest χ² value is merged
The process is repeated
Stopping condition: a χ² threshold or a target number of intervals
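
A minimal sketch of ChiMerge under stated assumptions: each interval is represented by its lower bound and a count of class labels, the Pearson χ² statistic is computed for each adjacent pair, and the pair with the lowest statistic is merged until that statistic exceeds a threshold or a minimum number of intervals is reached. The function names, the threshold of 2.706 (the 90% critical value for one degree of freedom), and the data are illustrative choices.

```python
from collections import Counter


def chi2(a, b):
    """Pearson chi-square statistic for the class counts (Counters) of
    two adjacent intervals."""
    classes = set(a) | set(b)
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:
                stat += (row[c] - expected) ** 2 / expected
    return stat


def chimerge(samples, threshold, min_intervals=1):
    """samples: list of (value, class_label). Return the lower bounds of
    the intervals that remain after merging."""
    counts = {}
    for value, label in samples:
        counts.setdefault(value, Counter())[label] += 1
    # Start with one interval per distinct value.
    intervals = [(v, counts[v]) for v in sorted(counts)]
    while len(intervals) > min_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] > threshold:
            break  # even the most similar neighbours differ significantly
        merged = intervals[i][1] + intervals[i + 1][1]
        intervals[i:i + 2] = [(intervals[i][0], merged)]
    return [lo for lo, _ in intervals]


samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
bounds = chimerge(samples, threshold=2.706)
print(bounds)
```

Adjacent intervals with the same class distribution score 0 and are merged first. The two intervals that remain hold different classes, so their statistic is above the threshold and merging stops.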

Segmentation by Natural Partitioning:

A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals.

If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 intervals (equi-width for 3, 6, and 9; grouped 2-3-2 for 7)

If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equi-width intervals

If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equi-width intervals
Outliers could be present
Consider only the majority of values, e.g., from the 5th percentile to the 95th percentile
Example of 3-4-5 Rule
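
A minimal sketch of one level of the 3-4-5 rule, assuming the range endpoints have already been rounded at the most significant digit. The function name and the sample ranges are assumptions; the 2-3-2 grouping for 7 distinct values follows the common textbook statement of the rule.

```python
import math


def intervals_345(low, high):
    """Partition (low, high] into 3, 4, or 5 intervals by the 3-4-5 rule.
    Assumes high > low and that both are already rounded at the most
    significant digit of the range."""
    msd = 10 ** math.floor(math.log10(high - low))
    distinct = round((high - low) / msd)
    if distinct == 7:
        weights = [2, 3, 2]          # 3 intervals grouped 2-3-2
    elif distinct in (3, 6, 9):
        weights = [1, 1, 1]          # 3 equi-width intervals
    elif distinct in (2, 4, 8):
        weights = [1, 1, 1, 1]       # 4 equi-width intervals
    else:                            # 1, 5, or 10
        weights = [1, 1, 1, 1, 1]    # 5 equi-width intervals
    unit = (high - low) / sum(weights)
    edges = [low]
    for w in weights:
        edges.append(edges[-1] + w * unit)
    return list(zip(edges, edges[1:]))


top_level = intervals_345(-1_000_000, 2_000_000)
seven = intervals_345(0, 700)
print(top_level)
print(seven)
```

Applying the function again to each resulting interval would produce the next level of the hierarchy.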
Concept Hierarchy Generation for Categorical Data:
Specification of a partial ordering of attributes
explicitly at the schema level by users or experts
User / Expert defines hierarchy
Street < city < state < country
Specification of a portion of a hierarchy by explicit
data grouping
Manual
Intermediate-level information is specified, e.g., Industrial, Agricultural..

Specification of a set of attributes but not their partial ordering
The hierarchy is inferred automatically
Heuristic rule: higher-level concepts contain a smaller number of distinct values
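
A minimal sketch of the heuristic for inferring an ordering: count the distinct values of each attribute and place the attribute with the fewest distinct values at the top of the hierarchy. The function name and the address records are made up for illustration.

```python
def infer_hierarchy(records, attributes):
    """Order attributes from highest level (fewest distinct values)
    to lowest level (most distinct values)."""
    distinct = {a: len({r[a] for r in records}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a])


# Illustrative records only.
records = [
    {"street": "1 Main St", "city": "Springfield", "state": "IL", "country": "USA"},
    {"street": "2 Oak Ave", "city": "Springfield", "state": "IL", "country": "USA"},
    {"street": "3 Pine Rd", "city": "Chicago", "state": "IL", "country": "USA"},
    {"street": "4 Elm St", "city": "Austin", "state": "TX", "country": "USA"},
    {"street": "5 Lake Dr", "city": "Austin", "state": "TX", "country": "USA"},
]
order = infer_hierarchy(records, ["city", "street", "country", "state"])
print(" > ".join(order))
```

The heuristic is not foolproof: an attribute can have few distinct values without being a higher-level concept, so the inferred ordering may need review by a user or expert.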
Specification of only a partial set of attributes
Embedding data semantics
Attributes with tight semantic connections are
pinned together
