
Data Reduction

Purpose
Obtain a reduced representation of the dataset that is much
smaller in volume, yet closely maintains the integrity of the
original data

Strategies
Data cube aggregation
Aggregation operations are applied to construct a data cube (illustrated in the
first sketch after this list)

Attribute subset selection


Irrelevant, weakly relevant, or redundant attributes are detected and
removed

Data compression
Data encoding or transformations are applied so as to obtain a reduced or
compressed representation of the original data

Numerosity reduction
Data are replaced or estimated by alternative, smaller data representations
(e.g. models)

Discretization and concept hierarchy generation

Attribute values are replaced by ranges or higher conceptual levels (illustrated
in the second sketch after this list)
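As a rough illustration of the first strategy, the sketch below builds one cuboid
of a data cube with pandas (a tool choice of ours; the slides prescribe none),
aggregating hypothetical transaction-level sales up to (year, branch) cells:

import pandas as pd

# Hypothetical transaction-level sales records with two dimensions.
sales = pd.DataFrame({
    "year":   [2022, 2022, 2023, 2023, 2023],
    "branch": ["A", "B", "A", "A", "B"],
    "amount": [100, 150, 120, 80, 200],
})

# One cuboid of the data cube: total sales per (year, branch) cell.
# The aggregated table is much smaller than the raw transaction data.
cube = sales.groupby(["year", "branch"], as_index=False)["amount"].sum()
print(cube)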
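And a minimal sketch of discretization, again with pandas: exact ages are
replaced by labeled ranges (the bin edges and labels are illustrative
assumptions, forming one level of a concept hierarchy):

import pandas as pd

ages = pd.Series([5, 19, 23, 37, 41, 58, 72])

# Replace exact values with ranges; the labels are a higher conceptual level.
groups = pd.cut(ages, bins=[0, 18, 40, 65, 120],
                labels=["child", "young adult", "middle-aged", "senior"])
print(groups.value_counts())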

Stepwise Selection

Stepwise Forward (Example)


Start with an empty reduced set
The best attribute is selected first and added to the reduced set
At each subsequent step, the best of the remaining attributes is selected and
added to the reduced set (conditioning on the attributes that are already in the
set)
Stepwise Backward (Example)
Start with the full set of attributes
At each step, the worst of the attributes in the set is removed
Combination of Forward and Backward
At each step, the procedure selects the best attribute and adds it to the set, and
removes the worst attribute from the set
An attribute that looked good in earlier selections may no longer be useful
after other attributes have been included in the set
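A minimal sketch of stepwise forward selection in Python. The score callable is
a hypothetical placeholder for whatever subset-quality measure is in use (e.g.
cross-validated accuracy of a model trained on those attributes):

def forward_select(attributes, score, k):
    """Greedy stepwise forward selection of k attributes."""
    selected = []                  # start with an empty reduced set
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # Best of the remaining attributes, conditioned on those
        # already in the reduced set.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

Backward elimination is symmetric: start from the full set and repeatedly drop
the attribute whose removal hurts the score least.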

Decision Tree Induction
Decision Tree
A model in the form of a tree structure
Decision nodes
Each denotes a test on the corresponding attribute, which is the best
attribute for partitioning the data in terms of class distributions at that point
Each branch corresponds to an outcome of the test

Leaf nodes
Each denotes a class prediction

Can be used for attribute selection
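A sketch of tree-based attribute selection using scikit-learn (a library choice
of ours, not named in the slides): attributes that never appear as a test in
the induced tree get zero importance and are candidates for removal.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# Importance 0.0 means the attribute is never used to partition the data,
# so it would not enter the reduced attribute set.
for name, imp in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")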

Stepwise and Decision Tree Methods for Attribute Selection



Data Compression

Purpose
Apply data encoding or transformations to obtain a reduced or compressed
representation of the original data

Lossless Compression
The original data can be reconstructed from the compressed data without any
loss of information
e.g. some well-tuned algorithms for string compression

Lossy Compression
Only an approximation of the original data can be reconstructed from the
compressed data
e.g. wavelet transforms and principal component analysis (PCA)
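A minimal illustration of the lossless property with Python's standard zlib
module (our choice of compressor): the round trip reconstructs the original
bytes exactly.

import zlib

original = b"aaaaabbbbbccccc" * 200     # highly repetitive string

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

assert restored == original             # no loss of information
print(len(original), "->", len(compressed), "bytes")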

Numerosity Reduction

Purpose
Reduce data volume by choosing alternative, smaller data representations

Parametric Methods
A model is used to fit the data; store only the model parameters, not the
original data (except possibly outliers)
e.g. regression models and log-linear models

Non-Parametric Methods
Do not use models to fit the data

Histograms
Use binning to approximate data distributions (illustrated in the sketch after
this list)
A histogram of attribute A partitions the data distribution of A into disjoint
subsets, or buckets

Clustering
Use a cluster representation of the data to replace the actual data

Sampling

Represent the original data by a much smaller sample (subset) of the data
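A sketch of histogram-based reduction with NumPy (a tool choice of ours): 20
bucket edges and counts stand in for 10,000 raw values. Clustering is
analogous, keeping e.g. cluster centroids instead of the raw records.

import numpy as np

values = np.random.default_rng(0).normal(loc=50, scale=10, size=10_000)

# Equal-width histogram: store only bucket boundaries and counts.
counts, edges = np.histogram(values, bins=20)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:6.1f}, {hi:6.1f}): {c}")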

Sampling

Simple Random Sampling without Replacement (SRSWOR)
Draw s of the N records from dataset D (s < N), such that no record can be
drawn more than once

Simple Random Sampling with Replacement (SRSWR)
Each time a record is drawn from D, it is recorded and then placed back into D,
so it may be drawn more than once

Cluster Sampling
Records in D are first divided into groups or clusters, and a random sample of
these clusters is then selected (all records in the selected clusters are
included in the sample)

Stratified Sampling
Records in D are divided into subgroups (or strata), and random sampling
techniques are then used to select sample members from each stratum
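Sketches of three of these schemes with pandas (our tool choice; the dataset,
strata, and sample sizes are made up for illustration):

import pandas as pd

D = pd.DataFrame({"id": range(100), "stratum": ["X"] * 70 + ["Y"] * 30})
s = 10

# SRSWOR: each record can be drawn at most once.
srswor = D.sample(n=s, replace=False, random_state=42)

# SRSWR: a drawn record is placed back, so it may repeat.
srswr = D.sample(n=s, replace=True, random_state=42)

# Stratified sampling: draw 10% from each stratum separately, so the
# sample preserves the 70/30 composition of D.
stratified = D.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=42))
print(stratified["stratum"].value_counts())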

[Figure: stratified sampling and cluster sampling applied to the raw data]

Principal Component Analysis

Overview

Suppose a dataset has n attributes. PCA searches for k (k < n)
n-dimensional orthogonal vectors that can best be used to represent the data
Works for numeric data only
Used when the number of attributes is large
PCA can reveal relationships that were not previously expected

See pages 97-98:

cor(X, Y) = cov(X, Y) / [sd(X) sd(Y)]

Reference: Eigenvalues and Eigenvectors - HMC Calculus Tutorial
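A minimal PCA sketch in NumPy tying together the correlation formula and the
eigenvector reference above: standardize the data, eigendecompose the
correlation matrix, and keep the k eigenvectors with the largest eigenvalues.
The synthetic dataset is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # 200 records, n = 3 attributes
X[:, 2] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=200)  # correlated attribute

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each attribute
corr = np.corrcoef(X, rowvar=False)       # cor(X,Y) = cov(X,Y)/[sd(X) sd(Y)]
eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]         # sort components by variance
k = 2
W = eigvecs[:, order[:k]]                 # n x k orthogonal projection
X_reduced = Z @ W                         # 200 x k reduced representation
print(eigvals[order] / eigvals.sum())     # variance fraction per component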

