
Data Reduction

Purpose
Obtain a reduced representation of the dataset that is much
smaller in volume, yet closely maintains the integrity of the
original data

Strategies
Data cube aggregation
Aggregation operations are applied to construct a data cube (illustrated in the
first sketch after this list)

Attribute subset selection


Irrelevant, weakly relevant, or redundant attributes are detected and
removed

Data compression
Data encoding or transformations are applied so as to obtain a reduced or
compressed representation of the original data

Numerosity reduction
Data are replaced or estimated by alternative, smaller data representations
(e.g. models)

Discretization and concept hierarchy generation

Attribute values are replaced by ranges or higher conceptual levels (illustrated
in the second sketch after this list)
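As a rough illustration of the first strategy, the sketch below builds one cuboid
of a data cube with pandas (a tool choice of ours; the slides prescribe none),
aggregating hypothetical transaction-level sales up to (year, branch) cells:

import pandas as pd

# Hypothetical transaction-level sales records with two dimensions.
sales = pd.DataFrame({
    "year":   [2022, 2022, 2023, 2023, 2023],
    "branch": ["A", "B", "A", "A", "B"],
    "amount": [100, 150, 120, 80, 200],
})

# One cuboid of the data cube: total sales per (year, branch) cell.
# The aggregated table is much smaller than the raw transaction data.
cube = sales.groupby(["year", "branch"], as_index=False)["amount"].sum()
print(cube)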
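And a minimal sketch of discretization, again with pandas: exact ages are
replaced by labeled ranges (the bin edges and labels are illustrative
assumptions, forming one level of a concept hierarchy):

import pandas as pd

ages = pd.Series([5, 19, 23, 37, 41, 58, 72])

# Replace exact values with ranges; the labels are a higher conceptual level.
groups = pd.cut(ages, bins=[0, 18, 40, 65, 120],
                labels=["child", "young adult", "middle-aged", "senior"])
print(groups.value_counts())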

Stepwise Selection

Stepwise Forward (Example)


Start with an empty reduced set
The best attribute is selected first and added to the reduced set
At each subsequent step, the best of the remaining attributes is selected and
added to the reduced set (conditioning on the attributes that are already in the
set)
Stepwise Backward (Example)
Start with the full set of attributes
At each step, the worst of the attributes in the set is removed
Combination of Forward and Backward
At each step, the procedure selects the best attribute and adds it to the set, and
removes the worst attribute from the set
An attribute that looked good in earlier selections may no longer be useful
after other attributes have been included in the set
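A minimal sketch of stepwise forward selection in Python. The score callable is
a hypothetical placeholder for whatever subset-quality measure is in use (e.g.
cross-validated accuracy of a model trained on those attributes):

def forward_select(attributes, score, k):
    """Greedy stepwise forward selection of k attributes."""
    selected = []                  # start with an empty reduced set
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # Best of the remaining attributes, conditioned on those
        # already in the reduced set.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

Backward elimination is symmetric: start from the full set and repeatedly drop
the attribute whose removal hurts the score least.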

Decision Tree Induction
Decision Tree
A model in the form of a tree structure
Decision nodes
Each denotes a test on the corresponding attribute, which is the best
attribute for partitioning the data in terms of class distributions at that point
Each branch corresponds to an outcome of the test

Leaf nodes
Each denotes a class prediction

Can be used for attribute selection
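A sketch of tree-based attribute selection using scikit-learn (a library choice
of ours, not named in the slides): attributes that never appear as a test in
the induced tree get zero importance and are candidates for removal.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# Importance 0.0 means the attribute is never used to partition the data,
# so it would not enter the reduced attribute set.
for name, imp in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")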

Stepwise and Decision Tree Methods for Attribute Selection



Data Compression

Purpose
Apply data encoding or transformations to obtain a reduced or compressed
representation of the original data

Lossless Compression
The original data can be reconstructed from the compressed data without any
loss of information
e.g. some well-tuned algorithms for string compression

Lossy Compression
Only an approximation of the original data can be reconstructed from the
compressed data
e.g. wavelet transforms and principal component analysis (PCA)
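A minimal illustration of the lossless property with Python's standard zlib
module (our choice of compressor): the round trip reconstructs the original
bytes exactly.

import zlib

original = b"aaaaabbbbbccccc" * 200     # highly repetitive string

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

assert restored == original             # no loss of information
print(len(original), "->", len(compressed), "bytes")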

Numerosity Reduction

Purpose
Reduce data volume by choosing alternative, smaller data representations

Parametric Methods
A model is used to fit the data; store only the model parameters, not the
original data (except possibly outliers)
e.g. regression models and log-linear models

Non-Parametric Methods
Do not use models to fit the data

Histograms
Use binning to approximate data distributions (illustrated in the sketch after
this list)
A histogram of attribute A partitions the data distribution of A into disjoint
subsets, or buckets

Clustering
Use a cluster representation of the data to replace the actual data

Sampling

Represent the original data by a much smaller sample (subset) of the data
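A sketch of histogram-based reduction with NumPy (a tool choice of ours): 20
bucket edges and counts stand in for 10,000 raw values. Clustering is
analogous, keeping e.g. cluster centroids instead of the raw records.

import numpy as np

values = np.random.default_rng(0).normal(loc=50, scale=10, size=10_000)

# Equal-width histogram: store only bucket boundaries and counts.
counts, edges = np.histogram(values, bins=20)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:6.1f}, {hi:6.1f}): {c}")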

Sampling

Simple Random Sampling without Replacement (SRSWOR)
Draw s of the N records from dataset D (s < N), such that no record can be
drawn more than once

Simple Random Sampling with Replacement (SRSWR)
Each time a record is drawn from D, it is recorded and then placed back into D,
so it may be drawn more than once

Cluster Sampling
Records in D are first divided into groups or clusters, and a random sample of
these clusters is then selected (all records in the selected clusters are
included in the sample)

Stratified Sampling
Records in D are divided into subgroups (or strata), and random sampling
techniques are then used to select sample members from each stratum
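Sketches of three of these schemes with pandas (our tool choice; the dataset,
strata, and sample sizes are made up for illustration):

import pandas as pd

D = pd.DataFrame({"id": range(100), "stratum": ["X"] * 70 + ["Y"] * 30})
s = 10

# SRSWOR: each record can be drawn at most once.
srswor = D.sample(n=s, replace=False, random_state=42)

# SRSWR: a drawn record is placed back, so it may repeat.
srswr = D.sample(n=s, replace=True, random_state=42)

# Stratified sampling: draw 10% from each stratum separately, so the
# sample preserves the 70/30 composition of D.
stratified = D.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=42))
print(stratified["stratum"].value_counts())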

[Figure: stratified sampling and cluster sampling applied to the raw data]

Principal Component Analysis

Overview

Suppose a dataset has n attributes. PCA searches for k (k < n)
n-dimensional orthogonal vectors that can best be used to represent the data
Works for numeric data only
Used when the number of attributes is large
PCA can reveal relationships that were not previously expected

See pages 97-98:

cor(X, Y) = cov(X, Y) / [sd(X) sd(Y)]

Reference: Eigenvalues and Eigenvectors - HMC Calculus Tutorial
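A minimal PCA sketch in NumPy tying together the correlation formula and the
eigenvector reference above: standardize the data, eigendecompose the
correlation matrix, and keep the k eigenvectors with the largest eigenvalues.
The synthetic dataset is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # 200 records, n = 3 attributes
X[:, 2] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=200)  # correlated attribute

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each attribute
corr = np.corrcoef(X, rowvar=False)       # cor(X,Y) = cov(X,Y)/[sd(X) sd(Y)]
eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]         # sort components by variance
k = 2
W = eigvecs[:, order[:k]]                 # n x k orthogonal projection
X_reduced = Z @ W                         # 200 x k reduced representation
print(eigvals[order] / eigvals.sum())     # variance fraction per component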

