Data Mining: Concepts and Techniques (3rd ed.)
Chapter 3
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
© 2011 Han, Kamber & Pei. All rights reserved.
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data cleaning
Data integration
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Normalization
Aggregation
Forms of Data Preprocessing
Data Cleaning
Age=42, Birthday=03/07/2010
equipment malfunction
Noisy Data
Binning
first sort the data and partition it into (equal-frequency) bins
then smooth by bin means, by bin medians, or by bin boundaries
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g.,
deal with possible outliers)
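The binning step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample values are hypothetical, and the sketch assumes there are at least as many values as bins.

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative values (an assumption, not taken from the slides).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Smoothing by bin medians follows the same pattern, with `statistics.median` in place of `mean`.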
Exercise
Data Integration
Data integration:
χ² (chi-square) test
$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
Expected = (count(A=ai) * count(B=bj)) / n
The larger the χ² value, the more likely the variables are related
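Chi-Square Calculation: An Example
(numbers in parentheses are the expected counts)
              male        female        Sum (row)
fiction       250 (90)    200 (360)     450
non-fiction   50 (210)    1000 (840)    1050
Sum (col.)    300         1200          1500
$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$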
For this 2×2 table, the degrees of freedom are (2-1)(2-1) = 1. For 1
degree of freedom, the χ² value needed to reject the hypothesis at the
0.001 significance level is 10.828 (using the χ² distribution table).
Since the computed value is above this, we can reject the hypothesis
that gender and preferred reading are independent.
We can conclude that the two attributes are strongly correlated for
the given group of people.
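The calculation for the gender and preferred-reading table can be checked with a short Python sketch. The function name `chi_square` is hypothetical; the expected counts follow the formula Expected = (count(A=ai) * count(B=bj)) / n, and the critical value 10.828 is the one quoted above.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    observed is a list of rows, each row a list of observed counts.
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            stat += (obs - expected) ** 2 / expected
    dof = (len(row_sums) - 1) * (len(col_sums) - 1)
    return stat, dof

# Rows: fiction, non-fiction; columns: male, female.
table = [[250, 200], [50, 1000]]
stat, dof = chi_square(table)

# Critical value for 1 degree of freedom at the 0.001 level, as quoted above.
reject_independence = stat > 10.828
```

For this table the statistic comes out at roughly 507.9 with 1 degree of freedom, far above 10.828, so `reject_independence` is true.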
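Correlation coefficient:
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means (expected values) of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.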
Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence
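A small Python sketch may help make both points concrete: computing the correlation coefficient, and seeing a covariance of 0 between variables that are clearly dependent. The function names and data are hypothetical, and the sketch assumes the sample (n-1) convention for both covariance and standard deviation.

```python
from math import sqrt

def covariance(a, b):
    """Sample covariance of two equal-length sequences (divides by n-1)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

def correlation(a, b):
    """Correlation coefficient: covariance divided by the two standard deviations."""
    std_a = sqrt(covariance(a, a))
    std_b = sqrt(covariance(b, b))
    return covariance(a, b) / (std_a * std_b)

# Perfect linear relationship: correlation is 1 (up to rounding).
r_linear = correlation([1, 2, 3, 4], [2, 4, 6, 8])

# b is fully determined by a (b = a squared), yet the covariance is 0.
a = [-2, -1, 0, 1, 2]
b = [x * x for x in a]
cov_dependent = covariance(a, b)
```

Here `b` depends entirely on `a`, but because the relationship is symmetric rather than linear, the covariance still evaluates to 0.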
Covariance: An Example
Exercise
Data Transformation
Mapping the entire set of values of a given attribute to a new set of replacement
values so that each old value can be identified with one of the new values
Attribute/feature construction
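Normalization
min-max normalization to [new_min_A, new_max_A]:
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$
z-score normalization ($\mu_A$: mean, $\sigma_A$: standard deviation of A):
$$v' = \frac{v - \mu_A}{\sigma_A}$$
Example: $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
normalization by decimal scaling:
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that max(|v'|) < 1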
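The normalization methods can be sketched in Python as follows. The function names are hypothetical. The z-score call reuses the numbers from the example above (value 73,600, mean 54,000, standard deviation 16,000); the minimum and maximum used for min-max and the values used for decimal scaling are assumptions chosen only for illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v linearly from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making every |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# z-score with the numbers from the example above.
z = z_score(73_600, 54_000, 16_000)

# Assumed range of 12,000 to 98,000 for the min-max illustration.
mm = min_max(73_600, 12_000, 98_000)

# Assumed values for the decimal-scaling illustration.
scaled, j = decimal_scaling([-986, 917])
```

With these inputs, `z` is about 1.225, `mm` is about 0.716, and decimal scaling picks j = 3.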
Data Discretization
Why Is Discretization Used?
Binning
Histogram analysis
15 distinct values
province_or_state
city
street
Exercise
Wavelet transforms
Data compression
In data compression, transformations are applied to obtain a
reduced or compressed representation of the original data
Lossless
Lossy
Redundant attributes
Irrelevant attributes
Wavelet Transforms
Histogram Analysis
Divide data into buckets
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
[Figure: example histogram; x-axis ticks from 10000 to 100000, y-axis ticks from 0 to 40]
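The two partitioning rules can be contrasted with a brief Python sketch. The function names and the sample values are hypothetical, and the equal-frequency version assumes there are at least as many values as buckets.

```python
def equal_width_buckets(values, width, start=0):
    """Count values per bucket [lo, lo + width); every bucket has the same range."""
    counts = {}
    for v in values:
        lo = start + ((v - start) // width) * width
        key = (lo, lo + width)
        counts[key] = counts.get(key, 0) + 1
    return dict(sorted(counts.items()))

def equal_frequency_buckets(values, n_buckets):
    """Return (min, max, count) per bucket; every bucket holds about the same count."""
    data = sorted(values)
    size, extra = divmod(len(data), n_buckets)
    buckets, i = [], 0
    for b in range(n_buckets):
        j = i + size + (1 if b < extra else 0)
        chunk = data[i:j]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
        i = j
    return buckets

# Illustrative values (an assumption, not taken from the slides).
values = [1, 3, 4, 6, 8, 12, 15, 21, 22, 35, 36, 39]
by_width = equal_width_buckets(values, width=10)
by_frequency = equal_frequency_buckets(values, n_buckets=4)
```

Equal-width buckets keep the ranges fixed and let the counts vary; equal-frequency buckets keep the counts fixed and let the ranges vary.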
Clustering
Sampling
Cluster sample
Stratified sample
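The sampling schemes can be sketched with Python's standard `random` module. Everything here is illustrative: the function names, the data, the way clusters are formed, and the strata definition are all assumptions, and a seeded `random.Random` is used only so that runs are repeatable.

```python
import random

def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: each item is drawn at most once."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample WITH replacement: an item may be drawn repeatedly."""
    return [rng.choice(data) for _ in range(s)]

def cluster_sample(clusters, m, rng):
    """Pick m whole clusters at random and keep every item they contain."""
    chosen = rng.sample(clusters, m)
    return [item for cluster in chosen for item in cluster]

def stratified_sample(data, stratum_of, fraction, rng):
    """Draw roughly the same fraction from every stratum (at least one item each)."""
    strata = {}
    for item in data:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)                      # fixed seed for repeatability
data = list(range(1, 21))                   # hypothetical tuples 1..20
clusters = [data[i:i + 5] for i in range(0, 20, 5)]

def stratum_of(x):
    """Hypothetical strata: 15 'small' items and 5 'large' items."""
    return "small" if x <= 15 else "large"

wor = srswor(data, 5, rng)
wr = srswr(data, 5, rng)
cl = cluster_sample(clusters, 2, rng)
st = stratified_sample(data, stratum_of, 0.2, rng)
```

The stratified version preserves the proportions of the strata, which matters when the data are skewed and a plain random sample could miss a small group entirely.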
Types of Sampling
String compression
There are extensive theories and well-tuned
algorithms
Typically lossless, but only limited manipulation is
possible without expansion
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Dimensionality and numerosity reduction may also
be considered as forms of data compression
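As a minimal illustration of lossless string compression, the following Python sketch uses the standard-library `zlib` module. The sample text is an assumption chosen because it is highly repetitive; real data will compress less well.

```python
import zlib

# Repetitive sample text (hypothetical), encoded to bytes for compression.
text = ("data cleaning, data integration, data reduction, " * 20).encode("utf-8")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

# Lossless: the original is recovered exactly, byte for byte.
is_lossless = restored == text

# Fraction of the original size that the compressed form occupies.
ratio = len(compressed) / len(text)
```

Note that the compressed bytes cannot be searched or edited directly; they have to be expanded first, which is the "limited manipulation without expansion" point above.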
Exercise
Using the data for age below
13 15 16 16 19 20 20 21 22 22 25 25 25 25 30 33 33
35 35 35 35 36 40 45 46 52 70
Plot an equal-width histogram of width 10.
Sketch examples of each of the following sampling
techniques: SRSWOR, SRSWR, cluster sampling and
stratified sampling. Use samples of size 5 and the
strata youth, middle-aged and senior.