Data

Data Mining: Concepts and Techniques (3rd ed.)
Chapter 3
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
© 2011 Han, Kamber & Pei. All rights reserved.
1
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Summary
2
Data Quality: Why Preprocess the Data?
Many factors make up data quality. Measures of data quality include:
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable
Consistency: some modified but some not
Timeliness: timely update?
Believability: how much the data are trusted by users
Interpretability: how easily the data can be understood
3
Major Tasks in Data Preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration: integration of multiple databases or files
Data reduction: dimensionality reduction, numerosity reduction, data compression
Data transformation and data discretization: normalization, aggregation
4
Forms of Data Preprocessing
February 19, 2008
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., Occupation = "" (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary = "-10" (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age = "42", Birthday = "03/07/2010"
Was rating "1, 2, 3", now rating "A, B, C"
Intentional (e.g., disguised missing data)
Jan. 1 as everyone's birthday?
7
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
data that was inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not being considered important at the time of entry
history or changes of the data not being registered
Missing data may need to be inferred
8
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious and often infeasible
Fill it in automatically with
a global constant: e.g., "unknown" (a new class?!)
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or decision tree
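The two mean-based strategies can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the record layout, the attribute names `income` and `risk`, and the helper names are all assumptions made up for the example, and missing values are assumed to be stored as `None`.

```python
from statistics import mean

def fill_with_mean(rows, attr):
    """Replace missing values (None) of attr with the overall attribute mean."""
    known = [r[attr] for r in rows if r[attr] is not None]
    m = mean(known)
    return [dict(r, **{attr: m}) if r[attr] is None else dict(r) for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean of the tuple's own class.

    Assumes every class has at least one known value for attr.
    """
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    filled = []
    for r in rows:
        r2 = dict(r)
        if r2[attr] is None:
            r2[attr] = class_means[r2[label]]
        filled.append(r2)
    return filled

# Hypothetical customer tuples; None marks a missing income.
customers = [
    {"income": 30000, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 50000, "risk": "high"},
    {"income": 90000, "risk": "low"},
    {"income": None, "risk": "low"},
]
global_filled = fill_with_mean(customers, "income")
class_filled = fill_with_class_mean(customers, "income", "risk")
```

With the overall mean, both gaps receive the same value; with the class-conditional mean, the "high" tuple gets the average of the other "high" incomes and the "low" tuple gets the "low" average, which is why the slide calls it the smarter choice.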
9
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
10
How to Handle Noisy Data?
Binning
first sort the data and partition it into (equal-frequency) bins
then smooth by bin means, bin medians, bin boundaries, etc.
Regression
smooth by fitting the data to regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and have a human check them (e.g., deal with possible outliers)
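The binning approach can be sketched as follows. This is a minimal illustration under stated assumptions: the price list is example data chosen for the sketch (it does not come from these slides), and the function names are hypothetical.

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative sorted prices (assumed data).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
```

Smoothing by bin medians would follow the same pattern with `statistics.median` in place of `mean`.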
11
Binning Methods for Data Smoothing
12
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule, and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., Potter's Wheel)
13
Exercise
14
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Entity identification problem:
Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton, Cust-id = Cust-#
Data value conflicts
For the same real-world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units
16
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases
The same attribute or object may have different names in different databases
One attribute may be a derived attribute in another table, e.g., age
Redundant attributes may be detected by correlation analysis and covariance analysis
Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
17
Correlation Analysis (Nominal Data)
χ² (chi-square) test
\[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]
Expected = (count(A = a_i) × count(B = b_j)) / n
The χ² statistic tests the hypothesis that A and B are independent, i.e., there is no correlation between them
The test is based on a significance level, with (r-1)(c-1) degrees of freedom, where r and c are the numbers of rows and columns of the contingency table
If the hypothesis can be rejected, then we say that A and B are statistically correlated
The larger the χ² value, the more likely the variables are related
Correlation does not imply causality
18
Chi-Square Calculation: An Example

              male        female        Sum (row)
fiction       250 (90)    200 (360)     450
non-fiction   50 (210)    1000 (840)    1050
Sum (col.)    300         1200          1500

χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)
\[ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93 \]
For this 2 × 2 table, the degrees of freedom are (2-1)(2-1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828 (using a χ² distribution table)
Since the computed value is above this, we can reject the hypothesis that gender and preferred reading are independent
We can conclude that the two attributes are strongly correlated for the given group of people
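The calculation for the gender and preferred-reading table can be checked with a short script. This is a sketch only: the function name `chi_square` is an assumption for illustration, and the critical value 10.828 is taken from the text rather than computed.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    observed is a list of rows of observed counts.
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (o - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof

# Rows: fiction, non-fiction; columns: male, female.
observed = [[250, 200], [50, 1000]]
chi2, dof = chi_square(observed)
reject_independence = chi2 > 10.828  # critical value at the 0.001 level, 1 d.o.f.
```

The statistic comes out at roughly 507.94, which agrees with the 507.93 quoted above up to rounding, and it is far above the critical value.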
20
Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson's product-moment coefficient)
\[ r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B} \]
where n is the number of tuples, \(\bar{A}\) and \(\bar{B}\) are the respective means of A and B, \(\sigma_A\) and \(\sigma_B\) are the respective standard deviations of A and B, and \(\sum(a_i b_i)\) is the sum of the AB cross-product.
If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
r_{A,B} = 0: uncorrelated (no linear relationship, though not necessarily independent); r_{A,B} < 0: negatively correlated

21
Covariance (Numeric Data)
Covariance is similar to correlation
\[ Cov(A,B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n} \]
Correlation coefficient:
\[ r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B} \]
where n is the number of tuples, \(\bar{A}\) and \(\bar{B}\) are the respective means or expected values of A and B, and \(\sigma_A\) and \(\sigma_B\) are the respective standard deviations of A and B.
Positive covariance: If Cov_{A,B} > 0, then if A is larger than its expected value, B is also likely to be larger than its expected value.
Negative covariance: If Cov_{A,B} < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
Independence: if A and B are independent, then Cov_{A,B} = 0, but the converse is not true:
Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
22
Co-Variance: An Example
It can be simplified in computation as
\[ Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B} \]
Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 - 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
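The stock example can be reproduced in Python. The sketch below uses the simplified form E(A·B) minus the product of the means for the covariance; the function names are assumptions for illustration. The correlation helper is written as a ratio of sums of deviations, so the choice between n and n-1 in the normalization cancels out.

```python
from math import sqrt

def covariance(a, b):
    """Population covariance via E(A*B) - mean(A) * mean(B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sxy = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sxx = sum((x - mean_a) ** 2 for x in a)
    syy = sum((y - mean_b) ** 2 for y in b)
    return sxy / sqrt(sxx * syy)

stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]
cov_ab = covariance(stock_a, stock_b)
r_ab = pearson(stock_a, stock_b)
```

The covariance evaluates to 4 (up to floating-point error), and the correlation coefficient is about 0.94, so the two prices move together strongly.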
Exercise
Data Transformation
Mapping the entire set of values of a given attribute to a new set of replacement values so that each old value can be identified with one of the new values
Strategies for data transformation include the following:
Smoothing: remove noise from data. Techniques include binning, regression, and clustering.
Attribute/feature construction: new attributes are constructed from the given ones
Aggregation: summary or aggregation operations are applied to the data
Normalization: values are scaled to fall within a smaller, specified range, such as -1.0 to 1.0
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: raw values of numeric attributes (e.g., age) are replaced by interval labels (e.g., 0-10, 11-20, etc.) or conceptual labels (e.g., youth, adult, senior)
Concept hierarchy generation: attributes such as street can be generalized to higher-level concepts, like city or country
26
Normalization
Min-max normalization: performs a linear transformation on the original data, mapping values to [new_min_A, new_max_A]
\[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
Ex. Let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
Z-score normalization: the values of an attribute A are normalized based on the mean and standard deviation of A (μ_A: mean, σ_A: standard deviation):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
Ex. Let μ = 54,000, σ = 16,000. Then
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
Normalization by decimal scaling: normalizes by moving the decimal point of values of attribute A. A value v of A is normalized to v' by
\[ v' = \frac{v}{10^j} \]
where j is the smallest integer such that Max(|v'|) < 1
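The three normalization methods can be sketched as small Python functions and checked against the income example. The function and parameter names are assumptions chosen for this illustration, and the three-value income list passed to the decimal-scaling helper is just the numbers quoted in the example.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

mm = min_max(73600, 12000, 98000)   # about 0.716
z = z_score(73600, 54000, 16000)    # 1.225
scaled, j = decimal_scaling([12000, 73600, 98000])
```

For the income values the largest magnitude is 98,000, so the loop stops at j = 5 and 73,600 becomes 0.736.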

27
Data Discretization
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
12/29/16
Data Mining: Concepts and

Techniques
28
Why Is Discretization Used?
Reduce data size.
Transform quantitative data into qualitative data.
29
Data Discretization Methods
Typical methods (all can be applied recursively):
Binning (unsupervised, top-down split)
Histogram analysis (unsupervised, top-down split)
Clustering analysis (unsupervised, top-down split or bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)

30
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values) hierarchically
Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple granularities
Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
Concept hierarchies can be automatically formed for both numeric and nominal data.
31
Concept Hierarchy Generation for Nominal Data
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
32
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
Exceptions, e.g., weekday, month, year
country: 15 distinct values
province_or_state: 365 distinct values
city: 3567 distinct values
street: 674,339 distinct values
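The distinct-value heuristic amounts to sorting the attributes by their number of distinct values. The sketch below is illustrative only: the function name is an assumption, and the counts are the ones listed for the location attributes.

```python
def generate_hierarchy(distinct_counts):
    """Order attributes from the top level (fewest distinct values)
    to the lowest level (most distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get)

counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}
hierarchy = generate_hierarchy(counts)
# Print lowest level first, matching the "street < city < ..." notation.
print(" < ".join(reversed(hierarchy)))
```

As the exceptions note suggests, the heuristic is only a starting point: for weekday, month, and year it would put year (many distinct values) below weekday (seven), which is not a meaningful hierarchy, so the result should be reviewed by a person.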

33
Exercise
For the following group of data: 200, 300, 400, 600, 1000, use the following methods to normalize the values.
min-max normalization
z-score normalization
normalization by decimal scaling
34
Data Reduction Strategies
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Why data reduction? A database or data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
Data reduction strategies
Dimensionality reduction
In dimensionality reduction, data encoding schemes are applied so as to obtain a reduced or compressed representation of the original data.
Wavelet transforms
Principal Components Analysis (PCA)
Attribute subset selection, attribute creation
Numerosity reduction
The data are replaced by alternative, smaller representations using parametric or nonparametric models
Regression and log-linear models
Histograms, clustering, sampling
Data cube aggregation
Data compression
In data compression, transformations are applied so as to obtain a reduced or compressed representation of the original data
Lossless
Lossy
37
Attribute Subset Selection
Attribute subset selection reduces the data size by removing:
Redundant attributes
Irrelevant attributes
These contain no information that is useful for the data mining task.
E.g., a student's ID is often irrelevant to the task of predicting the student's GPA
38
Attribute Subset Selection

Greedy methods for attribute subset selection
39
Attribute Creation (Feature Generation)
Create new attributes (features) that can capture the important information in a data set more effectively than the original ones
Attribute construction can help to improve accuracy and understanding of structure in high-dimensional data
40
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients.
The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes.
Wavelet transforms can be applied to multidimensional data such as a data cube. This is done by first applying the transform to the first dimension, then to the second, and so on.
Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes.
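As a rough illustration of how a DWT can support data reduction, the sketch below uses the Haar wavelet in its simple averaging-and-differencing form. The slides do not name a particular wavelet family, so the Haar choice, the unnormalized coefficients, the power-of-two length requirement, the function names, and the sample vector are all assumptions of this example.

```python
def haar_dwt(x):
    """Unnormalized Haar transform by repeated pairwise averaging and differencing.

    Assumes len(x) is a power of two. The output has the same length as x:
    the overall average first, then detail coefficients from coarse to fine.
    """
    coeffs = []
    current = list(x)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        details = [(current[i] - current[i + 1]) / 2 for i in pairs]
        coeffs = details + coeffs  # coarser details go in front
        current = averages
    return current + coeffs

def keep_largest(coeffs, k):
    """Zero out all but the k coefficients of largest magnitude."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

x = [2, 2, 0, 2, 3, 5, 4, 4]   # illustrative data vector
x_prime = haar_dwt(x)
compressed = keep_largest(x_prime, 4)
```

The transformed vector has the same length as the input; the reduction comes from storing only the strongest coefficients (here four of eight) and treating the rest as zero, which yields an approximation of the original data when the transform is inverted.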
41
Principal components analysis
Principal components analysis (PCA; also called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.
The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.
PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data.
In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.
42
Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation
43
Regression and log-linear models can be used to approximate the given data. In linear regression, the data are modeled to fit a straight line.
Linear regression: Y = wX + b
Two regression coefficients, w and b, specify the line and are estimated from the data at hand
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
Useful for dimensionality reduction and data smoothing
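For the linear regression case, w and b are commonly estimated by ordinary least squares. The sketch below assumes that estimator (the slides do not say which one to use) and uses made-up points that lie exactly on a line, so the result is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Least-squares estimates of w and b in Y = w*X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

# Hypothetical data lying on Y = 2X + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
w, b = fit_line(xs, ys)
approx_y_at_6 = w * 6 + b
```

In numerosity-reduction terms, only the two parameters w and b need to be stored; individual Y values are then approximated from X when needed.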
44
Histogram Analysis
Divide data into buckets
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
[Figure: histogram; horizontal axis from 10,000 to 100,000 in steps of 10,000, vertical axis from 0 to 40]
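The two partitioning rules can be contrasted on a small data set. This is a sketch with invented values and hypothetical function names; the equal-width version assumes non-negative integers and omits empty buckets.

```python
from collections import Counter

def equal_width_histogram(values, width):
    """Count values per bucket [lo, lo + width); empty buckets are omitted."""
    counts = Counter((v // width) * width for v in values)
    return {(lo, lo + width): counts[lo] for lo in sorted(counts)}

def equal_frequency_buckets(values, n_buckets):
    """Split the sorted values into buckets holding (nearly) the same number of items."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_buckets:(i + 1) * n // n_buckets]
            for i in range(n_buckets)]

# Hypothetical values for illustration.
values = [5, 7, 12, 14, 18, 21, 22, 25, 29, 30, 38, 40]
by_width = equal_width_histogram(values, 10)
by_frequency = equal_frequency_buckets(values, 4)
```

Equal-width buckets share the same range but hold different counts, whereas equal-frequency buckets hold the same count but span ranges of different widths.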
45
Clustering
Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
Can use hierarchical clustering, stored in multidimensional index tree structures
46
Sampling
Sampling: obtaining a small sample s to represent the whole data set N
Key principle: choose a representative subset of the data
Common ways of sampling:
Simple random sample without replacement (SRSWOR) of size s
Simple random sample with replacement (SRSWR) of size s
Cluster sample
Stratified sample
47
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular item
Sampling without replacement
Once an object is selected, it is removed from the population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
Used in conjunction with skewed data
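These sampling schemes map directly onto Python's standard `random` module. The sketch below is illustrative: the customer tuples, the strata names, the 25% fraction, and the helper names are assumptions, and a seeded `random.Random` instance is used only to make runs repeatable.

```python
import random

def srswor(data, s, rng):
    """Simple random sample without replacement: no item is drawn twice."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample with replacement: an item may be drawn again."""
    return [rng.choice(data) for _ in range(s)]

def stratified_sample(data, stratum_of, fraction, rng):
    """Draw roughly the same fraction from every stratum (at least one item each)."""
    strata = {}
    for item in data:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)
# Hypothetical skewed population: 36 "youth" customers and 4 "senior" customers.
customers = [(i, "youth") for i in range(36)] + [(i, "senior") for i in range(36, 40)]

without_repl = srswor(customers, 5, rng)
with_repl = srswr(customers, 5, rng)
stratified = stratified_sample(customers, lambda c: c[1], 0.25, rng)
```

With such a skewed population, a simple random sample of size 5 can easily contain no seniors at all, while the stratified sample is guaranteed to include each group in proportion.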
48
Sampling: With or without Replacement
49
Data Cube Aggregation
Summarize (aggregate) data based on dimensions
The resulting data set is smaller in volume, without loss of the information necessary for the analysis task
Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple levels of abstraction
50
Data Reduction 3: Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless, but only limited manipulation is possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
Dimensionality and numerosity reduction may also be considered forms of data compression
51
Exercise
Using the data for age below:
13 15 16 16 19 20 20 21 22 22 25 25 25 25 30 33 33 35 35 35 35 36 40 45 46 52 70
Plot an equal-width histogram of width 10.
Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR, cluster sampling, and stratified sampling. Use samples of size 5 and the strata youth, middle-aged, and senior.
52
