
Acknowledgement
The material in these slides is based on:
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed. Academic Press, 2009.
C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
Lectures of Prof. Andrew W. Moore, School of Computer Science,
Carnegie Mellon University
And other internet resources
Non-Parametric Density Estimation
Introduction
Histogram-based estimation
Parzen windows estimation
Nearest neighbor estimation
Introduction
In supervised learning we assumed that the parametric forms of the underlying (class-conditional) density functions were known.
In practical pattern recognition, however, this assumption is often doubtful.
Most classical parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multimodal densities.
Non-parametric techniques can be used with arbitrary distributions, without knowing the parametric form of the underlying (class-conditional) densities.
Density Estimation and Parzen Window
Types of non-parametric methods
1. Estimation of the density functions p(x | ω_i) using sample patterns.
2. Estimation of the a posteriori probabilities P(ω_i | x) directly from sample patterns or prototypes.
We will consider two approaches for both types of
non-parametric methods above:
1. Parzen-window-based;
2. k-Nearest neighbor-based;


Density estimation
Density estimation: estimating the probability density function p(x) from a given set of training samples D = {x_1, ..., x_n}. The estimated density is denoted p̂(x).
Assume that the training samples are independent and identically distributed (i.i.d.): each has the same probability distribution as the others, all are mutually independent, and all are distributed according to p(x).
The difference between parametric and non-parametric estimation:
1. In the non-parametric case we estimate a whole function, the density p(x), instead of a parameter vector.
2. We have a finite number of training samples, which means there will be some error in the estimated function.
Histograms
Histograms are the simplest approach to density estimation.
The feature space is divided into m equal-sized cells (bins) B_i.
The number n_i of training samples falling into each bin B_i, i = 1, ..., m, is counted, and the estimate within that bin is

p̂(x) = n_i / (n V),

where n is the total number of samples and V is the volume of a cell. (All cells have equal volume, so no index is needed on V.)
The histogram estimate is not a very good way to estimate
densities, especially when there are many features. It leads
to discontinuous density estimates.
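As a concrete illustration, here is a minimal NumPy sketch of a one-dimensional histogram density estimate; the data and the bin count are my own toy choices, not from the slides:

```python
import numpy as np

# Draw n i.i.d. samples from some density (here a standard normal).
rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)

# Divide the feature space into m equal-sized bins.
m = 20
edges = np.linspace(samples.min(), samples.max(), m + 1)
counts, _ = np.histogram(samples, bins=edges)

# Histogram estimate inside bin i: p(x) = n_i / (n * V),
# where V is the common bin volume (here, the bin width).
V = edges[1] - edges[0]
p_hat = counts / (len(samples) * V)

# The estimate integrates to 1: the sum over bins of p_hat * V equals 1.
print(p_hat.sum() * V)
```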

[Figures: examples of histogram density estimates]
Density estimation







In words: place a segment of length h at x and count the points that fall inside it. With k_N of the N total points in the interval [x − h/2, x + h/2],

P ≈ k_N / N  and  p̂(x) ≈ (1/h)(k_N / N)  for |x' − x| ≤ h/2.

If p(x) is continuous, then p̂(x) → p(x) as N → ∞, provided that h_N → 0, k_N → ∞ and k_N / N → 0.
Density estimation
Consider the problem of selecting k labeled balls without replacement from an urn containing n balls.
In how many different ways may we select those k balls? There are

C(n, k) = n! / (k! (n − k)!)

combinations (subsets) containing k elements, where n! = n(n − 1) ⋯ 2 · 1 and 0! = 1.
Density estimation
Assume that we wish to estimate the value of the density function p at x based on training samples x_1, ..., x_n.
If we consider a region R around x, then the probability that a training sample x_j falls in the region R is

P = ∫_R p(x') dx'.    ----- (1)

P is an averaged version of the density function p(x).
Assume that k of the n training samples fall in the region R. The probability that exactly k of the samples fall into R is given by the binomial law

P_k = C(n, k) P^k (1 − P)^(n−k).    ----- (2)
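As a quick numeric illustration of equation (2), a short sketch (the sample size, count and region probability below are made up):

```python
from math import comb

n, k, P = 100, 5, 0.04  # hypothetical values: n samples, region probability P
P_k = comb(n, k) * P**k * (1 - P)**(n - k)
print(P_k)  # probability that exactly k of the n samples fall in R (about 0.16)
```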
Density estimation
The expected value of k is

E[k] = n P.    ----- (3)

Estimating E[k] by the observed k leads to the estimate P ≈ k/n. The ratio k/n is therefore a good estimate for the probability P, and hence for the density function p.
Assume p(x) is continuous and that the region R is so small that p does not vary significantly within it; then we can write

∫_R p(x') dx' ≈ p(x) V,    ----- (4)

where x is a point within R and V is the volume enclosed by R.
Combining equations (1), (3) and (4):

p̂(x) ≈ (k/n) / V.    ----- (5)
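A quick numerical sanity check of equation (5), with made-up data: estimate a standard normal density at x = 0 by counting samples in a small region around it.

```python
import numpy as np

rng = np.random.default_rng(1)
x, h = 0.0, 0.2                # evaluation point and region width (volume V = h)
samples = rng.standard_normal(100_000)

k = np.sum(np.abs(samples - x) <= h / 2)  # number of samples falling in R
p_est = (k / len(samples)) / h            # equation (5): p(x) ~ (k/n) / V
print(p_est)                              # close to 1/sqrt(2*pi) ~ 0.399
```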
Density estimation
Variance: the smaller h is, the higher the variance of the estimate.

[Figures: estimates with h = 0.1, N = 1000 and with h = 0.8, N = 1000]
Density estimation

[Figure: estimate with h = 0.1, N = 10000]

The larger N is, the better the accuracy.
[Figure: density estimation example]
CURSE OF DIMENSIONALITY
In all the methods so far, we saw that the higher the number of points N, the better the resulting estimate.
If in a one-dimensional space an interval filled with N points is adequate (for good estimation), then in a two-dimensional space the corresponding square will require N^2 points, and in an ℓ-dimensional space the ℓ-dimensional hypercube will require N^ℓ points.
This exponential increase in the number of necessary points is known as the curse of dimensionality. It is a major problem one is confronted with in high-dimensional spaces.
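To see the growth concretely (toy numbers of my own): if N = 100 points suffice in one dimension, the ℓ-dimensional analogue needs N^ℓ.

```python
N = 100
for dim in (1, 2, 3, 5, 10):
    # points needed grow exponentially with the dimension
    print(dim, N**dim)
```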
Density estimation
Condition for convergence
The fraction k/(nV) is a space-averaged value of p(x); p(x) itself is obtained only in the limit V → 0.

If V → 0 with n fixed and no samples are included in R (k = 0), then

lim_{V→0, k=0} p̂(x) = 0:

it is an uninteresting case!

If V → 0 and one or more samples coincide with x (k ≠ 0), then

lim_{V→0, k≠0} p̂(x) = ∞.

In this case the estimate diverges: it is an uninteresting case!
Density estimation
The volume V needs to approach 0 anyway if we want p̂(x) to converge to p(x).

In practice, V cannot be allowed to become arbitrarily small, since the number of samples is always limited.

One will therefore have to accept a certain amount of variance in the ratio k/n.

Unlimited number of samples
Theoretically, if an unlimited number of samples is available, we can circumvent this difficulty.
To estimate the density at x, we form a sequence of regions R_1, R_2, ... containing x: the first region contains one sample, the second two samples, and so on.
Let V_n be the volume of R_n, k_n the number of samples falling in R_n, and p_n(x) the n-th estimate for p(x):

p_n(x) = (k_n / n) / V_n    ----- (7)
Unlimited number of samples
For p_n(x) to converge to p(x),

lim_{n→∞} p_n(x) = p(x),

three conditions are required:

1. lim_{n→∞} V_n = 0
2. lim_{n→∞} k_n = ∞
3. lim_{n→∞} k_n / n = 0

The first condition assures that the space-averaged P/V converges to p(x).
The second condition assures that the frequency ratio converges to the probability P.
The third condition states that the number of samples falling in region R_n, although it grows, remains a negligible fraction of the total number of samples; this is required for p_n(x) to converge.
Unlimited number of samples
There are two ways of obtaining sequences of regions that satisfy these constraints, so that p_n(x) → p(x):

1. Shrink an initial region by specifying V_n as some function of n, e.g. V_n = 1/√n. This is called the Parzen-window estimation method.

2. Specify k_n as some function of n, e.g. k_n = √n, and let V_n grow until it encloses the k_n nearest neighbors of x. This is called the k_n-nearest-neighbor estimation method.
Parzen windows - An example
Assume that the region R_n is a d-dimensional hypercube.
If h_n is the length of a side of the hypercube, its volume is given by V_n = h_n^d.

We can obtain an analytic expression for k_n, the number of samples falling into the hypercube, by defining the following window function:

φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, ..., d, and φ(u) = 0 otherwise.

φ((x − x_i)/h_n) is then equal to unity if x_i falls within the hypercube of volume V_n centered at x, and equal to zero otherwise.
Parzen windows - An example
The number of samples in this hypercube is therefore given by

k_n = Σ_{i=1}^{n} φ((x − x_i)/h_n),

since φ((x − x_i)/h_n) = 1 if x_i falls in the hypercube of volume V_n centered at x, and φ((x − x_i)/h_n) = 0 otherwise.

By substituting k_n into equation (7), we obtain the following estimate:

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n)

p_n(x) estimates p(x) as an average of functions of x and the samples x_i (i = 1, ..., n). These functions can be general!


Parzen windows

The Parzen-window density estimate using n training samples and the window function φ is defined by

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n)

The estimate p_n(x) is an average of (window) functions. Usually the window function has its maximum at the origin and its values become smaller as we move further away from the origin. Each training sample then contributes to the estimate in accordance with its distance from x.
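To make the formula concrete, here is a minimal NumPy sketch of a one-dimensional Parzen-window estimator; the Gaussian window and the demo data are my own choices, not from the slides:

```python
import numpy as np

def parzen_estimate(x, samples, h):
    # p_n(x) = (1/n) * sum_i (1/h) * phi((x - x_i) / h), using a Gaussian
    # window phi(u) = exp(-u**2 / 2) / sqrt(2 * pi).
    u = (x - samples) / h
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return np.mean(phi / h)

rng = np.random.default_rng(2)
samples = rng.standard_normal(500)           # training set drawn from N(0, 1)
print(parzen_estimate(0.0, samples, h=0.3))  # near 1/sqrt(2*pi) ~ 0.399
```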
Illustration

The behavior of the Parzen-window method: case where p(x) ~ N(0, 1).
Let φ(u) = (1/√(2π)) exp(−u²/2) and h_n = h_1/√n (n > 1), where h_1 is a known parameter. Thus

p_n(x) = (1/n) Σ_{i=1}^{n} (1/h_n) φ((x − x_i)/h_n)

is an average of normal densities centered at the samples x_i.
Parzen windows

Numerical results: for n = 1 and h_1 = 1,

p_1(x) = φ(x − x_1) = (1/√(2π)) exp(−(x − x_1)²/2) ~ N(x_1, 1).

For n = 10 and h = 0.1, the contributions of the individual samples are clearly observable!
[Figures: Parzen-window estimates of N(0, 1) for various n and h_1]

Analogous results are also obtained in two dimensions, as illustrated:

[Figures: two-dimensional Parzen-window estimates]
Case where p(x) = w_1 U(a, b) + w_2 T(c, d) (unknown density): a mixture of a uniform and a triangle density with mixing weights w_1 and w_2.
[Figures: Parzen-window estimates of the mixture density for various n and h]
Parzen windows example
Question: given the five data points x_1 = 2, x_2 = 2.5, x_3 = 3, x_4 = 1 and x_5 = 6, find the Parzen probability density function (pdf) estimate at x = 3, using the Gaussian function with σ = 1 as the window function.

Parzen windows example

Contribution of each sample (Gaussian window, σ = 1):

x_1: (1/√(2π)) exp(−(2 − 3)²/2) = 0.2420
x_2: (1/√(2π)) exp(−(2.5 − 3)²/2) = 0.3521
x_3: (1/√(2π)) exp(−(3 − 3)²/2) = 0.3989
x_4: (1/√(2π)) exp(−(1 − 3)²/2) = 0.0540
x_5: (1/√(2π)) exp(−(6 − 3)²/2) = 0.0044

p(x = 3) = (0.2420 + 0.3521 + 0.3989 + 0.0540 + 0.0044) / 5 = 0.2103
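The arithmetic can be checked with a few lines of Python:

```python
import numpy as np

points = np.array([2.0, 2.5, 3.0, 1.0, 6.0])
x = 3.0

# Gaussian window contributions phi(x - x_i) with sigma = 1
phi = np.exp(-0.5 * (x - points)**2) / np.sqrt(2 * np.pi)
print(np.round(phi, 4))  # [0.242  0.3521 0.3989 0.054  0.0044]
print(phi.mean())        # 0.2103, matching the slide
```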
Parzen windows notes
The window width h_n (equivalently, the volume V_n) is the most critical parameter in the Parzen-window approach.
It is typically selected by cross-validation: a portion of the training set is held out to form a validation set.
The classifier is trained using different values of h_n, and the h_n that results in the smallest error on the validation set is selected as the best one.
This technique is commonly used with algorithms that have parameters to tune.
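For illustration, one possible scikit-learn sketch of bandwidth selection by cross-validation (assuming scikit-learn is available; the parameter grid and data are my own):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 1))  # training samples, shape (n, d)

# Try several window widths; the CV score is the held-out log-likelihood.
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.linspace(0.05, 1.0, 20)},
                    cv=5)
grid.fit(X)
print(grid.best_params_["bandwidth"])  # the h with the best validation score
```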
Classification example
In classifiers based on Parzen-window estimation:

We estimate the densities for each category and
classify a test point by the label corresponding to the
maximum posterior

The decision region for a Parzen-window classifier
depends upon the choice of window function as
illustrated in the following figure.
[Figure: decision regions of a Parzen-window classifier for different window functions]
Example of Nearest Neighbor Rule
Two-class problem: yellow triangles and blue squares. The circle represents the unknown sample x; since its nearest neighbor comes from class ω_1, it is labeled as class ω_1.
Example of the k-NN rule with k = 3

There are two classes: yellow triangles and blue squares. The circle represents the unknown sample x; since two of its three nearest neighbors come from class ω_2, it is labeled as class ω_2.

The number k should be:
1) large, to minimize the probability of misclassifying x;
2) small (with respect to the number of samples), so that the points are close enough to x to give an accurate estimate of its true class.
k_n-Nearest Neighbor Estimation

Goal: a solution to the problem of the unknown best window function.
Let the cell volume be a function of the training data: center a cell about x and let it grow until it captures k_n samples, where k_n = f(n). These samples are called the k_n nearest neighbors of x.

Two possibilities can occur:
- If the density is high near x, the cell will be small, which provides good resolution.
- If the density is low, the cell will grow large, stopping only when higher-density regions are reached.

We can obtain a family of estimates by setting k_n = k_1 √n and choosing different values for k_1.
Illustration

For k_n = √n = 1 (i.e. n = 1), the estimate becomes

p_n(x) = k_n / (n V_n) = 1 / V_1 = 1 / (2 |x − x_1|)
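Here is a minimal NumPy sketch of the k_n-nearest-neighbor density estimate in one dimension (a toy implementation of the slide's definition; the names and data are my own):

```python
import numpy as np

def knn_density(x, samples, k):
    # Grow an interval around x until it holds k samples, then return
    # p(x) = k / (n * V), with V the length of that interval.
    dists = np.sort(np.abs(samples - x))
    V = 2 * dists[k - 1]  # interval just enclosing the k-th nearest neighbor
    return k / (len(samples) * V)

rng = np.random.default_rng(4)
samples = rng.standard_normal(1000)
k = int(np.sqrt(len(samples)))       # k_n = sqrt(n)
print(knn_density(0.0, samples, k))  # near 1/sqrt(2*pi) ~ 0.399
```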
[Figures: k_n-nearest-neighbor density estimates, e.g. k = 10, N = 200]
k_n-Nearest Neighbor Estimation

Estimation of a posteriori probabilities.
Goal: estimate P(ω_i | x) from a set of n labeled samples.
Place a cell of volume V around x and capture k samples. If k_i samples among the k turn out to be labeled ω_i, then

p_n(x, ω_i) = (k_i / n) / V,

and an estimate for P_n(ω_i | x) is

P_n(ω_i | x) = p_n(x, ω_i) / Σ_{j=1}^{c} p_n(x, ω_j) = k_i / k.
k_n-Nearest Neighbor Estimation

k_i / k is the fraction of the samples within the cell that are labeled ω_i.

For minimum error rate, the most frequently represented category within the cell is selected.

If k is large and the cell sufficiently small, the performance will approach the best possible.
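In code, the rule "select the most frequent label among the k nearest samples" might look like this (a sketch with invented toy data):

```python
import numpy as np
from collections import Counter

def knn_posteriors(x, X, y, k):
    # Estimate P(w_i | x) as k_i / k over the k nearest labeled samples.
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    counts = Counter(y[idx])
    return {label: c / k for label, c in counts.items()}

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)  # class labels w_1 and w_2

post = knn_posteriors(np.array([2.5, 2.5]), X, y, k=7)
print(post, max(post, key=post.get))  # posteriors and the selected class
```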
Nearest Neighbor

The nearest neighbor rule:
Let D_n = {x_1, x_2, ..., x_n} be a set of n labeled prototypes, and let x' ∈ D_n be the prototype closest to a test point x.
The nearest-neighbor rule for classifying x is to assign it the label associated with x'.
Nearest Neighbor

The nearest-neighbor rule is a sub-optimal procedure: it leads to an error rate greater than the minimum possible, the Bayes rate.
However, if the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate (this can be demonstrated!).
As n → ∞, it is always possible to find an x' sufficiently close to x so that

P(ω_i | x') ≈ P(ω_i | x).
Bounds on the Error Rate of the k-Nearest Neighbor Rule

As k gets larger, the error rate approaches the Bayes rate; at the same time, k should remain a small fraction of the total number of samples.
Nearest Neighbor

If P(ω_m | x) ≈ 1, then the nearest-neighbor selection is almost always the same as the Bayes selection.
Nearest Neighbor

The nearest-neighbor classifier effectively partitions the feature space into cells, each consisting of all points closer to a given training point x' than to any other training point.
All points in such a cell are thus labeled by the category of that training point: a Voronoi tessellation of the space.
The k Nearest-Neighbor Rule
Classify x by assigning it the label most frequently represented among its k nearest samples, i.e. use a voting scheme.
Example

Example: k = 3 (odd value) and x = (0.10, 0.25)^t.

Prototypes and labels:
(0.15, 0.35) : ω_1
(0.10, 0.28) : ω_2
(0.09, 0.30) : ω_5
(0.12, 0.20) : ω_2

The closest vectors to x, with their labels, are:
{(0.10, 0.28, ω_2); (0.12, 0.20, ω_2); (0.15, 0.35, ω_1)}

One voting scheme assigns the label ω_2 to x, since ω_2 is the most frequently represented.