
Acknowledgement
The material in these slides is based on:
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed. Academic Press, 2009.
C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
Lectures of Prof. Andrew W. Moore, School of Computer Science,
Carnegie Mellon University
And other internet resources
Non-Parametric Density Estimation
Introduction
Histogram-based estimation
Parzen windows estimation
Nearest neighbor estimation
Introduction
In supervised learning we assumed that the parametric forms of the underlying (class-conditional) density functions were known.
In practical pattern recognition, however, this assumption is often doubtful.
Most classical parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multimodal densities.
Non-parametric techniques can be used with arbitrary distributions, without knowing the parametric form of the underlying (class-conditional) densities.
Density Estimation and Parzen Window
Types of non-parametric methods
1. Estimation of the density functions p(x | ω_i) using sample patterns.
2. Estimation of the a posteriori probabilities P(ω_i | x) directly from sample patterns or prototypes.
We will consider two approaches for both types of
non-parametric methods above:
1. Parzen-window-based;
2. k-Nearest neighbor-based;


Density estimation
Density estimation: estimating the probability density function p(x) from a given set of training samples D = {x_1, ..., x_n}. The estimated density is denoted p̂(x).
Assume that the training samples are independent and identically distributed (i.i.d.): each has the same probability distribution as the others, all are mutually independent, and all are distributed according to p(x).
The difference between parametric and non-parametric estimation:
1. In the non-parametric case we estimate a whole function, the density p(x), instead of a parameter vector.
2. We have a finite number of training samples, which means there will be some error in the estimated function.
Histograms
Histograms are the simplest approach to density estimation.
The feature space is divided into m equal-sized cells (bins) B_i.
The number n_i of training samples falling into each bin B_i, i = 1, ..., m, is counted, and the estimate within that bin is

p̂(x) = n_i / (n V),

where n is the total number of samples and V is the volume of a cell. (All cells have equal volume, so no index is needed on V.)
The histogram estimate is not a very good way to estimate
densities, especially when there are many features. It leads
to discontinuous density estimates.
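As a concrete illustration, here is a minimal NumPy sketch of a one-dimensional histogram density estimate; the data and the bin count are my own toy choices, not from the slides:

```python
import numpy as np

# Draw n i.i.d. samples from some density (here a standard normal).
rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)

# Divide the feature space into m equal-sized bins.
m = 20
edges = np.linspace(samples.min(), samples.max(), m + 1)
counts, _ = np.histogram(samples, bins=edges)

# Histogram estimate inside bin i: p(x) = n_i / (n * V),
# where V is the common bin volume (here, the bin width).
V = edges[1] - edges[0]
p_hat = counts / (len(samples) * V)

# The estimate integrates to 1: the sum over bins of p_hat * V equals 1.
print(p_hat.sum() * V)
```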

[Figures: examples of histogram density estimates]
Density estimation







In words: place a segment of length h at x and count the points that fall inside it. With k_N of the N total points in the interval [x − h/2, x + h/2],

P ≈ k_N / N  and  p̂(x) ≈ (1/h)(k_N / N)  for |x' − x| ≤ h/2.

If p(x) is continuous, then p̂(x) → p(x) as N → ∞, provided that h_N → 0, k_N → ∞ and k_N / N → 0.
Density estimation
Consider the problem of selecting k labeled balls without replacement from an urn containing n balls.
In how many different ways may we select those k balls? There are

C(n, k) = n! / (k! (n − k)!)

combinations (subsets) containing k elements, where n! = n(n − 1) ⋯ 2 · 1 and 0! = 1.
Density estimation
Assume that we wish to estimate the value of the density function p at x based on training samples x_1, ..., x_n.
If we consider a region R around x, then the probability that a training sample x_j falls in the region R is

P = ∫_R p(x') dx'.    ----- (1)

P is an averaged version of the density function p(x).
Assume that k of the n training samples fall in the region R. The probability that exactly k of the samples fall into R is given by the binomial law

P_k = C(n, k) P^k (1 − P)^(n−k).    ----- (2)
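As a quick numeric illustration of equation (2), a short sketch (the sample size, count and region probability below are made up):

```python
from math import comb

n, k, P = 100, 5, 0.04  # hypothetical values: n samples, region probability P
P_k = comb(n, k) * P**k * (1 - P)**(n - k)
print(P_k)  # probability that exactly k of the n samples fall in R (about 0.16)
```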
Density estimation
The expected value of k is

E[k] = n P.    ----- (3)

Estimating E[k] by the observed k leads to the estimate P ≈ k/n. The ratio k/n is therefore a good estimate for the probability P, and hence for the density function p.
Assume p(x) is continuous and that the region R is so small that p does not vary significantly within it; then we can write

∫_R p(x') dx' ≈ p(x) V,    ----- (4)

where x is a point within R and V is the volume enclosed by R.
Combining equations (1), (3) and (4):

p̂(x) ≈ (k/n) / V.    ----- (5)
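A quick numerical sanity check of equation (5), with made-up data: estimate a standard normal density at x = 0 by counting samples in a small region around it.

```python
import numpy as np

rng = np.random.default_rng(1)
x, h = 0.0, 0.2                # evaluation point and region width (volume V = h)
samples = rng.standard_normal(100_000)

k = np.sum(np.abs(samples - x) <= h / 2)  # number of samples falling in R
p_est = (k / len(samples)) / h            # equation (5): p(x) ~ (k/n) / V
print(p_est)                              # close to 1/sqrt(2*pi) ~ 0.399
```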
Density estimation
Variance: the smaller h is, the higher the variance of the estimate.

[Figures: estimates with h = 0.1, N = 1000 and with h = 0.8, N = 1000]
Density estimation

[Figure: estimate with h = 0.1, N = 10000]

The larger N is, the better the accuracy.
[Figure: density estimation example]
CURSE OF DIMENSIONALITY
In all the methods so far, we saw that the higher the number of points N, the better the resulting estimate.
If in a one-dimensional space an interval filled with N points is adequate (for good estimation), then in a two-dimensional space the corresponding square will require N^2 points, and in an ℓ-dimensional space the ℓ-dimensional hypercube will require N^ℓ points.
This exponential increase in the number of necessary points is known as the curse of dimensionality. It is a major problem one is confronted with in high-dimensional spaces.
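To see the growth concretely (toy numbers of my own): if N = 100 points suffice in one dimension, the ℓ-dimensional analogue needs N^ℓ.

```python
N = 100
for dim in (1, 2, 3, 5, 10):
    # points needed grow exponentially with the dimension
    print(dim, N**dim)
```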
Density estimation
Condition for convergence
The fraction k/(nV) is a space-averaged value of p(x); p(x) itself is obtained only in the limit V → 0.

If V → 0 with n fixed and no samples are included in R (k = 0), then

lim_{V→0, k=0} p̂(x) = 0:

it is an uninteresting case!

If V → 0 and one or more samples coincide with x (k ≠ 0), then

lim_{V→0, k≠0} p̂(x) = ∞.

In this case the estimate diverges: it is an uninteresting case!
Density estimation
The volume V needs to approach 0 anyway if we want p̂(x) to converge to p(x).

In practice, V cannot be allowed to become arbitrarily small, since the number of samples is always limited.

One will therefore have to accept a certain amount of variance in the ratio k/n.

Unlimited number of samples
Theoretically, if an unlimited number of samples is available, we can circumvent this difficulty.
To estimate the density at x, we form a sequence of regions R_1, R_2, ... containing x: the first region contains one sample, the second two samples, and so on.
Let V_n be the volume of R_n, k_n the number of samples falling in R_n, and p_n(x) the n-th estimate for p(x):

p_n(x) = (k_n / n) / V_n    ----- (7)
Unlimited number of samples
For p_n(x) to converge to p(x),

lim_{n→∞} p_n(x) = p(x),

three conditions are required:

1. lim_{n→∞} V_n = 0
2. lim_{n→∞} k_n = ∞
3. lim_{n→∞} k_n / n = 0

The first condition assures that the space-averaged P/V converges to p(x).
The second condition assures that the frequency ratio converges to the probability P.
The third condition states that the number of samples falling in region R_n, although it grows, remains a negligible fraction of the total number of samples; this is required for p_n(x) to converge.
Unlimited number of samples
There are two ways of obtaining sequences of regions that satisfy these constraints, so that p_n(x) → p(x):

1. Shrink an initial region by specifying V_n as some function of n, e.g. V_n = 1/√n. This is called the Parzen-window estimation method.

2. Specify k_n as some function of n, e.g. k_n = √n, and let V_n grow until it encloses the k_n nearest neighbors of x. This is called the k_n-nearest-neighbor estimation method.
Parzen windows - An example
Assume that the region R_n is a d-dimensional hypercube.
If h_n is the length of a side of the hypercube, its volume is given by V_n = h_n^d.

We can obtain an analytic expression for k_n, the number of samples falling into the hypercube, by defining the following window function:

φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, ..., d, and φ(u) = 0 otherwise.

φ((x − x_i)/h_n) is then equal to unity if x_i falls within the hypercube of volume V_n centered at x, and equal to zero otherwise.
Parzen windows - An example
The number of samples in this hypercube is therefore given by

k_n = Σ_{i=1}^{n} φ((x − x_i)/h_n),

since φ((x − x_i)/h_n) = 1 if x_i falls in the hypercube of volume V_n centered at x, and φ((x − x_i)/h_n) = 0 otherwise.

By substituting k_n into equation (7), we obtain the following estimate:

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n)

p_n(x) estimates p(x) as an average of functions of x and the samples x_i (i = 1, ..., n). These functions can be general!


Parzen windows

The Parzen-window density estimate using n training samples and the window function φ is defined by

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n)

The estimate p_n(x) is an average of (window) functions. Usually the window function has its maximum at the origin and its values become smaller as we move further away from the origin. Each training sample then contributes to the estimate in accordance with its distance from x.
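To make the formula concrete, here is a minimal NumPy sketch of a one-dimensional Parzen-window estimator; the Gaussian window and the demo data are my own choices, not from the slides:

```python
import numpy as np

def parzen_estimate(x, samples, h):
    # p_n(x) = (1/n) * sum_i (1/h) * phi((x - x_i) / h), using a Gaussian
    # window phi(u) = exp(-u**2 / 2) / sqrt(2 * pi).
    u = (x - samples) / h
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return np.mean(phi / h)

rng = np.random.default_rng(2)
samples = rng.standard_normal(500)           # training set drawn from N(0, 1)
print(parzen_estimate(0.0, samples, h=0.3))  # near 1/sqrt(2*pi) ~ 0.399
```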
Illustration

The behavior of the Parzen-window method: case where p(x) ~ N(0, 1).
Let φ(u) = (1/√(2π)) exp(−u²/2) and h_n = h_1/√n (n > 1), where h_1 is a known parameter. Thus

p_n(x) = (1/n) Σ_{i=1}^{n} (1/h_n) φ((x − x_i)/h_n)

is an average of normal densities centered at the samples x_i.
Parzen windows

Numerical results: for n = 1 and h_1 = 1,

p_1(x) = φ(x − x_1) = (1/√(2π)) exp(−(x − x_1)²/2) ~ N(x_1, 1).

For n = 10 and h = 0.1, the contributions of the individual samples are clearly observable!
[Figures: Parzen-window estimates of N(0, 1) for various n and h_1]

Analogous results are also obtained in two dimensions, as illustrated:

[Figures: two-dimensional Parzen-window estimates]
Case where p(x) = w_1 U(a, b) + w_2 T(c, d) (unknown density): a mixture of a uniform and a triangle density with mixing weights w_1 and w_2.
[Figures: Parzen-window estimates of the mixture density for various n and h]
Parzen windows example
Question: given the five data points x_1 = 2, x_2 = 2.5, x_3 = 3, x_4 = 1 and x_5 = 6, find the Parzen probability density function (pdf) estimate at x = 3, using the Gaussian function with σ = 1 as the window function.

Parzen windows example

Contribution of each sample (Gaussian window, σ = 1):

x_1: (1/√(2π)) exp(−(2 − 3)²/2) = 0.2420
x_2: (1/√(2π)) exp(−(2.5 − 3)²/2) = 0.3521
x_3: (1/√(2π)) exp(−(3 − 3)²/2) = 0.3989
x_4: (1/√(2π)) exp(−(1 − 3)²/2) = 0.0540
x_5: (1/√(2π)) exp(−(6 − 3)²/2) = 0.0044

p(x = 3) = (0.2420 + 0.3521 + 0.3989 + 0.0540 + 0.0044) / 5 = 0.2103
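The arithmetic can be checked with a few lines of Python:

```python
import numpy as np

points = np.array([2.0, 2.5, 3.0, 1.0, 6.0])
x = 3.0

# Gaussian window contributions phi(x - x_i) with sigma = 1
phi = np.exp(-0.5 * (x - points)**2) / np.sqrt(2 * np.pi)
print(np.round(phi, 4))  # [0.242  0.3521 0.3989 0.054  0.0044]
print(phi.mean())        # 0.2103, matching the slide
```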
Parzen windows notes
The window width h_n (equivalently, the volume V_n) is the most critical parameter in the Parzen-window approach.
It is typically selected by cross-validation: a portion of the training set is held out to form a validation set.
The classifier is trained using different values of h_n, and the h_n that results in the smallest error on the validation set is selected as the best one.
This technique is commonly used with algorithms that have parameters to tune.
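For illustration, one possible scikit-learn sketch of bandwidth selection by cross-validation (assuming scikit-learn is available; the parameter grid and data are my own):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 1))  # training samples, shape (n, d)

# Try several window widths; the CV score is the held-out log-likelihood.
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.linspace(0.05, 1.0, 20)},
                    cv=5)
grid.fit(X)
print(grid.best_params_["bandwidth"])  # the h with the best validation score
```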
Classification example
In classifiers based on Parzen-window estimation:

We estimate the densities for each category and
classify a test point by the label corresponding to the
maximum posterior

The decision region for a Parzen-window classifier
depends upon the choice of window function as
illustrated in the following figure.
[Figure: decision regions of a Parzen-window classifier for different window functions]
Example of Nearest Neighbor Rule
Two-class problem: yellow triangles and blue squares. The circle represents the unknown sample x; since its nearest neighbor comes from class ω_1, it is labeled as class ω_1.
Example of the k-NN rule with k = 3

There are two classes: yellow triangles and blue squares. The circle represents the unknown sample x; since two of its three nearest neighbors come from class ω_2, it is labeled as class ω_2.

The number k should be:
1) large, to minimize the probability of misclassifying x;
2) small (with respect to the number of samples), so that the points are close enough to x to give an accurate estimate of its true class.
k_n-Nearest Neighbor Estimation

Goal: a solution to the problem of the unknown best window function.
Let the cell volume be a function of the training data: center a cell about x and let it grow until it captures k_n samples, where k_n = f(n). These samples are called the k_n nearest neighbors of x.

Two possibilities can occur:
- If the density is high near x, the cell will be small, which provides good resolution.
- If the density is low, the cell will grow large, stopping only when higher-density regions are reached.

We can obtain a family of estimates by setting k_n = k_1 √n and choosing different values for k_1.
Illustration

For k_n = √n = 1 (i.e. n = 1), the estimate becomes

p_n(x) = k_n / (n V_n) = 1 / V_1 = 1 / (2 |x − x_1|)
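Here is a minimal NumPy sketch of the k_n-nearest-neighbor density estimate in one dimension (a toy implementation of the slide's definition; the names and data are my own):

```python
import numpy as np

def knn_density(x, samples, k):
    # Grow an interval around x until it holds k samples, then return
    # p(x) = k / (n * V), with V the length of that interval.
    dists = np.sort(np.abs(samples - x))
    V = 2 * dists[k - 1]  # interval just enclosing the k-th nearest neighbor
    return k / (len(samples) * V)

rng = np.random.default_rng(4)
samples = rng.standard_normal(1000)
k = int(np.sqrt(len(samples)))       # k_n = sqrt(n)
print(knn_density(0.0, samples, k))  # near 1/sqrt(2*pi) ~ 0.399
```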
[Figures: k_n-nearest-neighbor density estimates, e.g. k = 10, N = 200]
k_n-Nearest Neighbor Estimation

Estimation of a posteriori probabilities.
Goal: estimate P(ω_i | x) from a set of n labeled samples.
Place a cell of volume V around x and capture k samples. If k_i samples among the k turn out to be labeled ω_i, then

p_n(x, ω_i) = (k_i / n) / V,

and an estimate for P_n(ω_i | x) is

P_n(ω_i | x) = p_n(x, ω_i) / Σ_{j=1}^{c} p_n(x, ω_j) = k_i / k.
k_n-Nearest Neighbor Estimation

k_i / k is the fraction of the samples within the cell that are labeled ω_i.

For minimum error rate, the most frequently represented category within the cell is selected.

If k is large and the cell sufficiently small, the performance will approach the best possible.
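In code, the rule "select the most frequent label among the k nearest samples" might look like this (a sketch with invented toy data):

```python
import numpy as np
from collections import Counter

def knn_posteriors(x, X, y, k):
    # Estimate P(w_i | x) as k_i / k over the k nearest labeled samples.
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    counts = Counter(y[idx])
    return {label: c / k for label, c in counts.items()}

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)  # class labels w_1 and w_2

post = knn_posteriors(np.array([2.5, 2.5]), X, y, k=7)
print(post, max(post, key=post.get))  # posteriors and the selected class
```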
Nearest Neighbor

The nearest neighbor rule:
Let D_n = {x_1, x_2, ..., x_n} be a set of n labeled prototypes, and let x' ∈ D_n be the prototype closest to a test point x.
The nearest-neighbor rule for classifying x is to assign it the label associated with x'.
Nearest Neighbor

The nearest-neighbor rule is a sub-optimal procedure: it leads to an error rate greater than the minimum possible, the Bayes rate.
However, if the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate (this can be demonstrated!).
As n → ∞, it is always possible to find an x' sufficiently close to x so that

P(ω_i | x') ≈ P(ω_i | x).
Bounds on the Error Rate of the k-Nearest Neighbor Rule

As k gets larger, the error rate approaches the Bayes rate; at the same time, k should remain a small fraction of the total number of samples.
Nearest Neighbor

If P(ω_m | x) ≈ 1, then the nearest-neighbor selection is almost always the same as the Bayes selection.
Nearest Neighbor

The nearest-neighbor classifier effectively partitions the feature space into cells, each consisting of all points closer to a given training point x' than to any other training point.
All points in such a cell are thus labeled by the category of that training point: a Voronoi tessellation of the space.
The k Nearest-Neighbor Rule
Classify x by assigning it the label most frequently represented among its k nearest samples, i.e. use a voting scheme.
Example

Example: k = 3 (odd value) and x = (0.10, 0.25)^t.

Prototypes and labels:
(0.15, 0.35) : ω_1
(0.10, 0.28) : ω_2
(0.09, 0.30) : ω_5
(0.12, 0.20) : ω_2

The closest vectors to x, with their labels, are:
{(0.10, 0.28, ω_2); (0.12, 0.20, ω_2); (0.15, 0.35, ω_1)}

One voting scheme assigns the label ω_2 to x, since ω_2 is the most frequently represented.