
Kernel Methods for Clustering

Francesco Camastra
camastra@ieee.org
DISI, Università di Genova
Talk Outline
Preliminaries (Unsupervised Learning, Kernel Methods)
Kernel Methods for Clustering
Experimental Results
Conclusions and Future Work
The Learning Problem
The learning problem can be described as finding a general rule (description) that explains the data, given only a sample of limited size.
Learning Algorithms
Learning algorithms can be grouped into three main families:
Supervised algorithms
Reinforcement Learning algorithms
Unsupervised algorithms
Supervised Algorithms
If the data is a sample of input-output patterns, a data description is a function that produces the output given the input.
The learning is called supervised because target values
(e.g. classes, real values) are associated with the data.
Unsupervised Algorithms
If the data is only a sample of objects without associated target values, the
problem is known as unsupervised learning.
A data description can be:
a set of clusters, or a probability density function stating the probability of observing a certain object in the future;
a manifold that contains all the data without information loss (manifold learning).
Kernel Methods
Kernel Methods are algorithms that implicitly perform a nonlinear mapping of the input data to a high-dimensional Feature Space, by replacing the inner product with an appropriate Mercer Kernel.
Mercer Kernel
We call the kernel G a Mercer kernel (or positive definite kernel) if and only if it is symmetric (i.e. $G(x, y) = G(y, x)\ \forall x, y \in X$) and
$$\sum_{j,k=1}^{n} c_j c_k\, G(x_j, x_k) \geq 0$$
for all $n \geq 2$, $x_1, \ldots, x_n \in X$ and $c_1, \ldots, c_n \in \mathbb{R}$.
Each kernel G(x, y) can be represented as:
$$G(x, y) = \Phi(x) \cdot \Phi(y), \qquad \Phi : X \rightarrow T.$$
T is called the Feature Space. If $\Phi$ is known, the mapping is explicit, otherwise it is implicit.
Mercer Kernel Examples
Square Kernel: $S(x, y) = (x \cdot y)^2$
The square kernel mapping is explicit.
If we consider $x = (x_1, x_2)$ and $y = (y_1, y_2)$ we have:
$$S(x, y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2$$
Therefore $\Phi$ is:
$$\Phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2) : \mathbb{R}^2 \rightarrow \mathbb{R}^3$$
(in the n-dimensional case $x \in \mathbb{R}^n$, $\Phi : \mathbb{R}^n \rightarrow \mathbb{R}^{\frac{n(n+1)}{2}}$)
Gaussian Kernel: $G(x, y) = \exp\left(-\frac{\|x - y\|^2}{\sigma^2}\right)$
The Gaussian kernel mapping is implicit.
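To make the explicit/implicit distinction concrete, here is a small numpy check (the function names are illustrative, not part of the talk) that the explicit map $\Phi$ of the square kernel reproduces the kernel value; the Gaussian kernel, whose map is implicit, is simply evaluated directly:

```python
import numpy as np

def square_kernel(x, y):
    """S(x, y) = (x . y)^2, the square kernel."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map of the square kernel for 2-d inputs."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def gaussian_kernel(x, y, sigma=1.0):
    """G(x, y) = exp(-||x - y||^2 / sigma^2); its feature map is implicit."""
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# The inner product in Feature Space equals the kernel value.
assert np.isclose(np.dot(phi(x), phi(y)), square_kernel(x, y))
```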
Distance in Feature Space
Given two vectors x and y, recall that $G(x, y) = \Phi(x) \cdot \Phi(y)$.
Therefore it is always possible to compute their distance in the Feature Space:
$$\begin{aligned}
\|\Phi(x) - \Phi(y)\|^2 &= (\Phi(x) - \Phi(y)) \cdot (\Phi(x) - \Phi(y)) \\
&= \Phi(x) \cdot \Phi(x) - 2\,\Phi(x) \cdot \Phi(y) + \Phi(y) \cdot \Phi(y) \\
&= G(x, x) - 2\,G(x, y) + G(y, y)
\end{aligned}$$
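This identity is the only ingredient the algorithms below need: distances in the Feature Space can be computed from kernel evaluations alone. A minimal numpy sketch (helper names are illustrative):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def feature_space_distance(x, y, kernel):
    """||Phi(x) - Phi(y)|| computed without knowing Phi explicitly."""
    d2 = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return np.sqrt(max(d2, 0.0))  # guard against tiny negative round-off

x, y = np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(feature_space_distance(x, y, gaussian_kernel))
```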
Clustering Methods
Following Jain's approach (Jain et al., 1999), clustering methods can be categorized into:
Hierarchical Clustering
Hierarchical schemes sequentially build nested clusters with a graphical
representation (dendrogram).
Partitioning Clustering
Partitioning methods directly assign all the data points, according to some appropriate criterion (e.g. similarity or density), to different groups (clusters).
The research presented here has focused on prototype-based clustering algorithms, the most popular class of Partitioning Clustering Methods.
Clustering: Some Definitions
Let D be a data set of cardinality m, formed by vectors of $\mathbb{R}^n$. The Codebook is the set $W = \{w_1, w_2, \ldots, w_K\}$ where each element (codevector) $w_c \in \mathbb{R}^n$ and $K \ll m$.
The Voronoi set $V_c$ of the codevector $w_c$ is the set of all vectors in D for which $w_c$ is the nearest codevector:
$$V_c = \{\xi \in D \mid c = \arg\min_j \|\xi - w_j\|\}$$
A codebook is optimal if it minimizes the quantization error J:
$$J = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{\xi \in V_c} \|\xi - w_c\|^2$$
where |D| is the cardinality of D.
K-Means (Lloyd, 1957)
1. Initialize the codebook $W = \{w_1, w_2, \ldots, w_K\}$ with K vectors chosen from the data set D.
2. Compute for each codevector $w_i \in W$ its Voronoi set $V_i$:
$$V_i = \{\xi \in D \mid i = \arg\min_j \|\xi - w_j\|\}$$
3. Move each codevector $w_i$ to the mean of its Voronoi set:
$$w_i = \frac{1}{|V_i|} \sum_{\xi \in V_i} \xi$$
4. Go to step 2 if any codevector $w_i$ has changed, otherwise return the codebook.
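A minimal numpy sketch of the batch procedure above, assuming data is an array of shape (m, n); the initialization and stopping test are illustrative choices:

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Batch K-Means (Lloyd): alternate Voronoi assignment and mean update."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # Step 2: Voronoi sets, i.e. nearest-codevector assignment.
        dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each codevector to the mean of its Voronoi set.
        new_codebook = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i) else codebook[i]
            for i in range(k)
        ])
        # Step 4: stop when no codevector changes.
        if np.allclose(new_codebook, codebook):
            break
        codebook = new_codebook
    return codebook, labels
```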
Kernel Methods for Clustering
Methods that kernelise the metric (Yu et al. 2002), i.e. the metric is
computed by means of a Mercer Kernel in a Feature Space.
Kernel K-Means (Girolami, 2002)
Kernel methods based on support vector data description (Camastra and
Verri, 2005)
Kernelising the metric
Given $x, y \in \mathbb{R}^n$, the metric $d_G(x, y)$ in the Feature Space is:
$$d_G(x, y) = \|\Phi(x) - \Phi(y)\| = \left(G(x, x) - 2\,G(x, y) + G(y, y)\right)^{\frac{1}{2}}.$$
Given a dataset $D = \{x_i \in \mathbb{R}^n,\ i = 1, 2, \ldots, m\}$, the goal is to get a codebook $W = \{w_i \in \mathbb{R}^n,\ i = 1, 2, \ldots, K\}$ that minimizes the quantization error E(D):
$$E(D) = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{x_i \in V_c} \|x_i - w_c\|^2$$
where $V_c$ is the Voronoi set of the codevector $w_c$.
Hence we can compute the metric in the Feature Space, i.e. we have:
$$E_G(D) = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{x_i \in V_c} \left[ G(x_i, x_i) - 2\,G(x_i, w_c) + G(w_c, w_c) \right]$$
Kernelising the metric (cont.)
A naive solution to minimize $E_G(D)$ consists in computing $\frac{\partial E_G(D)}{\partial w_c}$ and using a steepest gradient descent algorithm. Hence some classical clustering algorithms can be kernelised. For instance, consider online K-Means. Its learning rule is
$$\Delta w_c = \eta\,(\xi - w_c) = -\eta\,\frac{\partial E(D)}{\partial w_c}$$
where $\xi$ is the input vector and $w_c$ is the winner codevector for $\xi$. Hence it can be rewritten as:
$$\Delta w_c = -\eta\,\frac{\partial E_G(D)}{\partial w_c}$$
In the case of $G(x, y) = \exp\left(-\frac{\|x - y\|^2}{\sigma^2}\right)$ the equation becomes
$$\Delta w_c = \eta\,(\xi - w_c)\,\frac{2}{\sigma^2}\,\exp\left(-\frac{\|\xi - w_c\|^2}{\sigma^2}\right)$$
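A small numpy sketch of one such online step, assuming the reconstructed update rule above; eta and sigma are illustrative values. With a Gaussian kernel the feature-space distance is monotonic in the Euclidean one, so the winner codevector can be found in input space:

```python
import numpy as np

def online_kernel_kmeans_step(xi, codebook, eta=0.1, sigma=1.0):
    """One online update: move the winner codevector along the kernelised gradient."""
    c = np.linalg.norm(codebook - xi, axis=1).argmin()   # winner codevector for xi
    diff = xi - codebook[c]
    factor = (2.0 / sigma ** 2) * np.exp(-np.dot(diff, diff) / sigma ** 2)
    codebook[c] = codebook[c] + eta * diff * factor      # Delta w_c from the rule above
    return codebook
```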
One-Class SVM
One-Class SVM (1SVM) (Tax and Duin, 1999) (Schölkopf et al., 2001) searches for the hypersphere in the Feature Space T with centre a and minimal radius R containing most of the data. The problem can be expressed as:
$$\min_{R, a, \xi}\ R^2 + C \sum_i \xi_i$$
subject to $\|\Phi(x_i) - a\|^2 \leq R^2 + \xi_i$ and $\xi_i \geq 0$, $i = 1, \ldots, m$,
where $x_1, \ldots, x_m$ is the data set. To solve the problem the Lagrangian $L$ is introduced:
$$L = R^2 - \sum_j \alpha_j \left( R^2 + \xi_j - \|\Phi(x_j) - a\|^2 \right) - \sum_j \beta_j \xi_j + C \sum_j \xi_j$$
where $\alpha_j \geq 0$ and $\beta_j \geq 0$ are Lagrange multipliers and C is a constant.
1SVM (cont.)
Setting to zero the derivatives of $L$ w.r.t. R, a and $\xi_j$ and substituting back, we turn the Lagrangian into the Wolfe dual form:
$$W = \sum_j \alpha_j\, \Phi(x_j) \cdot \Phi(x_j) - \sum_{i,j} \alpha_i \alpha_j\, \Phi(x_i) \cdot \Phi(x_j) = \sum_j \alpha_j\, G(x_j, x_j) - \sum_{i,j} \alpha_i \alpha_j\, G(x_i, x_j)$$
with $\sum_j \alpha_j = 1$ and $0 \leq \alpha_j \leq C$.
$\alpha_j = 0 \Rightarrow \Phi(x_j)$ is inside the sphere.
$\alpha_j = C \Rightarrow \Phi(x_j)$ is outside the sphere.
$0 < \alpha_j < C \Rightarrow \Phi(x_j)$ is on the surface of the sphere.
These points are called support vectors.
1SVM (cont.)
The center a is: $a = \sum_j \alpha_j\, \Phi(x_j)$.
Therefore the center position can be unknown (e.g. when the mapping $\Phi$ is implicit).
Nevertheless, the distance R(x) of a point $\Phi(x)$ from the center a can always be computed:
$$R^2(x) = G(x, x) - 2 \sum_j \alpha_j\, G(x_j, x) + \sum_{i,j} \alpha_i \alpha_j\, G(x_i, x_j)$$
The Gaussian is a usual choice for the kernel $G(\cdot)$.
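Once the multipliers $\alpha_j$ are available (however the dual was solved), $R^2(x)$ requires only kernel evaluations. A minimal numpy sketch with illustrative helper names:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def radius_squared(x, support_vectors, alphas, kernel):
    """R^2(x) = G(x,x) - 2 sum_j a_j G(x_j,x) + sum_ij a_i a_j G(x_i,x_j)."""
    k_x = np.array([kernel(sv, x) for sv in support_vectors])
    gram = np.array([[kernel(si, sj) for sj in support_vectors]
                     for si in support_vectors])
    return kernel(x, x) - 2.0 * alphas @ k_x + alphas @ gram @ alphas
```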
Camastra-Verri Algorithm: Definitions
Given a data set D, we map the data into a Feature Space T.
We consider K centers $a_i \in T$, $i = 1, \ldots, K$. We call the set $A = \{a_1, \ldots, a_K\}$ the Feature Space Codebook.
We define for each center $a_c$ its Voronoi set in Feature Space:
$$FV_c = \{x_i \in D \mid c = \arg\min_j \|\Phi(x_i) - a_j\|\}$$
Camastra-Verri Algorithm: Strategy
Our algorithm uses a K-Means-like strategy, i.e. it repeatedly moves the centers, computing a 1SVM for each center, until no center changes.
To make the algorithm more robust with respect to outliers, the 1SVM is computed on a restricted set $FV_c(\rho)$ for each center $a_c$:
$$FV_c(\rho) = \{x_i \in FV_c : \|\Phi(x_i) - a_c\| < \rho\}$$
$FV_c(\rho)$ can be seen as the Voronoi set in the Feature Space of the center $a_c$ without outliers.
The parameter $\rho$ can be set using model selection techniques.
The Algorithm
1. Project the data set D into a Feature Space T by means of a nonlinear mapping $\Phi$. Initialize the centers $a_c \in T$, $c = 1, \ldots, K$.
2. Compute for each center $a_c$ the set $FV_c(\rho)$.
3. Apply 1SVM to each $FV_c(\rho)$ and assign to $a_c$ the center it yields, i.e. $a_c = 1SVM(FV_c(\rho))$.
4. Go to step 2 if any $a_c$ has changed.
5. Return the Feature Space codebook.
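A sketch of the outer loop under stated assumptions: solve_1svm is a hypothetical placeholder for a full SVDD solver returning the multipliers $\alpha$; each center is kept implicitly as a pair (support points, $\alpha$) since $a_c = \sum_j \alpha_j \Phi(x_j)$; the initialization and the fixed iteration count are illustrative simplifications of the stopping rule above:

```python
import numpy as np

def dist2_to_center(x, sv, alphas, kernel):
    """||Phi(x) - a_c||^2 for a center a_c = sum_j alphas_j Phi(sv_j)."""
    k_x = np.array([kernel(s, x) for s in sv])
    gram = np.array([[kernel(si, sj) for sj in sv] for si in sv])
    return kernel(x, x) - 2.0 * alphas @ k_x + alphas @ gram @ alphas

def camastra_verri(data, k, kernel, rho, solve_1svm, max_iter=20, seed=0):
    """K-Means-like loop: 1SVM on each outlier-filtered Feature Space Voronoi set."""
    rng = np.random.default_rng(seed)
    centers = [(data[i:i + 1], np.array([1.0]))        # start from single data points
               for i in rng.choice(len(data), size=k, replace=False)]
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # Step 2: Feature Space Voronoi assignment, then outlier filter with radius rho.
        d2 = np.array([[dist2_to_center(x, sv, al, kernel) for (sv, al) in centers]
                       for x in data])
        labels = d2.argmin(axis=1)
        new_centers = []
        for c, (sv, al) in enumerate(centers):
            members = data[labels == c]
            inliers = np.array([x for x in members
                                if np.sqrt(dist2_to_center(x, sv, al, kernel)) < rho])
            if len(inliers) == 0:                      # keep the old center if FV_c(rho) is empty
                new_centers.append((sv, al))
                continue
            # Step 3: a_c = 1SVM(FV_c(rho)); solve_1svm returns the alpha multipliers.
            new_centers.append((inliers, solve_1svm(inliers, kernel)))
        centers = new_centers
    return centers, labels
```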
Kernel K-Means (Girolami, 2002)
1. Project the data set D into a Feature Space T by means of a nonlinear mapping $\Phi$. Initialize the centers $a_c \in T$, $c = 1, \ldots, K$.
2. Compute for each center $a_c$ its Feature Space Voronoi set $FV_c$.
3. Move each center $a_i$ to the mean of its Feature Space Voronoi set:
$$a_i = \frac{1}{|FV_i|} \sum_{x \in FV_i} \Phi(x)$$
4. Go to step 2 if any $a_c$ has changed, otherwise return the Feature Space codebook.
Kernel K-Means (cont.)
It works even if we do not know $\Phi$.
We are always able to compute the distance of any point $\Phi(x)$ from any centroid $a_c$. After some maths, we have:
$$\|\Phi(x) - a_c\|^2 = R^2(x) = G(x, x) - 2 \sum_j \alpha_j\, G(x_j, x) + \sum_{i,j} \alpha_i \alpha_j\, G(x_i, x_j)$$
where $\alpha_j = \frac{1}{|FV_c|}$ if $x_j \in FV_c$ and $\alpha_j = 0$ otherwise, since $a_c$ is the mean of its Feature Space Voronoi set.
Hence, even if we do not know $\Phi$, we are always able to compute the Feature Space Voronoi sets.
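A self-contained numpy sketch of Kernel K-Means based on the distance identity above, with centroid weights $1/|FV_c|$; the random initialization of the Voronoi sets is an illustrative choice:

```python
import numpy as np

def kernel_kmeans(data, k, kernel, max_iter=100, seed=0):
    """Kernel K-Means: cluster in Feature Space using only kernel evaluations."""
    m = len(data)
    gram = np.array([[kernel(xi, xj) for xj in data] for xi in data])
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=m)            # random initial Voronoi sets
    for _ in range(max_iter):
        d2 = np.empty((m, k))
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:                  # empty cluster: never the nearest
                d2[:, c] = np.inf
                continue
            # ||Phi(x) - a_c||^2 = G(x,x) - 2 mean_j G(x,x_j) + mean_ij G(x_i,x_j)
            d2[:, c] = (np.diag(gram)
                        - 2.0 * gram[:, members].mean(axis=1)
                        + gram[np.ix_(members, members)].mean())
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):     # stop when no point changes cluster
            break
        labels = new_labels
    return labels
```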
Experiments with Camastra-Verri algorithm
Synthetic Data Set (Delta Set)
Iris Data (Fisher, 1936)
Wisconsin breast cancer database
(Wolberg and Mangasarian, 1990)
Spam data
K-Means (Lloyd, 1957), Self Organizing Map (Kohonen, 1982), Neural Gas (Martinetz et al., 1992), the Ng-Jordan algorithm (Ng et al., 2001) and our algorithm have been tried.
Delta Set: K-Means
[Figure: final K-Means codevectors on the Delta Set]
Delta Set: Our Algorithm
[Figure: final codevectors found by our algorithm on the Delta Set]
Delta Set: Our Algorithm (iterations I-VI)
[Figures: evolution of the codevectors over six iterations of our algorithm on the Delta Set]
Iris Data
Iris data is formed by 150 data points of three different classes. One class
(setosa) is linearly separable from the other two (versicolor, virginica), but
the other two are not linearly separable from each other.
The Iris data dimension is 4.
K-Means, Self Organizing Map (SOM), Neural Gas, the Ng-Jordan algorithm and our algorithm have been tried.
Experiments have been performed using three codevectors.
Iris Data
[Figure: Iris data projected onto two dimensions]
Iris Data: K-Means
[Figure: K-Means codevectors on the Iris data]
Iris Data: Camastra-Verri algorithm
[Figure: codevectors found by the Camastra-Verri algorithm on the Iris data]
Iris data: Results
model                 Points Classified Correctly
SOM                   121.5 ± 1.5 (81.0%)
K-Means               133.5 ± 0.5 (89.0%)
Neural Gas            137.5 ± 1.5 (91.7%)
Ng-Jordan Algorithm   126.5 ± 7.5 (84.3%)
Our Algorithm         142 ± 1 (94.7%)
Average performances of the Ng-Jordan algorithm, SOM, K-Means, Neural Gas and our algorithm on the Iris data. The results have been obtained using twenty different runs for each algorithm.
Wisconsin Data
Wisconsin breast cancer data is formed by 699 patterns (patients) of two
different classes. The classes are not linearly separable from each other.
The database considered in the experiments has 683 samples since we
have removed 16 patterns with missing values.
The Wisconsin data dimension is 9. K-Means, Self Organizing Map (SOM), Neural Gas, the Ng-Jordan algorithm and our algorithm have been tried.
Experiments have been performed using two codevectors.
Wisconsin database: Results
model                 Points Classified Correctly
K-Means               656.5 ± 0.5 (96.1%)
Neural Gas            656.5 ± 0.5 (96.1%)
SOM                   660.5 ± 0.5 (96.7%)
Ng-Jordan Algorithm   652 ± 2 (95.5%)
Our Algorithm         662.5 ± 0.5 (97.0%)
Average performances of the Ng-Jordan algorithm, SOM, K-Means, Neural Gas and our algorithm on the Wisconsin breast cancer database.
The results have been obtained using twenty different runs for each algorithm.
Spam Data
Spam Data is formed by 1534 patterns of two different classes (spam and
not-spam). The classes are not linearly separable from each other.
The Spam data dimension is 57.
K-Means, Self Organizing Map (SOM), Neural Gas, the Ng-Jordan algorithm and our algorithm have been tried.
Experiments have been performed using two codevectors.
Spam data: Results
model                 Points Classified Correctly
K-Means               1083 ± 153 (70.6%)
Neural Gas            1050 ± 120 (68.4%)
SOM                   1210 ± 30 (78.9%)
Ng-Jordan Algorithm   929 ± 0 (60.6%)
Our Algorithm         1247 ± 3 (81.3%)
Average performances of the Ng-Jordan algorithm, SOM, K-Means, Neural Gas and our algorithm on the Spam data.
The results have been obtained using twenty different runs for each algorithm.
Conclusions and Future Work
Our algorithm performs better than K-Means, SOM, Neural Gas and the Ng-Jordan algorithm on a synthetic data set and three UCI benchmarks (Iris data, Wisconsin Breast Cancer Database, Spam Database).
Future efforts will be devoted to the application of our algorithm to computer vision problems (e.g. color image segmentation).
At present we are investigating how kernel methods can be generalized in terms of fuzzy logic (kernel-fuzzy methods).
Finally, experimental comparisons between our algorithm and Girolami's algorithm are in progress.
Dedication
To my mother, Antonia Nicoletta Corbascio, in the most difficult moment of her life.
