
Lecture 1: Introduction to RKHS

Gatsby Unit, CSML, UCL
Columbia, 2014
April 29, 2014

Outline:
- Feature space
- Basics of reproducing kernel Hilbert spaces
- Kernel Ridge Regression

Kernels and feature space (1): XOR example

[Figure: XOR data in the $(x_1, x_2)$ plane.]

No linear classifier separates red from blue.

Map points to a higher dimensional feature space:
$$\phi(x) = \begin{bmatrix} x_1 & x_2 & x_1 x_2 \end{bmatrix}^\top \in \mathbb{R}^3.$$
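
A quick illustration in code (a minimal sketch, not from the slides; data generation is illustrative): once the product feature $x_1 x_2$ is added, the sign of that single coordinate separates the four XOR clusters.

```python
import numpy as np

# XOR-style data: labels depend on the sign of x1*x2,
# so no linear separator exists in R^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + rng.choice([-3, 3], size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])

def phi(X):
    """Feature map from the slides: (x1, x2) -> (x1, x2, x1*x2)."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# In feature space, the linear function f(x) = <w, phi(x)> with
# w = (0, 0, 1) classifies the clusters perfectly.
w = np.array([0.0, 0.0, 1.0])
print(np.mean(np.sign(phi(X) @ w) == y))  # 1.0
```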

Kernels and feature space (2): smoothing

[Figure: three regression fits to the same data, ranging from underfitted to overfitted.]

Kernel methods can control smoothness and avoid overfitting/underfitting.

Outline: reproducing kernel Hilbert space

We will describe in order:
1. Hilbert space
2. Kernel (lots of examples: e.g. you can build kernels from simpler kernels)
3. Reproducing property

Hilbert space

Definition (Inner product)
Let $\mathcal{H}$ be a vector space over $\mathbb{R}$. A function $\langle \cdot, \cdot \rangle_{\mathcal{H}} : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ is an inner product on $\mathcal{H}$ if
1. Linear: $\langle \alpha_1 f_1 + \alpha_2 f_2, g \rangle_{\mathcal{H}} = \alpha_1 \langle f_1, g \rangle_{\mathcal{H}} + \alpha_2 \langle f_2, g \rangle_{\mathcal{H}}$
2. Symmetric: $\langle f, g \rangle_{\mathcal{H}} = \langle g, f \rangle_{\mathcal{H}}$
3. $\langle f, f \rangle_{\mathcal{H}} \ge 0$ and $\langle f, f \rangle_{\mathcal{H}} = 0$ if and only if $f = 0$.

Norm induced by the inner product: $\|f\|_{\mathcal{H}} := \sqrt{\langle f, f \rangle_{\mathcal{H}}}$

Definition (Hilbert space)
Inner product space containing Cauchy sequence limits.

Hilbert space

Definition (Cauchy sequence)
A sequence $\{f_n\}_{n=1}^{\infty}$ of elements of a normed vector space $(\mathcal{F}, \|\cdot\|_{\mathcal{F}})$ is said to be a Cauchy (fundamental) sequence if for every $\epsilon > 0$ there exists $N = N(\epsilon) \in \mathbb{N}$ such that for all $n, m \ge N$, $\|f_n - f_m\|_{\mathcal{F}} < \epsilon$.

Definition (Complete space)
A metric space $\mathcal{F}$ is said to be complete if every Cauchy sequence $\{f_n\}_{n=1}^{\infty}$ in $\mathcal{F}$ converges: it has a limit, and this limit is in $\mathcal{F}$.

Complete + norm = Banach space
Complete + inner product = Hilbert space

Kernel

Definition
Let $\mathcal{X}$ be a non-empty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel if there exists an $\mathbb{R}$-Hilbert space $\mathcal{H}$ and a map $\phi : \mathcal{X} \to \mathcal{H}$ such that $\forall x, x' \in \mathcal{X}$,
$$k(x, x') := \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}.$$

Almost no conditions on $\mathcal{X}$ (e.g., $\mathcal{X}$ itself doesn't need an inner product; e.g., documents).

A single kernel can correspond to several possible feature maps. A trivial example for $\mathcal{X} := \mathbb{R}$:
$$\phi_1(x) = x \qquad \text{and} \qquad \phi_2(x) = \begin{bmatrix} x/\sqrt{2} \\ x/\sqrt{2} \end{bmatrix},$$
both of which give $k(x, x') = x x'$.
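
A quick numerical check of this (a sketch, not from the slides): both feature maps realize the same kernel value.

```python
import numpy as np

x, xp = 1.7, -0.3
phi1 = lambda t: np.array([t])
phi2 = lambda t: np.array([t / np.sqrt(2), t / np.sqrt(2)])

# Both inner products equal the kernel k(x, x') = x * x'.
print(phi1(x) @ phi1(xp), phi2(x) @ phi2(xp), x * xp)  # all -0.51
```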

New kernels from old: sums, transformations

Theorem (Sums of kernels are kernels)
Given $\alpha > 0$ and $k$, $k_1$, $k_2$ all kernels on $\mathcal{X}$, then $\alpha k$ and $k_1 + k_2$ are kernels on $\mathcal{X}$.

To prove this, just check the inner product definition. A difference of kernels may not be a kernel. (Why? If $k_1 - k_2$ were a kernel, we would need $k_1(x, x) - k_2(x, x) = \|\phi(x)\|_{\mathcal{H}}^2 \ge 0$, which can fail.)

Theorem (Mappings between spaces)
Let $\mathcal{X}$ and $\widetilde{\mathcal{X}}$ be sets, and define a map $A : \mathcal{X} \to \widetilde{\mathcal{X}}$. Define the kernel $k$ on $\widetilde{\mathcal{X}}$. Then $k(A(x), A(x'))$ is a kernel on $\mathcal{X}$.
Example: $k(x, x') = x^2 (x')^2$.

New kernels from old: products

Theorem (Products of kernels are kernels)
Given $k_1$ on $\mathcal{X}_1$ and $k_2$ on $\mathcal{X}_2$, then $k_1 \times k_2$ is a kernel on $\mathcal{X}_1 \times \mathcal{X}_2$. If $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{X}$, then $k := k_1 \times k_2$ is a kernel on $\mathcal{X}$.

Proof.
Define:
- $\mathcal{H}_1$ corresponding to $k_1(x, x') = \langle \phi_1(x), \phi_1(x') \rangle_{\mathcal{H}_1}$ (e.g. a kernel between two images)
- $\mathcal{H}_2$ corresponding to $k_2(y, y') = \langle \phi_2(y), \phi_2(y') \rangle_{\mathcal{H}_2}$ (e.g. a kernel between two captions)

Is the following a kernel? (e.g. between one image-caption pair and another)
$$K\big((x, y), (x', y')\big) = k_1(x, x')\, k_2(y, y')$$

New kernels from old: products

Proof (cont.)
Given $a \in \mathcal{H}_1$ and $b \in \mathcal{H}_2$, we define the tensor product $a \otimes b$ as a rank-one operator from $\mathcal{H}_2$ to $\mathcal{H}_1$,
$$(a \otimes b) f := \langle b, f \rangle_{\mathcal{H}_2}\, a.$$
Then $a \otimes b \in \mathrm{HS}(\mathcal{H}_2, \mathcal{H}_1)$, the space of Hilbert-Schmidt operators, with inner product
$$\langle L, M \rangle_{\mathrm{HS}} = \sum_{j \in J} \langle L f_j, M f_j \rangle_{\mathcal{H}_1},$$
where $\{f_j\}_{j \in J}$ is an ONB of $\mathcal{H}_2$. Applying the above definition, for any $L \in \mathrm{HS}(\mathcal{H}_2, \mathcal{H}_1)$,
$$\langle L, a \otimes b \rangle_{\mathrm{HS}} = \langle a, L b \rangle_{\mathcal{H}_1}.$$

New kernels from old: products

Proof (cont.)
To see this, first expand $b = \sum_{j \in J} \langle b, f_j \rangle_{\mathcal{H}_2} f_j$. Then
$$\langle a, L b \rangle_{\mathcal{H}_1} = \Big\langle a,\; L \sum_{j \in J} \langle b, f_j \rangle_{\mathcal{H}_2} f_j \Big\rangle_{\mathcal{H}_1} = \sum_{j \in J} \langle b, f_j \rangle_{\mathcal{H}_2} \langle a, L f_j \rangle_{\mathcal{H}_1},$$
and
$$\langle a \otimes b, L \rangle_{\mathrm{HS}} = \sum_{j \in J} \langle L f_j, (a \otimes b) f_j \rangle_{\mathcal{H}_1} = \sum_{j \in J} \langle b, f_j \rangle_{\mathcal{H}_2} \langle a, L f_j \rangle_{\mathcal{H}_1}.$$

New kernels from old: products

Proof (cont.)
Special case: for $u, a \in \mathcal{H}_1$ and $v, b \in \mathcal{H}_2$,
$$\langle u \otimes v, a \otimes b \rangle_{\mathrm{HS}} = \langle u, a \rangle_{\mathcal{H}_1} \langle v, b \rangle_{\mathcal{H}_2}.$$
Apply this to
$$k_1(x, x')\, k_2(y, y') = \langle \phi_1(x), \phi_1(x') \rangle_{\mathcal{H}_1} \langle \phi_2(y), \phi_2(y') \rangle_{\mathcal{H}_2} = \langle \phi_1(x) \otimes \phi_2(y),\; \phi_1(x') \otimes \phi_2(y') \rangle_{\mathrm{HS}},$$
so $k_1 \times k_2$ is an inner product between features in $\mathrm{HS}(\mathcal{H}_2, \mathcal{H}_1)$, hence a kernel.
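
A consequence worth checking numerically (a sketch, not from the slides): on a common set of points, the entrywise product of two Gram matrices is again positive semi-definite (the Schur product theorem).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))

# Two Gram matrices on the same points: linear and squared-linear kernels.
K1 = X @ X.T
K2 = (X @ X.T) ** 2

# Their entrywise (Hadamard) product is the Gram matrix of k1 * k2.
K = K1 * K2
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True: positive semi-definite
```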

Sums and products ⇒ polynomials

Theorem (Polynomial kernels)
Let $x, x' \in \mathbb{R}^d$ for $d \ge 1$, let $m \ge 1$ be an integer, and let $c \ge 0$ be a non-negative real. Then
$$k(x, x') := \big( \langle x, x' \rangle + c \big)^m$$
is a valid kernel.

To prove: expand into a sum (with non-negative scalars) of kernels $\langle x, x' \rangle$ raised to integer powers. These individual terms are valid kernels by the product rule.
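
As a concrete sketch (for $d = 2$, $m = 2$, $c = 1$; the explicit feature map below is one standard choice, not the only one), the polynomial kernel matches a finite-dimensional dot product:

```python
import numpy as np

def poly_features(x, c=1.0):
    """Explicit features for k(x, x') = (<x, x'> + c)^2 in d = 2:
    (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2c) x1, sqrt(2c) x2, c)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print((x @ xp + 1.0) ** 2)                   # 42.25
print(poly_features(x) @ poly_features(xp))  # 42.25, same kernel value
```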

Infinite sequences

The kernels we've seen so far are dot products between finitely many features. E.g.
$$k(x, y) = \begin{bmatrix} \sin(x) & x^3 & \log x \end{bmatrix} \begin{bmatrix} \sin(y) \\ y^3 \\ \log y \end{bmatrix},$$
where $\phi(x) = \begin{bmatrix} \sin(x) & x^3 & \log x \end{bmatrix}^\top$.

Can a kernel be a dot product between infinitely many features?

Infinite sequences

Definition
The space $\ell_p$ of $p$-summable sequences is defined as all sequences $(a_i)_{i \ge 1}$ for which
$$\sum_{i=1}^{\infty} |a_i|^p < \infty.$$

Kernels can be defined in terms of sequences in $\ell_2$.

Theorem
Given a sequence of functions $(\phi_i)_{i \ge 1}$, $\phi_i : \mathcal{X} \to \mathbb{R}$, such that $(\phi_i(x))_{i \ge 1} \in \ell_2$ for every $x \in \mathcal{X}$ (here $\phi_i(x)$ is the $i$th coordinate of $\phi(x)$), then
$$k(x, x') := \sum_{i=1}^{\infty} \phi_i(x)\, \phi_i(x') \qquad (1)$$
is a well-defined kernel.

Infinite sequences (proof)

Proof: We just need to check that the inner product remains finite. The norm $\|a\|_{\ell_2}$ associated with the inner product in (1) is
$$\|a\|_{\ell_2} := \sqrt{\sum_{i=1}^{\infty} a_i^2},$$
where $a$ represents the sequence with terms $a_i$. Via Cauchy-Schwarz,
$$\Big| \sum_{i=1}^{\infty} \phi_i(x)\, \phi_i(x') \Big| \le \|\phi(x)\|_{\ell_2}\, \|\phi(x')\|_{\ell_2},$$
so the series defining the inner product converges for all $x, x' \in \mathcal{X}$.

Taylor series kernels

Definition (Taylor series kernel)
For $r \in (0, \infty]$, with $a_n \ge 0$ for all $n \ge 0$, define
$$f(z) = \sum_{n=0}^{\infty} a_n z^n, \qquad |z| < r,\ z \in \mathbb{R}.$$
Define $\mathcal{X}$ to be the $\sqrt{r}$-ball in $\mathbb{R}^d$, so $\|x\| < \sqrt{r}$. Then
$$k(x, x') := f\big( \langle x, x' \rangle \big) = \sum_{n=0}^{\infty} a_n \langle x, x' \rangle^n$$
is a kernel.

Example (Exponential kernel)
$$k(x, x') := \exp\big( \langle x, x' \rangle \big).$$

Taylor series kernel (proof)

Proof: By Cauchy-Schwarz,
$$|\langle x, x' \rangle| \le \|x\| \|x'\| < r,$$
so the Taylor series converges. Define the multinomial coefficient $c_{j_1 \ldots j_d} := \frac{n!}{\prod_{i=1}^{d} j_i!}$. Then
$$k(x, x') = \sum_{n=0}^{\infty} a_n \Big( \sum_{j=1}^{d} x_j x'_j \Big)^n = \sum_{n=0}^{\infty} a_n \sum_{\substack{j_1, \ldots, j_d \ge 0 \\ j_1 + \cdots + j_d = n}} c_{j_1 \ldots j_d} \prod_{i=1}^{d} (x_i x'_i)^{j_i} = \sum_{j_1, \ldots, j_d \ge 0} a_{j_1 + \cdots + j_d}\, c_{j_1 \ldots j_d} \prod_{i=1}^{d} x_i^{j_i} \prod_{i=1}^{d} (x'_i)^{j_i},$$
an inner product between (weighted monomial) features of $x$ and of $x'$, with non-negative weights.

Gaussian kernel

Example (Gaussian kernel)
The Gaussian kernel on $\mathbb{R}^d$ is defined as
$$k(x, x') := \exp\Big( -\frac{\|x - x'\|^2}{2\sigma^2} \Big).$$
Proof: an exercise! Use the product rule, the mapping rule, and the exponential kernel.
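
A sketch of the exercise's idea in code: $\exp(-\|x-x'\|^2/(2\sigma^2)) = \exp(-\|x\|^2/(2\sigma^2))\,\exp(\langle x, x'\rangle/\sigma^2)\,\exp(-\|x'\|^2/(2\sigma^2))$, i.e. an exponential kernel (with rescaled inputs, via the mapping rule) multiplied by a function of each argument alone.

```python
import numpy as np

sigma = 1.3
x, xp = np.array([0.5, -1.0]), np.array([2.0, 0.3])

gauss = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))

# Factorization behind the exercise: exponential kernel times
# per-argument scaling functions.
factored = (np.exp(-x @ x / (2 * sigma**2))
            * np.exp(x @ xp / sigma**2)
            * np.exp(-xp @ xp / (2 * sigma**2)))

print(np.isclose(gauss, factored))  # True
```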

Positive definite functions

If we are given a function of two arguments, $k(x, x')$, how can we determine if it is a valid kernel?
1. Find a feature map?
   - Sometimes this is not obvious (e.g. if the feature vector is infinite dimensional, as for the Gaussian kernel on the last slide).
   - The feature map is not unique.
2. A direct property of the function: positive definiteness.

Positive definite functions

Definition (Positive definite functions)
A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if $\forall n \ge 1$, $\forall (a_1, \ldots, a_n) \in \mathbb{R}^n$, $\forall (x_1, \ldots, x_n) \in \mathcal{X}^n$,
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j\, k(x_i, x_j) \ge 0.$$
The function $k(\cdot, \cdot)$ is strictly positive definite if, for mutually distinct $x_i$, equality holds only when all the $a_i$ are zero.

Kernels are positive definite

Theorem
Let $\mathcal{H}$ be a Hilbert space, $\mathcal{X}$ a non-empty set and $\phi : \mathcal{X} \to \mathcal{H}$. Then $\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} =: k(x, y)$ is positive definite.

Proof.
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j\, k(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} \langle a_i \phi(x_i), a_j \phi(x_j) \rangle_{\mathcal{H}} = \Big\| \sum_{i=1}^{n} a_i \phi(x_i) \Big\|_{\mathcal{H}}^2 \ge 0.$$

The reverse also holds: a positive definite $k(x, x')$ is an inner product in a unique $\mathcal{H}$ (Moore-Aronszajn: coming later!).
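
In practice one can sanity-check positive definiteness on samples (a sketch, not from the slides; this is a necessary condition, not a proof): every Gram matrix of a valid kernel must have non-negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))

gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2)
K = np.array([[gauss(a, b) for b in X] for a in X])
print(np.linalg.eigvalsh(K).min())  # >= 0 up to round-off: consistent with a kernel

# A difference of kernels can fail the test, as warned earlier:
D = (X @ X.T) - 2 * K
print(np.linalg.eigvalsh(D).min())  # negative: k1 - 2*k2 is not a kernel
```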

The reproducing kernel Hilbert space

First example: finite space, polynomial features

Reminder: XOR example.

[Figure: XOR data in the $(x_1, x_2)$ plane.]

First example: finite space, polynomial features

Reminder: feature space from the XOR motivating example, $\phi : \mathbb{R}^2 \to \mathbb{R}^3$:
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \phi(x) = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix},$$
with kernel
$$k(x, y) = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix}^\top \begin{bmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{bmatrix}$$
(the standard inner product in $\mathbb{R}^3$ between features). Denote this feature space by $\mathcal{H}$.

First example: finite space, polynomial features

Define a linear function of the inputs $x_1$, $x_2$, and their product $x_1 x_2$:
$$f(x) = f_1 x_1 + f_2 x_2 + f_3 x_1 x_2.$$
$f$ lives in a space of functions mapping from $\mathcal{X} = \mathbb{R}^2$ to $\mathbb{R}$. Equivalent representation for $f$:
$$f(\cdot) = \begin{bmatrix} f_1 & f_2 & f_3 \end{bmatrix}^\top.$$
- $f(\cdot)$ refers to the function as an object (here a vector in $\mathbb{R}^3$).
- $f(x) \in \mathbb{R}$ is the function evaluated at a point (a real number).
- $f(x) = f(\cdot)^\top \phi(x) = \langle f(\cdot), \phi(x) \rangle_{\mathcal{H}}$: evaluation of $f$ at $x$ is an inner product in feature space (here the standard inner product in $\mathbb{R}^3$).
- $\mathcal{H}$ is a space of functions mapping $\mathbb{R}^2$ to $\mathbb{R}$.

First example: finite space, polynomial features

$\phi(y)$ is a mapping from $\mathbb{R}^2$ to $\mathbb{R}^3$ ...
... which also parametrizes a function mapping $\mathbb{R}^2$ to $\mathbb{R}$:
$$k(\cdot, y) := \begin{bmatrix} y_1 & y_2 & y_1 y_2 \end{bmatrix}^\top = \phi(y).$$
Given $y$, there is a vector $k(\cdot, y)$ in $\mathcal{H}$ such that
$$\langle k(\cdot, y), \phi(x) \rangle_{\mathcal{H}} = a x_1 + b x_2 + c\, x_1 x_2 = k(x, y),$$
where $a = y_1$, $b = y_2$, and $c = y_1 y_2$. By symmetry,
$$\langle k(\cdot, x), \phi(y) \rangle_{\mathcal{H}} = x_1 y_1 + x_2 y_2 + x_1 x_2\, y_1 y_2 = k(x, y).$$

We can write $\phi(x) = k(\cdot, x)$ and $\phi(y) = k(\cdot, y)$ without ambiguity: the canonical feature map.

The reproducing property

This example illustrates the two defining features of an RKHS:
- The reproducing property: $\forall x \in \mathcal{X}$, $\forall f(\cdot) \in \mathcal{H}$,
  $$\langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$$
  ... or use the shorter notation $\langle f, \phi(x) \rangle_{\mathcal{H}}$. In particular, for any $x, y \in \mathcal{X}$,
  $$k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}}.$$
- The feature map of every point is in the feature space: $\forall x \in \mathcal{X}$, $k(\cdot, x) = \phi(x) \in \mathcal{H}$.
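
A numeric check of the reproducing property for the $\mathbb{R}^3$ example (a sketch, not from the slides):

```python
import numpy as np

phi = lambda x: np.array([x[0], x[1], x[0] * x[1]])

f = np.array([2.0, -1.0, 0.5])  # f(.) as a vector in H = R^3
f_eval = lambda x: f[0] * x[0] + f[1] * x[1] + f[2] * x[0] * x[1]

x = np.array([1.5, -2.0])
print(f_eval(x), f @ phi(x))  # equal: <f(.), k(., x)>_H = f(x)
```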

First example: finite space, polynomial features

Another, more subtle point: $\mathcal{H}$ can be larger than the set of all feature maps $\{\phi(x) : x \in \mathcal{X}\}$. Why?
E.g. $f = \begin{bmatrix} 1 & 1 & -1 \end{bmatrix}^\top \in \mathcal{H}$ cannot be obtained as $\phi(x) = \begin{bmatrix} x_1 & x_2 & x_1 x_2 \end{bmatrix}^\top$ for any $x$ (that would require $x_1 = x_2 = 1$ while $x_1 x_2 = -1$).

Second (infinite) example: Fourier series

Consider functions on the torus $\mathcal{T} := [-\pi, \pi]$ with periodic boundary. Fourier series:
$$f(x) = \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \exp(i \ell x) = \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \big( \cos(\ell x) + i \sin(\ell x) \big).$$

Example: the top hat function,
$$f(x) = \begin{cases} 1 & |x| < T, \\ 0 & T \le |x| < \pi. \end{cases}$$
Fourier series:
$$\hat{f}_\ell := \frac{\sin(\ell T)}{\ell \pi}, \qquad f(x) = \sum_{\ell=0}^{\infty} 2 \hat{f}_\ell \cos(\ell x).$$
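
A sketch reproducing the figures that follow: compute $\hat f_\ell$ and sum the first few cosine terms (with the $\ell = 0$ term counted once, its coefficient being $T/\pi$).

```python
import numpy as np

T = 0.5
ell = np.arange(1, 11)
f_hat = np.sin(ell * T) / (ell * np.pi)  # coefficients for l >= 1
f_hat0 = T / np.pi                        # l = 0 term of the series

x = np.linspace(-np.pi, np.pi, 9)
partial = f_hat0 + 2 * np.sum(f_hat[:, None] * np.cos(np.outer(ell, x)), axis=0)
print(np.round(partial, 2))  # approaches the top hat as terms are added
```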

Fourier series for top hat function

[Figure, repeated over several slides with an increasing number of terms: the top hat function and its truncated Fourier reconstruction $f(x)$; the basis functions $\cos(\ell x)$; and the Fourier series coefficients. As terms are added, the reconstruction approaches the top hat.]

Fourier series for kernel function

Assume the kernel takes a single argument:
$$k(x, y) = \bar{k}(x - y).$$
Define the Fourier series representation of $\bar{k}$:
$$\bar{k}(x) = \sum_{\ell=-\infty}^{\infty} \hat{k}_\ell \exp(i \ell x),$$
where $\bar{k}$ and its Fourier transform are real and symmetric. Example:
$$\bar{k}(x) = \frac{1}{2\pi}\, \vartheta\Big( \frac{x}{2\pi}, \frac{i \sigma^2}{2\pi} \Big), \qquad \hat{k}_\ell = \frac{1}{2\pi} \exp\Big( -\frac{\sigma^2 \ell^2}{2} \Big),$$
where $\vartheta$ is the Jacobi theta function, close to a Gaussian when $\sigma^2$ is sufficiently narrower than $[-\pi, \pi]$.

Fourier series for Gaussian-spectrum kernel

[Figure, repeated over several slides: the Gaussian-spectrum kernel $\bar{k}(x)$; the basis functions $\cos(\ell x)$; and the Fourier series coefficients, which decay rapidly with $\ell$.]

Feature space via Fourier series

Define $\mathcal{H}$ to be the space of functions with the (infinite) feature space representation
$$f(\cdot) = \begin{bmatrix} \ldots & \hat{f}_\ell / \sqrt{\hat{k}_\ell} & \ldots \end{bmatrix}^\top.$$
The space $\mathcal{H}$ has an inner product:
$$\langle f, g \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{\hat{f}_\ell\, \overline{\hat{g}_\ell}}{\hat{k}_\ell}.$$
Define the feature map
$$k(\cdot, x) = \phi(x) = \begin{bmatrix} \ldots & \sqrt{\hat{k}_\ell} \exp(-i \ell x) & \ldots \end{bmatrix}^\top.$$

Feature space via Fourier series

The reproducing property holds:
$$\langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{\hat{f}_\ell\, \overline{\hat{k}_\ell \exp(-i \ell x)}}{\hat{k}_\ell} = \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \exp(i \ell x) = f(x),$$
... including for the kernel itself:
$$\langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{\big( \hat{k}_\ell \exp(-i \ell x) \big) \overline{\big( \hat{k}_\ell \exp(-i \ell y) \big)}}{\hat{k}_\ell} = \sum_{\ell=-\infty}^{\infty} \hat{k}_\ell \exp\big( i \ell (y - x) \big) = \bar{k}(x - y).$$

Fourier series: what does it achieve?

The squared norm of a function $f$ in $\mathcal{H}$ is:
$$\|f\|_{\mathcal{H}}^2 = \langle f, f \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{\hat{f}_\ell\, \overline{\hat{f}_\ell}}{\hat{k}_\ell}.$$
If $\hat{k}_\ell$ decays fast, then so must $\hat{f}_\ell$ if we want $\|f\|_{\mathcal{H}}^2 < \infty$.

Recall
$$f(x) = \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \big( \cos(\ell x) + i \sin(\ell x) \big).$$
This enforces smoothness.

Question: is the top hat function in the Gaussian RKHS?
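
A numeric sketch of the question (not from the slides): the top hat has $\hat f_\ell \sim 1/\ell$ while the Gaussian-spectrum kernel has $\hat k_\ell \sim e^{-\sigma^2 \ell^2 / 2}$, so the partial sums of $|\hat f_\ell|^2 / \hat k_\ell$ blow up, suggesting the top hat is not in this RKHS.

```python
import numpy as np

T, sigma = 0.5, 0.3
for L in [5, 10, 15, 20]:
    ell = np.arange(1, L + 1)
    f_hat = np.sin(ell * T) / (ell * np.pi)
    k_hat = np.exp(-sigma**2 * ell**2 / 2) / (2 * np.pi)
    print(L, np.sum(f_hat**2 / k_hat))  # partial RKHS norms diverge with L
```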

Third example: infinite feature space

Reproducing property for a function with the Gaussian kernel:
$$f(x) := \sum_{i=1}^{m} \alpha_i k(x_i, x) = \Big\langle \sum_{i=1}^{m} \alpha_i \phi(x_i),\ \phi(x) \Big\rangle_{\mathcal{H}}.$$

[Figure: a smooth function $f(x)$ formed as a weighted sum of Gaussian bumps.]

- What do the features $\phi(x)$ look like? (There are infinitely many of them, and they are not unique!)
- What do these features have to do with smoothness?

Third example: infinite feature space

Define an RKHS kernel $k$ such that $\|k\|_{L_2(\mu)} < \infty$ and the associated RKHS $\mathcal{H}$ is separable. The operator
$$T_k : L_2(\mu) \to L_2(\mu), \qquad f \mapsto \int_{\mathcal{X}} f(x')\, k(x, x')\, d\mu(x')$$
is compact, positive, self-adjoint (Steinwart and Christmann, Theorem 4.27).

By the spectral theorem there is an at most countable ONS $(e_j)$ such that
$$T_k f = \sum_{j} \lambda_j \langle f, e_j \rangle\, e_j, \qquad \int_{\mathcal{X}} e_i(x)\, e_j(x)\, d\mu(x) = \begin{cases} 1 & i = j, \\ 0 & i \ne j. \end{cases}$$
Can we use the $\{\lambda_i, e_i\}$ to construct a feature space for $\mathcal{H}$?

Third example: infinite feature space

Theorem (Mercer)
Let $\mathcal{X}$ be a compact metric space, $k$ a continuous kernel, and $\mu$ a finite Borel measure with $\mathrm{supp}\{\mu\} = \mathcal{X}$. Then the convergence of
$$k(x, y) = \sum_{j} \lambda_j\, e_j(x)\, e_j(y)$$
is absolute and uniform ($e_j$ is the continuous element of the $L_2(\mu)$ equivalence class $[e_j]$).

Third example: infinite feature space

Theorem (Mercer RKHS) (Steinwart and Christmann, Theorem 4.51)
Under the assumptions of Mercer's theorem,
$$\mathcal{H} := \Big\{ \sum_{i} a_i \sqrt{\lambda_i}\, e_i \;:\; (a_i) \in \ell_2 \Big\} \qquad (2)$$
is an RKHS with kernel $k$. The feature map is
$$\phi(x) = \begin{bmatrix} \ldots & \sqrt{\lambda_i}\, e_i(x) & \ldots \end{bmatrix}^\top.$$
Given two functions in the RKHS,
$$f := \sum_{i} a_i \sqrt{\lambda_i}\, e_i, \qquad g := \sum_{i} b_i \sqrt{\lambda_i}\, e_i,$$
the inner product is $\langle f, g \rangle_{\mathcal{H}} = \sum_{i} a_i b_i$.

Third example: infinite feature space

Proof: Most of the requirements for this being a Hilbert space are straightforward. There are two aspects requiring care:
1. Is $k(x, \cdot) \in \mathcal{H}$ for all $x \in \mathcal{X}$? (Requires Mercer's theorem.)
2. Does the reproducing property hold, $\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$?

First part: by the definition of $\mathcal{H}$ in (2), the function in $\mathcal{H}$ indexed by $x$ is
$$k(x, \cdot) = \sum_{i} \big( \sqrt{\lambda_i}\, e_i(x) \big) \big( \sqrt{\lambda_i}\, e_i(\cdot) \big).$$
Is this function in the RKHS? Yes, if the $\ell_2$ norm of $\big( \sqrt{\lambda_i}\, e_i(x) \big)_i$ is bounded. This is due to Mercer: $\forall x \in \mathcal{X}$,
$$\Big\| \big( \sqrt{\lambda_i}\, e_i(x) \big)_i \Big\|_{\ell_2}^2 = \sum_{i} \lambda_i\, e_i^2(x) = k(x, x) < \infty.$$

Third example: infinite feature space

Proof (cont.)
Second part: the reproducing property holds. Using the inner product definition, for $f = \sum_i a_i \sqrt{\lambda_i}\, e_i$,
$$\langle f, k(x, \cdot) \rangle_{\mathcal{H}} = \sum_{i} a_i \sqrt{\lambda_i}\, e_i(x) = f(x),$$
which is always well defined, since both the coefficients of $f$ and those of $k(x, \cdot)$ are in $\ell_2$.

Third example: infinite feature space

Gaussian kernel, $k(x, y) = \exp\big( -\frac{\|x - y\|^2}{2\sigma^2} \big)$:
$$\lambda_k \propto b^k \ (b < 1), \qquad e_k(x) \propto \exp\big( -(c - a) x^2 \big)\, H_k\big( x \sqrt{2c} \big),$$
where $a$, $b$, $c$ are functions of $\sigma$, and $H_k$ is the $k$th order Hermite polynomial. Then
$$k(x, x') = \sum_{i=1}^{\infty} \lambda_i\, e_i(x)\, e_i(x').$$
(Figure from Rasmussen and Williams.)

WARNING: $\mathbb{R}$ is a non-compact domain, so we cannot use the Mercer argument in the form given earlier.

Third example: infinite feature space

Example RKHS function, Gaussian kernel:
$$f(x) := \sum_{i=1}^{m} \alpha_i k(x_i, x) = \sum_{i=1}^{m} \alpha_i \sum_{j=1}^{\infty} \lambda_j\, e_j(x_i)\, e_j(x) = \sum_{j=1}^{\infty} \hat{f}_j \Big[ \sqrt{\lambda_j}\, e_j(x) \Big],$$
where $\hat{f}_j = \sum_{i=1}^{m} \alpha_i \sqrt{\lambda_j}\, e_j(x_i)$.

[Figure: a smooth function $f(x)$ formed as a weighted sum of Gaussian bumps.]

NOTE that this enforces smoothing: the $\lambda_j$ decay as the $e_j$ become rougher, and the $\hat{f}_j$ decay since $\sum_j \hat{f}_j^2 < \infty$.

Some reproducing kernel Hilbert space theory

Reproducing kernel Hilbert space (1)

Definition
Let $\mathcal{H}$ be a Hilbert space of $\mathbb{R}$-valued functions on a non-empty set $\mathcal{X}$. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a reproducing kernel of $\mathcal{H}$, and $\mathcal{H}$ is a reproducing kernel Hilbert space, if
- $\forall x \in \mathcal{X}$, $k(\cdot, x) \in \mathcal{H}$,
- $\forall x \in \mathcal{X}$, $\forall f \in \mathcal{H}$, $\langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$ (the reproducing property).

In particular, for any $x, y \in \mathcal{X}$,
$$k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}}. \qquad (3)$$

Original definition: a kernel is an inner product between feature maps. Then $\phi(x) = k(\cdot, x)$ is a valid feature map.

Reproducing kernel Hilbert space (2)

Another RKHS definition:
Define $\delta_x$ to be the operator of evaluation at $x$, i.e.
$$\delta_x f = f(x), \qquad \forall f \in \mathcal{H},\ x \in \mathcal{X}.$$

Definition (Reproducing kernel Hilbert space)
$\mathcal{H}$ is an RKHS if the evaluation operator $\delta_x$ is bounded: $\forall x \in \mathcal{X}$ there exists $\lambda_x \ge 0$ such that for all $f \in \mathcal{H}$,
$$|f(x)| = |\delta_x f| \le \lambda_x \|f\|_{\mathcal{H}}.$$
$\Rightarrow$ two functions identical in RKHS norm agree at every point:
$$|f(x) - g(x)| = |\delta_x (f - g)| \le \lambda_x \|f - g\|_{\mathcal{H}}, \qquad \forall f, g \in \mathcal{H}.$$

RKHS definitions equivalent

Theorem (Reproducing kernel equivalent to bounded $\delta_x$)
$\mathcal{H}$ is a reproducing kernel Hilbert space (i.e., its evaluation operators $\delta_x$ are bounded linear operators) if and only if $\mathcal{H}$ has a reproducing kernel.

Proof: If $\mathcal{H}$ has a reproducing kernel $\Rightarrow$ $\delta_x$ bounded:
$$|\delta_x[f]| = |f(x)| = |\langle f, k(\cdot, x) \rangle_{\mathcal{H}}| \le \|k(\cdot, x)\|_{\mathcal{H}}\, \|f\|_{\mathcal{H}} = \langle k(\cdot, x), k(\cdot, x) \rangle_{\mathcal{H}}^{1/2}\, \|f\|_{\mathcal{H}} = k(x, x)^{1/2}\, \|f\|_{\mathcal{H}},$$
using Cauchy-Schwarz for the inequality. Consequently, $\delta_x : \mathcal{H} \to \mathbb{R}$ is bounded with $\lambda_x = k(x, x)^{1/2}$.

RKHS definitions equivalent

Proof: $\delta_x$ bounded $\Rightarrow$ $\mathcal{H}$ has a reproducing kernel. We use ...

Theorem (Riesz representation)
In a Hilbert space $\mathcal{H}$, all bounded linear functionals are of the form $\langle \cdot, g \rangle_{\mathcal{H}}$, for some $g \in \mathcal{H}$.

If $\delta_x : \mathcal{H} \to \mathbb{R}$ is a bounded linear functional, by Riesz there exists $f_{\delta_x} \in \mathcal{H}$ such that
$$\delta_x f = \langle f, f_{\delta_x} \rangle_{\mathcal{H}}, \qquad \forall f \in \mathcal{H}.$$
Define $k(x', x) = f_{\delta_x}(x')$, $\forall x, x' \in \mathcal{X}$. By its definition, both $k(\cdot, x) = f_{\delta_x} \in \mathcal{H}$ and $\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = \delta_x f = f(x)$. Thus, $k$ is the reproducing kernel.

Moore-Aronszajn Theorem

Theorem (Moore-Aronszajn)
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be positive definite. There is a unique RKHS $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$ with reproducing kernel $k$.

Recall the feature map is not unique (as we saw earlier): only the kernel is.

Main message #1

Reproducing kernels $\Leftrightarrow$ positive definite functions $\Leftrightarrow$ Hilbert function spaces with bounded point evaluation.

Main message #2

Small RKHS norm results in smooth functions.
E.g. kernel ridge regression with the Gaussian kernel:
$$f^* = \arg\min_{f \in \mathcal{H}} \Big( \sum_{i=1}^{n} \big( y_i - \langle f, \phi(x_i) \rangle_{\mathcal{H}} \big)^2 + \lambda \|f\|_{\mathcal{H}}^2 \Big)$$

[Figure: regression fits for $\lambda = 0.1$, $\lambda = 10$, and $\lambda = 10^{-7}$ (all with $\sigma = 0.6$), showing smoother fits for larger $\lambda$.]

Moore-Aronszajn Theorem: pre-RKHS

How do we prove this? (Sketch only; a very good full proof is in Berlinet and Thomas-Agnan, 2004, Chapter 1.)

Starting with a positive definite $k$, construct a pre-RKHS (an inner product space) $\mathcal{H}_0 \subset \mathbb{R}^{\mathcal{X}}$ with the properties:
1. The evaluation functionals $\delta_x$ are continuous on $\mathcal{H}_0$,
2. Any $\mathcal{H}_0$-Cauchy sequence $f_n$ which converges pointwise to 0 also converges in $\mathcal{H}_0$-norm to 0.

Moore-Aronszajn Theorem: pre-RKHS

The pre-RKHS $\mathcal{H}_0 = \mathrm{span}\{k(\cdot, x) \mid x \in \mathcal{X}\}$ will be taken to be the set of functions
$$f(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i).$$

[Figure: such a function as a weighted sum of Gaussian bumps.]

Moore-Aronszajn Theorem: Steps

Theorem (Moore-Aronszajn, Step A)
The space $\mathcal{H}_0 = \mathrm{span}\{k(\cdot, x) \mid x \in \mathcal{X}\}$, endowed with the inner product
$$\langle f, g \rangle_{\mathcal{H}_0} = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j\, k(x_i, y_j),$$
where $f = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i)$ and $g = \sum_{j=1}^{m} \beta_j k(\cdot, y_j)$, is a valid pre-RKHS.

Theorem (Moore-Aronszajn, Step B)
Let $\mathcal{H}_0$ be a pre-RKHS space. Define $\mathcal{H}$ to be the set of functions $f \in \mathbb{R}^{\mathcal{X}}$ for which there exists an $\mathcal{H}_0$-Cauchy sequence $\{f_n\}$ converging pointwise to $f$. Then $\mathcal{H}$ is an RKHS.

Moore-Aronszajn Theorem: Step A

- Is $\langle f, g \rangle_{\mathcal{H}_0}$ a valid inner product?
- Are the evaluation functionals $\delta_x$ continuous on $\mathcal{H}_0$?
- Does every $\mathcal{H}_0$-Cauchy sequence $f_n$ which converges pointwise to 0 also converge in $\mathcal{H}_0$-norm to 0?

Moore-Aronszajn Theorem: Step B

Define $\mathcal{H}$ to be the set of functions $f \in \mathbb{R}^{\mathcal{X}}$ for which there exists an $\mathcal{H}_0$-Cauchy sequence $\{f_n\}$ converging pointwise to $f$. Clearly, $\mathcal{H}_0 \subset \mathcal{H}$.
1. We define the inner product between $f, g \in \mathcal{H}$ as the limit of an inner product of the $\mathcal{H}_0$-Cauchy sequences $\{f_n\}$, $\{g_n\}$ converging to $f$ and $g$ respectively. Is this inner product well defined, i.e., independent of the sequences used?
2. An inner product space must satisfy $\langle f, f \rangle_{\mathcal{H}} = 0$ iff $f = 0$. Is this true when we define the inner product on $\mathcal{H}$ as above?
3. Are the evaluation functionals still continuous on $\mathcal{H}$?
4. Is $\mathcal{H}$ complete (i.e., does every $\mathcal{H}$-Cauchy sequence converge)?

(1) + (2) + (3) + (4) $\Rightarrow$ $\mathcal{H}$ is an RKHS!

Kernel Ridge Regression

Kernel ridge regression

[Figure: three kernel ridge regression fits to the same data, from under- to over-smoothed.]

Very simple to implement; works well when there are no outliers.

Ridge regression: case of $\mathbb{R}^D$

We are given $n$ training points in $\mathbb{R}^D$:
$$X = \begin{bmatrix} x_1 & \ldots & x_n \end{bmatrix} \in \mathbb{R}^{D \times n}, \qquad y := \begin{bmatrix} y_1 & \ldots & y_n \end{bmatrix}^\top.$$
Define some $\lambda > 0$. Our goal is:
$$f^* = \arg\min_{f \in \mathbb{R}^D} \sum_{i=1}^{n} (y_i - x_i^\top f)^2 + \lambda \|f\|^2.$$
The second term $\lambda \|f\|^2$ is chosen to avoid problems in high dimensional spaces (see below).

Ridge regression: case of $\mathbb{R}^D$

We are given $n$ training points in $\mathbb{R}^D$:
$$X = \begin{bmatrix} x_1 & \ldots & x_n \end{bmatrix} \in \mathbb{R}^{D \times n}, \qquad y := \begin{bmatrix} y_1 & \ldots & y_n \end{bmatrix}^\top.$$
Define some $\lambda > 0$. Our goal is:
$$f^* = \arg\min_{f \in \mathbb{R}^D} \sum_{i=1}^{n} (y_i - x_i^\top f)^2 + \lambda \|f\|^2.$$
The solution is:
$$f^* = \big( X X^\top + \lambda I \big)^{-1} X y,$$
which is the classic regularized least squares solution.
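
A minimal numpy check of the closed form (a sketch; the data and true weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
D, n, lam = 3, 50, 0.1
X = rng.normal(size=(D, n))  # columns are the points x_i
y = X.T @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Closed-form ridge solution f* = (X X^T + lam I)^{-1} X y.
f_star = np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)
print(f_star)  # close to the true weights (1, -2, 0.5)
```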

Kernel ridge regression

Use features $\phi(x_i)$ in the place of $x_i$:
$$f^* = \arg\min_{f \in \mathcal{H}} \Big( \sum_{i=1}^{n} \big( y_i - \langle f, \phi(x_i) \rangle_{\mathcal{H}} \big)^2 + \lambda \|f\|_{\mathcal{H}}^2 \Big).$$
E.g. for finite dimensional feature spaces,
$$\phi_p(x) = \begin{bmatrix} x \\ x^2 \\ \vdots \\ x^\ell \end{bmatrix}, \qquad \phi_s(x) = \begin{bmatrix} \sin x \\ \cos x \\ \sin 2x \\ \vdots \\ \cos \ell x \end{bmatrix}.$$
Here $f$ is a vector of length $\ell$ giving weight to each of these features, so as to find the mapping between $x$ and $y$. Feature vectors can also have infinite length (more soon).

Kernel ridge regression

The solution is easy if we already know $f$ is a linear combination of feature space mappings of points: the representer theorem.
$$f^* = \sum_{i=1}^{n} \alpha_i \phi(x_i) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot).$$

[Figure: the resulting fit as a weighted sum of Gaussian bumps.]

Representer theorem

Given a set of paired observations $(x_1, y_1), \ldots, (x_n, y_n)$ (regression or classification), find the function $f^*$ in the RKHS $\mathcal{H}$ which satisfies
$$J(f^*) = \min_{f \in \mathcal{H}} J(f),$$
where
$$J(f) = L_y(f(x_1), \ldots, f(x_n)) + \Omega\big( \|f\|_{\mathcal{H}}^2 \big), \qquad (4)$$
$\Omega$ is non-decreasing, and $y$ is the vector of the $y_i$.
- Classification: $L_y(f(x_1), \ldots, f(x_n)) = \sum_{i=1}^{n} \mathbb{I}_{y_i f(x_i) \le 0}$
- Regression: $L_y(f(x_1), \ldots, f(x_n)) = \sum_{i=1}^{n} (y_i - f(x_i))^2$

Representer theorem

The representer theorem (simple version): a solution to
$$\min_{f \in \mathcal{H}} \Big[ L_y(f(x_1), \ldots, f(x_n)) + \Omega\big( \|f\|_{\mathcal{H}}^2 \big) \Big]$$
takes the form
$$f^* = \sum_{i=1}^{n} \alpha_i\, k(x_i, \cdot).$$
If $\Omega$ is strictly increasing, all solutions have this form.

Representer theorem: proof

Proof: Denote by $f_s$ the projection of $f$ onto the subspace
$$\mathrm{span}\{k(x_i, \cdot) : 1 \le i \le n\}, \qquad (5)$$
such that
$$f = f_s + f_\perp,$$
where $f_s = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$.
Regularizer:
$$\|f\|_{\mathcal{H}}^2 = \|f_s\|_{\mathcal{H}}^2 + \|f_\perp\|_{\mathcal{H}}^2 \ge \|f_s\|_{\mathcal{H}}^2,$$
then
$$\Omega\big( \|f\|_{\mathcal{H}}^2 \big) \ge \Omega\big( \|f_s\|_{\mathcal{H}}^2 \big),$$
so this term is minimized for $f = f_s$.

Representer theorem: proof

Proof (cont.): Individual terms $f(x_i)$ in the loss:
$$f(x_i) = \langle f, k(x_i, \cdot) \rangle_{\mathcal{H}} = \langle f_s + f_\perp, k(x_i, \cdot) \rangle_{\mathcal{H}} = \langle f_s, k(x_i, \cdot) \rangle_{\mathcal{H}},$$
so
$$L_y(f(x_1), \ldots, f(x_n)) = L_y(f_s(x_1), \ldots, f_s(x_n)).$$
Hence:
- The loss $L_y(\ldots)$ only depends on the component of $f$ in the data subspace,
- The regularizer $\Omega(\ldots)$ is minimized when $f = f_s$,
- If $\Omega$ is strictly increasing, then $\|f_\perp\|_{\mathcal{H}} = 0$ is required at the minimum.

Kernel ridge regression: proof

We begin knowing $f$ is a linear combination of feature space mappings of points (representer theorem):
$$f = \sum_{i=1}^{n} \alpha_i \phi(x_i).$$
Then
$$\sum_{i=1}^{n} \big( y_i - \langle f, \phi(x_i) \rangle_{\mathcal{H}} \big)^2 + \lambda \|f\|_{\mathcal{H}}^2 = \|y - K\alpha\|^2 + \lambda \alpha^\top K \alpha.$$
Differentiating with respect to $\alpha$ and setting the gradient to zero, we get
$$\alpha^* = (K + \lambda I_n)^{-1} y.$$
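
Putting the pieces together, a minimal kernel ridge regression sketch (Gaussian kernel; the data, bandwidth, and regularizer are illustrative, not from the slides):

```python
import numpy as np

def gram(X1, X2, sigma):
    """Gaussian kernel Gram matrix: K[i, j] = exp(-(x_i - x_j)^2 / (2 sigma^2))."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(4)
x_train = rng.uniform(-1, 1.5, size=40)
y_train = np.sin(4 * x_train) + 0.2 * rng.normal(size=40)

sigma, lam = 0.6, 0.1
K = gram(x_train, x_train, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)  # (K + lam I)^{-1} y

# Predict via f(x) = sum_i alpha_i k(x_i, x).
x_test = np.linspace(-1, 1.5, 5)
f_test = gram(x_test, x_train, sigma) @ alpha
print(np.round(f_test, 2))
```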

Reminder: smoothness

What does $\|f\|_{\mathcal{H}}$ have to do with smoothing?
Example 1: the Fourier series representation on the torus $\mathcal{T}$:
$$f(x) = \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \exp(i \ell x), \qquad \langle f, g \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{\hat{f}_\ell\, \overline{\hat{g}_\ell}}{\hat{k}_\ell}.$$
Thus,
$$\|f\|_{\mathcal{H}}^2 = \langle f, f \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{|\hat{f}_\ell|^2}{\hat{k}_\ell}.$$

Reminder: smoothness

What does $\|f\|_{\mathcal{H}}$ have to do with smoothing?
Example 2: the Gaussian kernel on $\mathbb{R}$. Recall
$$f(x) = \sum_{i=1}^{\infty} a_i \sqrt{\lambda_i}\, e_i(x), \qquad \|f\|_{\mathcal{H}}^2 = \sum_{i=1}^{\infty} a_i^2.$$

Parameter selection for KRR

Given the objective
$$f^* = \arg\min_{f \in \mathcal{H}} \Big( \sum_{i=1}^{n} \big( y_i - \langle f, \phi(x_i) \rangle_{\mathcal{H}} \big)^2 + \lambda \|f\|_{\mathcal{H}}^2 \Big),$$
how do we choose
- the regularization parameter $\lambda$?
- the kernel parameter: for the Gaussian kernel, $\sigma$ in
$$k(x, y) = \exp\Big( -\frac{\|x - y\|^2}{2\sigma^2} \Big)?$$
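
A common answer is held-out validation over a grid (a sketch, not from the slides; the split, grid values, and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1.5, size=40)
y = np.sin(4 * x) + 0.2 * rng.normal(size=40)

def gram(X1, X2, sigma):
    return np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2 * sigma**2))

# Hold out 10 points for validation; grid-search (lam, sigma) by held-out MSE.
x_tr, y_tr, x_val, y_val = x[:30], y[:30], x[30:], y[30:]
best = min(
    (np.mean((gram(x_val, x_tr, s) @ np.linalg.solve(
        gram(x_tr, x_tr, s) + l * np.eye(30), y_tr) - y_val) ** 2), l, s)
    for l in [1e-7, 1e-3, 0.1, 10] for s in [0.1, 0.6, 2.0])
print(best)  # (validation MSE, lam, sigma)
```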

Choice of

=0.1, =0.6
1

0.5

0.5

1
0.5

0.5

1.5

Lecture 1: Introduction to RKHS

Feature space
Basics of reproducing kernel Hilbert spaces
Kernel Ridge Regression

Choice of

=0.1, =0.6

=10, =0.6

0.5

0.5

=1e07, =0.6
1.5
1
0.5

0.5

0.5

1
0.5

0.5

1.5

1
0.5

0.5

0.5

1.5

1
0.5

0.5

Lecture 1: Introduction to RKHS

1.5

Feature space
Basics of reproducing kernel Hilbert spaces
Kernel Ridge Regression

Choice of

=0.1, =0.6
1

0.5

0.5

1
0.5

0.5

1.5

Lecture 1: Introduction to RKHS

Feature space
Basics of reproducing kernel Hilbert spaces
Kernel Ridge Regression

Choice of

=0.1, =0.6

=0.1, =2

=0.1, =0.1

0.5

0.5

0.5

0.5

0.5

0.5

1
0.5

0.5

1.5

1
0.5

0.5

1.5

1
0.5

0.5

Lecture 1: Introduction to RKHS

1.5
