
TMA 4180 Optimization Theory

Basic Mathematical Tools


H. E. Krogstad, IMF, spring 2008
1 INTRODUCTION
During the lectures we need some basic topics and concepts from mathematical analysis. This
material is actually not so difficult, if you happen to have seen it before. If this is the first time,
experience has shown that even if it looks simple and obvious, it is necessary to spend some time
digesting it.
Nevertheless, the note should be read somewhat relaxed. Not all details are included, nor are all
proofs written out in detail. After all, this is not a course in mathematical analysis.
Among the central topics are the Taylor Formula in n dimensions, the general optimization setting,
and above all, basic properties of convex sets and convex functions. A very short review about
matrix norms and Hilbert space has also been included. The big optimization theorem in Hilbert
space is the Projection Theorem. Its significance in modern technology and signal processing can
hardly be over-emphasized, although it is often disguised under other fancy names.
The final result in the note is the Implicit Function Theorem, which ensures the existence of
solutions of implicit equations.
The abbreviation N&W refers to the textbook, J. Nocedal and S. Wright: Numerical Optimization,
Springer. Note that page numbers in the first and second editions are different.
2 TERMINOLOGY AND BASICS
Vectors in R^n are, for simplicity, denoted by regular letters, x, y, z, ..., and |x| is used for their
length (norm),
|x| = (x_1^2 + x_2^2 + ··· + x_n^2)^{1/2}. (1)
Occasionally, x_1, x_2, ... will also mean a sequence of vectors, but the meaning of the indices will
then be clear from the context.
We are considering functions f from R^n to R. Such a function will often be defined for all or
most of R^n, but we may only be considering f on a subset Ω ⊂ R^n. Since the definition domain
of f typically extends Ω, it is in general not a problem to define the derivatives of f, ∂f/∂x_i, also
on the boundary of Ω. The gradient, ∇f, is a vector, and in mathematics (but not in N&W!) it
is considered to be a row vector,
∇f = ( ∂f/∂x_1 , ∂f/∂x_2 , ... , ∂f/∂x_n ). (2)
We shall follow this convention and write ∇f(x) y for ∇f(x) · y. There are, however, some
situations where the direction d defined by the gradient is needed, and then d = ∇f^t. In the
lectures we use ^t for transposing vectors and matrices.
A set Ω ⊂ R^n is open if all points in Ω may be surrounded by a ball in R^n belonging to Ω: For
all x_0 ∈ Ω, there is an r > 0 such that
{x ; |x − x_0| < r} ⊂ Ω. (3)
(This notation means "The collection of all x-s such that |x − x_0| < r").
It is convenient also to say that a set A is open in Ω if there is an open set O ⊂ R^n such
that A = O ∩ Ω (the mathematical term for this is a relatively open set). Let Ω = [0, 1] ⊂ R.
The set [0, 1/2) is not open in R (why?). However, as a subset of Ω, [0, 1/2) ⊂ [0, 1], it is open in
Ω, since [0, 1/2) = (−1/2, 1/2) ∩ [0, 1] (Think about this for a while!).
A neighborhood of a point x is simply an open set containing x.
2.1 Sup and Inf – Max and Min
Consider a set S of real numbers. The supremum of the set, denoted
sup S, (4)
is the smallest number that is equal to or larger than all members of the set.
It is a very fundamental property of real numbers that the supremum always exists, although it
may be infinite. If there is a member x_0 ∈ S such that
x_0 = sup S, (5)
then x_0 is called a maximum and written
x_0 = max S. (6)
Sometimes such a maximum does not exist: Let
S = {1 − 1/n ; n = 1, 2, ...}. (7)
In this case, there is no maximum element in S. However, sup S = 1, since no number less than
1 fits the definition. Nevertheless, 1 is not a maximum, since it is not a member of the set. This
is the rule:
A supremum always exists, but may be +∞. If a maximum exists, it is equal to the supremum.
For example,
sup{1, 2, 3} = max{1, 2, 3} = 3,
sup{x ; 0 < x < 3} = 3, (8)
sup{1, 2, 3, ...} = ∞.
The infimum of a set S, denoted
inf S, (9)
is the largest number that is smaller than or equal to all members in the set.
The minimum is defined accordingly, and the rule is the same.
We will only meet sup and inf in connection with real numbers, although this can be defined for
other mathematical structures as well. As noted above, the existence of supremum and infimum
is quite fundamental for real numbers!
A set S of real numbers is bounded above if sup S is finite (sup S < ∞), and bounded below if inf S
is finite (−∞ < inf S). The set is bounded if both sup S and inf S are finite.
2.2 Convergence of Sequences
A Cauchy sequence {x_i}_{i=1}^∞ of real numbers is a sequence where
lim_{n→∞} ( sup_{m≥n} |x_m − x_n| ) = 0. (10)
This definition is a bit tricky, but if you pick an ε > 0, I can always find an n_ε such that
|x_m − x_{n_ε}| < ε (11)
for all x_m where m ≥ n_ε.
Another very basic property of real numbers is that all Cauchy sequences converge, that is,
lim_{n→∞} x_n = a (12)
for a (unique) real number a.
A sequence S = {x_n}_{n=1}^∞ is monotonically increasing if
x_1 ≤ x_2 ≤ x_3 ≤ ··· . (13)
A monotonically increasing sequence is always convergent,
lim_{n→∞} x_n = sup S, (14)
(it may diverge to +∞). Thus, a monotonically increasing sequence that is bounded above is
always convergent (You should try to prove this by applying the definition of sup and the definition
of a Cauchy sequence!).
Similar results also apply for monotonically decreasing sequences.
2.3 Compact Sets
A set S in R^n is bounded if
sup_{x∈S} |x| < ∞. (15)
A Cauchy sequence S = {x_n}_{n=1}^∞ ⊂ R^n is a sequence such that
lim_{n→∞} ( sup_{m≥n} |x_m − x_n| ) = 0. (16)
It is easy to see, by noting that every component of the vectors is a sequence of real numbers,
that all Cauchy sequences in R^n converge.
A set C in R^n is closed if all Cauchy sequences that can be formed from elements in C converge to
elements in C. This may be a bit difficult to grasp: Can you see why the interval [0, 1] is closed,
while (0, 1) or (0, 1] are not? What about [0, ∞)? Thus, a set is closed if it already contains all
the limits of its Cauchy sequences. By adding these limits to an arbitrary set C, we close it, and
write C̄ for the closure of C. For example,
the closure of (0, 1) is [0, 1]. (17)
Consider a bounded sequence S = {x_n}_{n=1}^∞ in R, and assume for simplicity that
0 = inf S ≤ x_n ≤ sup S = 1. (18)
Split the interval [0, 1] into half, say [0, 1/2) and [1/2, 1]. Select one of these intervals containing
infinitely many elements from S, and pick one x_{n_1} ∈ S from the same interval. Repeat the
operation by halving this interval and selecting another element x_{n_2}. Continue the same way. On
step k, the interval I_k will have length 2^{−k} and all later elements x_{n_k}, x_{n_{k+1}}, x_{n_{k+2}}, ... will be
members of I_k. This makes the sub-sequence {x_{n_k}}_{k=1}^∞ ⊂ S into a Cauchy sequence (why?), and
hence it converges. A similar argument works for a sequence in R^n.
A closed set with the property that all bounded sequences have convergent subsequences is called
compact (this is a mathematical term, not really related to the everyday meaning of the word).
By an easy adaptation of the argument above, we have now proved that all bounded and closed
sets in R^n are compact.
Of course, as long as the set above is bounded, {x_{n_k}}_{k=1}^∞ will be convergent, but the limit may
not belong to the set, unless it is closed.
If you know the Hilbert space l_2 (or see below) consisting of all infinite-dimensional vectors
x = (α_1, α_2, ...) such that |x|^2 = Σ_{i=1}^∞ |α_i|^2 < ∞, you will probably also know that the unit
ball, B = {x ; |x| ≤ 1}, is bounded (obvious) and closed (not so obvious). All unit vectors {e_i}_{i=1}^∞
in an orthogonal basis will belong to B. However, |e_i − e_j|^2 = |e_i|^2 + |e_j|^2 = 2 whenever i ≠ j.
We have no convergent subsequences in this case, and B is not compact! This rather surprising
example occurs because l_2 has infinite dimension.
2.4 O() and o() statements
It is convenient to write that the size of f(x) is of the order of g(x) when x → a in the short
form
f(x) = O(g(x)), x → a. (19)
Mathematically, this means that there exist two finite numbers, m and M, such that
m g(x) ≤ f(x) ≤ M g(x) (20)
when x → a. In practice, we often use the notation to mean
|f(x)| ≤ M g(x) (21)
and assume that a lower bound, not very much smaller than M g(x), can be found. For example,
log(1 + x) − x = O(x^2) when x → 0.
The other symbol, o(), is slightly more precise: We say that f(x) = o(g(x)) when x → a if
lim_{x→a} f(x)/g(x) = 0. (22)
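As a small numerical illustration (a Python sketch, not part of the original note; the sample points are arbitrary), the ratio (log(1 + x) − x)/x^2 stays bounded as x → 0, consistent with the O(x^2) statement, while (log(1 + x) − x)/x tends to 0, consistent with o(x):

```python
import numpy as np

# log(1+x) - x = O(x^2): the ratio r/x^2 stays bounded (it tends to -1/2),
# while r/x -> 0, i.e. log(1+x) - x = o(x).
for x in [1e-1, 1e-2, 1e-3, 1e-4]:
    r = np.log1p(x) - x
    print(f"x = {x:8.1e}   r/x^2 = {r / x**2:+.6f}   r/x = {r / x:+.2e}")
```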
2.5 The Taylor Formula
You should all be familiar with the Taylor series of a function g of one variable,
g(t) = g(t_0) + g'(t_0)(t − t_0) + g''(t_0)/2! (t − t_0)^2 + g'''(t_0)/3! (t − t_0)^3 + ··· . (23)
The Taylor Formula is not a series, but a quite useful finite identity. In essence, the Taylor
Formula gives an expression for the error between the function and its Taylor series truncated
after a finite number of terms.
We shall not dwell on the derivation of the formula, which follows by successive partial integrations
of the expression
g(t) = g(0) + ∫_0^t g'(s) ds, (24)
and the Integral Mean Value Theorem,
∫_0^t f(s) φ(s) ds = f(θt) ∫_0^t φ(s) ds,  φ ≥ 0, f continuous, θ ∈ (0, 1).
The formulae below state for simplicity the results around t = 0, but any point is equally good.
The simplest and very useful form of the Taylor Formula is also known as the Secant Formula:
If the derivative g' exists for all values between 0 and t, there is a θ ∈ (0, 1) such that
g(t) = g(0) + g'(θt) t. (25)
This is an identity. However, since we do not know the value of θ, which in general depends on t,
we can not use it for computing g(t)! Nevertheless, the argument θt is at least somewhere in the
open interval between 0 and t.
If g' is continuous at t = 0, we may write
g(t) = g(0) + g'(θt) t
     = g(0) + g'(0) t + [g'(θt) − g'(0)] t (26)
     = g(0) + g'(0) t + o(t).
Moreover, if g'' exists between 0 and t, we have the second order formula,
g(t) = g(0) + g'(0) t + g''(θt) t^2/2! (27)
(Try to prove this using the Integral Mean Value Theorem and assuming that g'' is continuous!
Be sure to use t − s as the weight function in the integral of g''(s) ds).
Hence, if g'' is bounded,
g(t) = g(0) + g'(0) t + O(t^2). (28)
The general form of the Taylor Formula, around 0 and with sufficiently smooth functions, reads
g(t) = Σ_{j=0}^N g^{(j)}(0)/j! t^j + R_N(t), (29)
R_N(t) = ∫_0^t g^{(N+1)}(s)/N! (t − s)^N ds = g^{(N+1)}(θt)/(N + 1)! t^{N+1},  θ ∈ (0, 1). (30)
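As a small numerical check (a Python sketch of my own; the choice g(t) = e^t is arbitrary), the second order formula (27) says that g(t) − g(0) − g'(0)t equals g''(θt) t^2/2 for some θ ∈ (0, 1); for g(t) = e^t one can solve for θ and verify that it indeed lies in that interval:

```python
import numpy as np

# g(t) = exp(t): g(0) = 1, g'(0) = 1, g''(s) = exp(s).
# Formula (27): g(t) = g(0) + g'(0) t + g''(theta*t) t^2 / 2 for some theta in (0,1).
for t in [0.5, 1.0, 2.0]:
    remainder = np.exp(t) - 1.0 - t              # g(t) - g(0) - g'(0) t
    gpp_at_theta_t = 2.0 * remainder / t**2      # this must equal exp(theta*t)
    theta = np.log(gpp_at_theta_t) / t           # solve exp(theta*t) = gpp_at_theta_t
    print(f"t = {t}:  theta = {theta:.4f}  (strictly between 0 and 1)")
```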
2.6 The n-dimensional Taylor Formula
The n-dimensional Taylor formula will be quite important to us, and the derivation is based on
the one-dimensional formula above.
Let f : R^n → R, and assume that ∇f exists around x = 0. Let us write g(s) = f(sx). Then
g'(s) = Σ_{i=1}^n ∂f/∂x_i (sx) · d(sx_i)/ds = Σ_{i=1}^n ∂f/∂x_i (sx) x_i = ∇f(sx) x, (31)
and we obtain
f(x) = g(1)
     = g(0) + g'(θ) · 1 (32)
     = f(0) + ∇f(θx) x,  θ ∈ (0, 1),
which is the n-dimensional analogue of the Secant Formula. Note that the point θx is somewhere
on the line segment between 0 and x, and that the same θ applies to all components of x (but
again, θ is an unknown function of x).
As above, if ∇f is continuous at x = 0,
f(x) = f(0) + ∇f(0) x + (∇f(θx) − ∇f(0)) x
     = f(0) + ∇f(0) x + o(|x|). (33)
At this point we make an important digression. If a relation
f(x) = f(x_0) + ∇f(x_0)(x − x_0) + o(|x − x_0|) (34)
holds at x_0, we say that f is differentiable at x_0. The linear function
T_{x_0}(x) := f(x_0) + ∇f(x_0)(x − x_0), (35)
is then called the tangent plane of f at x_0. Thus, for a differentiable function,
f(x) = T_{x_0}(x) + o(|x − x_0|). (36)
Contrary to what is stated in the first edition of N&W (and numerous other non-mathematical
textbooks!), it is not sufficient that all partial derivatives exist at x_0 (Think about this for a
while: The components of ∇f contain only partial derivatives of f along the coordinate axes.
Find a function on R^2 where ∇f(0) = 0 but which, nevertheless, is not differentiable at x = 0.
E.g., consider the function defined as sin 2θ in polar coordinates).
The next term of the n-dimensional Taylor Formula is derived similarly:
g''(s) = d/ds [ Σ_{i=1}^n ∂f(sx)/∂x_i x_i ] = Σ_{i,j=1}^n ∂²f(sx)/(∂x_i ∂x_j) x_j x_i = x^t H(sx) x. (37)
The matrix H is called the Hess matrix of f, or the Hessian,
H(x) = ∇²f(x) = [ ∂²f(x)/(∂x_i ∂x_j) ]_{i,j=1}^n. (38)
Yes, Optimization Theory sometimes uses the unfortunate notation ∇²f(x), which is not the
familiar Laplacian used in Physics and PDE theory!
From the above, the second order Taylor formula may now be written
f(x) = f(0) + ∇f(0) x + (1/2) x^t ∇²f(θx) x,  θ ∈ (0, 1). (39)
Higher order terms get increasingly more complicated and are seldom used.
By truncating the n-dimensional Taylor series after the second term, we end up with what is
called a quadratic function, or a quadratic form,
q(x) = a + b^t x + (1/2) x^t A x. (40)
By considering quadratic functions we may analyze many important algorithms in optimization
theory analytically, and one very important case occurs if A is positive definite. The function is
then convex (see below) and min q(x) is obtained for the unique vector
x* = −A^{−1} b. (41)
We shall, from time to time, use the notation A > 0 to mean that the matrix A is positive
definite (NB! This does not mean that all a_{ij} > 0!). Similarly, A ≥ 0 means that A is positive
semidefinite.
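A small numerical illustration (a Python sketch; the matrix A, the vector b and the constant a below are arbitrary choices of mine): the minimizer (41) can be computed with a linear solve and checked against random perturbations, since q is strictly convex when A > 0:

```python
import numpy as np

# q(x) = a + b^t x + 0.5 x^t A x with A symmetric positive definite.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -2.0])
a = 5.0

def q(x):
    return a + b @ x + 0.5 * x @ A @ x

x_star = -np.linalg.solve(A, b)        # x* = -A^{-1} b, Eqn. (41)
print("x* =", x_star, "  q(x*) =", q(x_star))

# Every perturbed point should give a strictly larger value.
rng = np.random.default_rng(0)
print(all(q(x_star + 0.1 * rng.standard_normal(2)) > q(x_star) for _ in range(1000)))
```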
2.7 Matrix Norms
Positive definite matrices lead to what is called matrix (or skew) norms on R^n. The matrix norms
are important in the analysis of the Steepest Descent Method, and above all, in the derivation of
the Conjugate Gradient Method.
Assume that A is a symmetric positive definite n × n matrix with eigenvalues
0 < λ_1 ≤ λ_2 ≤ ··· ≤ λ_n, (42)
and a corresponding set of orthogonal and normalized eigenvectors {e_i}_{i=1}^n. Any vector x ∈ R^n
may be expanded into a series of the form
x = Σ_{i=1}^n α_i e_i, (43)
and hence,
A x = Σ_{i=1}^n α_i A e_i = Σ_{i=1}^n α_i λ_i e_i, (44)
and
x^t A x = Σ_{i=1}^n α_i^2 λ_i. (45)
The A-norm is defined
|x|_A := (x^t A x)^{1/2}. (46)
Since
λ_1 |x|^2 = λ_1 Σ_{i=1}^n α_i^2 ≤ x^t A x ≤ λ_n Σ_{i=1}^n α_i^2 = λ_n |x|^2, (47)
we observe that
λ_1^{1/2} |x| ≤ |x|_A ≤ λ_n^{1/2} |x|, (48)
and the norms |x| = |x|_2 and |x|_A are equivalent (as are any pair of norms in R^n). The
verifications of the norm properties are left for the reader:
(i) x = 0 ⟺ |x|_A = 0,
(ii) |αx|_A = |α| |x|_A,
(iii) |x + y|_A ≤ |x|_A + |y|_A. (49)
In fact, R^n even becomes a Hilbert space in this setting if we define a corresponding inner product
⟨·,·⟩_A as
⟨y, x⟩_A := y^t A x. (50)
It is customary to say that x and y are A-conjugate (or A-orthogonal) if ⟨y, x⟩_A = 0.
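A quick numerical check of the equivalence bounds (48) (a Python sketch; the matrix below is an arbitrary positive definite example):

```python
import numpy as np

# ||x||_A = (x^t A x)^{1/2} and the bounds sqrt(lambda_1)||x|| <= ||x||_A <= sqrt(lambda_n)||x||.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
lam = np.linalg.eigvalsh(A)                    # eigenvalues in ascending order

rng = np.random.default_rng(1)
for _ in range(5):
    x = rng.standard_normal(2)
    norm_2 = np.linalg.norm(x)
    norm_A = np.sqrt(x @ A @ x)
    assert np.sqrt(lam[0]) * norm_2 <= norm_A <= np.sqrt(lam[-1]) * norm_2 + 1e-12
    print(f"||x|| = {norm_2:.4f}   ||x||_A = {norm_A:.4f}")
```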
2.8 Basic Facts About Hilbert Space
A Hilbert space H is a linear space, and for our applications, consisting of vectors or functions.
In case you have never heard about a Hilbert space, use what you know about R^n.
It is first of all a linear space, so that if x, y ∈ H and α, β ∈ R, also αx + βy has a meaning and
is an element of H (We will not need complex spaces).
Furthermore, it has a scalar product ⟨·,·⟩ with its usual properties,
(i) ⟨x, y⟩ = ⟨y, x⟩ ∈ R,
(ii) ⟨αx + βy, z⟩ = α⟨x, z⟩ + β⟨y, z⟩. (51)
We say that two elements x and y are orthogonal if ⟨x, y⟩ = 0.
The scalar product defines a norm,
|x| = ⟨x, x⟩^{1/2}, (52)
and makes H into a normed space (The final, and a little more subtle, property which completes
the definition of a Hilbert space is that it is complete with respect to the norm, i.e. it is also
what is called a Banach space).
A Hilbert space may be finite dimensional, like R^n, or infinite dimensional, like l_2(N) (This space
consists of all infinitely dimensional vectors x = {x_i}_{i=1}^∞, where Σ_{i=1}^∞ |x_i|^2 < ∞).
Important properties of any Hilbert space include
The Schwarz Inequality: |⟨x, y⟩| ≤ |x| |y|
The Pythagorean Formula: If ⟨x, y⟩ = 0, then |x + y|^2 = |x|^2 + |y|^2
However, the really big theorem in Hilbert spaces related to optimization theory is the Projection
Theorem:
The Projection Theorem: If H_0 is a closed subspace of H and x ∈ H, then min_{y∈H_0} |x − y|
is obtained for a unique vector y_0 ∈ H_0, where
y_0 is orthogonal to the error e = x − y_0, that is, ⟨y_0, e⟩ = 0,
y_0 is the best approximation to x in H_0.
The theorem is often stated by saying that any vector in H may be written in a unique way as
x = y_0 + e, (53)
where y_0 ∈ H_0, and y_0 and e are orthogonal.
The projection theorem is by far the most important practical result about Hilbert spaces. It forms
the basis of everyday control theory and signal processing algorithms (e.g., dynamic positioning,
noise reduction and optimal filtering).
Our Hilbert spaces will have sets of orthogonal vectors of norm one, {e_i}, such that any x ∈ H
may be written as a series,
x = Σ_i α_i e_i,  α_i = ⟨x, e_i⟩, i = 1, 2, ... (54)
The set {e_i} is called a basis. Note also that
|x|^2 = Σ_i α_i^2. (55)
If H_n is the subspace spanned by e_1, ..., e_n, that is
H_n = span{e_1, ..., e_n} = { y ; y = Σ_{i=1}^n β_i e_i, β_i ∈ R }, (56)
then the series of any x ∈ H, truncated at term n, is the best approximation to x in H_n,
Σ_{i=1}^n α_i e_i = arg min_{y∈H_n} |x − y|. (57)
If you ever need some Hilbert space theory, the above will probably cover it.
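As a finite-dimensional illustration of (54)–(57) (a Python sketch; the dimensions and random data are arbitrary), the truncated series is simply the orthogonal projection onto H_n, and the error is orthogonal to it:

```python
import numpy as np

# Best approximation (57) in R^5: project x onto H_3 = span{e_1, e_2, e_3}
# for an orthonormal set {e_i}; the error is orthogonal to the projection.
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((5, 3)))   # columns are orthonormal e_1, e_2, e_3
x = rng.standard_normal(5)

alpha = Q.T @ x            # coefficients alpha_i = <x, e_i>, Eqn. (54)
y0 = Q @ alpha             # truncated series = projection onto H_3
e = x - y0                 # the error

print("|error|       :", np.linalg.norm(e))
print("<y0, e> (~ 0) :", y0 @ e)

# y0 also minimizes |x - y| over H_3: compare with a least-squares solve.
beta, *_ = np.linalg.lstsq(Q, x, rcond=None)
print("max |alpha - beta| :", np.abs(alpha - beta).max())
```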
3 THE OPTIMIZATION SETTING
Since there is no need to repeat a result for maxima if we have proved it for minima, we shall
only consider minima in this course. That is, we consider the problem
min_{x∈Ω} f(x), (58)
where Ω is called the feasible domain.
The definitions of local, global, and strict minima should be known to the readers, but we repeat
them here for completeness.
x* is a local minimum if there is a neighborhood N of x* such that f(x*) ≤ f(x) for all
x ∈ N ∩ Ω.
x* is a global minimum if f(x*) ≤ f(x) for all x ∈ Ω.
A local minimum x* is a strict (or an isolated) local minimum if there is a neighborhood N such that
f(x*) < f(x) for all x ∈ N ∩ Ω, x ≠ x*.
It is convenient to use the notation
x* = arg min_{x∈Ω} f(x) (59)
for a solution x* of (58). If there is only one minimum, which is then both global and strict, we
say it is unique.
3.1 The Existence Theorem for Minima
As we saw for some trivial cases above, a minimum does not necessarily exist. So what about a
criterion for existence? The following result is fundamental:
Assume that f is a continuous function defined on a closed and bounded set Ω ⊂ R^n. Then there
exists x* ∈ Ω such that
f(x*) = min_{x∈Ω} f(x). (60)
This theorem, which states that the minimum (and not only an infimum) really exists, is the most
basic existence theorem for minima that we have. A parallel version exists for maxima.
Because of this result, we always prefer that the domain we are taking the minimum or maximum
over is bounded and closed (Later in the text, when we consider a domain Ω, think of it as closed).
Let us look at the proof. We first establish that f is bounded below over Ω, that is, inf_{x∈Ω} f(x) is
finite. Assume the opposite. Then there are x_n ∈ Ω such that f(x_n) < −n, n = 1, 2, 3, ... Hence
lim_{n→∞} f(x_n) = −∞. At the same time, since Ω was bounded and closed, there are convergent
subsequences, say lim_{k→∞} x_{n_k} = x_0 ∈ Ω. But lim_{k→∞} f(x_{n_k}) = −∞ ≠ f(x_0), thus contradicting
that f is continuous, and hence finite, at x_0.
Since f is bounded below, we know that there is an a ∈ R such that
a = inf_{x∈Ω} f(x). (61)
Since a is the largest number that is less or equal to f(x) for all x ∈ Ω, we also know that for
any n, there must be an x_n ∈ Ω such that
f(x_n) < a + 1/n (62)
(think about it!).
We thus obtain, as above, a sequence {x_n} that has a convergent subsequence {x_{n_k}}_{k=1}^∞,
lim_{k→∞} x_{n_k} = x_0 ∈ Ω. (63)
Since f is continuous, we also have
f(x_{n_k}) → f(x_0), k → ∞. (64)
On the other hand,
a ≤ f(x_{n_k}) < a + 1/n_k. (65)
Figure 1: (a) Feasible directions in the interior and on the boundary of Ω. (b) Feasible directions
when Ω (the circle itself, not the disc!) does not contain any line segment.
Hence
f(x_0) = a. (66)
But this means that
f(x_0) = a = inf_{x∈Ω} f(x) = min_{x∈Ω} f(x), (67)
which is exactly what we set out to prove!
3.2 The Directional Derivative and Feasible Directions
Consider a function f : R^n → R as above. The directional derivative of f at x in the direction
d ≠ 0 is defined as
δf(x, d) = lim_{ε→0+} [f(x + εd) − f(x)] / ε. (68)
Assume that ∇f is continuous around x. Then, from the Taylor Formula,
δf(x, d) = lim_{ε→0+} [f(x + εd) − f(x)] / ε = lim_{ε→0+} ∇f(x + θεd)(εd) / ε = ∇f(x) d, (69)
which is the important formula for applications. The notation δf(x, d) contains both a point x
and a direction d out from x. Note that the definition does not require that |d| = 1 and that the
answer depends on |d|. The directional derivative exists where ordinary derivatives don't, like
for f(x) = |x| at the origin (What is δ|x|(0, d)?).
If we consider a domain Ω ⊂ R^n and x ∈ Ω, a feasible direction out from x is a vector d pointing
into Ω, as illustrated in Fig. 1 (a). Note that the length of d is of no importance for the existence,
since x + εd will be in Ω when ε is small enough. At an interior point (surrounded by a ball in
R^n that is also in Ω), all directions will be feasible.
It will later be convenient also to consider limiting feasible directions, as shown in Fig. 1(b): A
direction d is feasible if there exists a continuous curve γ(t) ⊂ Ω, where γ(0) = x, so that
d/|d| = lim_{t→0+} [γ(t) − x] / |γ(t) − x|. (70)
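A quick finite-difference check of formula (69) (a Python sketch; the function f and the point are arbitrary choices of mine), also showing that the value scales with |d|:

```python
import numpy as np

# One-sided difference quotients converge to grad f(x) . d, Eqn. (69).
def f(x):
    return x[0]**2 + 3.0 * x[0] * x[1] + np.sin(x[1])

def grad_f(x):
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + np.cos(x[1])])

x = np.array([1.0, -0.5])
d = np.array([2.0, 1.0])                     # deliberately not normalized
for eps in [1e-2, 1e-4, 1e-6]:
    quotient = (f(x + eps * d) - f(x)) / eps
    print(f"eps = {eps:.0e}   quotient = {quotient:.8f}   grad.d = {grad_f(x) @ d:.8f}")
```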
3.3 First and Second Order Conditions for Minima
First order conditions deal with first derivatives.
The following result is basic: If δf(x, d) < 0, there is an ε_0 such that
f(x + εd) < f(x) for all ε ∈ (0, ε_0). (71)
In particular, such a point can not be a minimum! The proof is simple: Since δf(x, d) < 0, also
[f(x + εd) − f(x)] / ε < 0 for all ε ∈ (0, ε_0) (72)
when ε_0 is small enough.
Corollary 1: If x* is a local minimum for f(x) where directional derivatives exist, then δf(x*, d) ≥ 0
for all feasible directions.
Otherwise, we could walk out from x* in a direction d where δf(x*, d) < 0.
Corollary 2: If x* is a local minimum for f(x), and ∇f is continuous around x*, then
∇f(x*) d ≥ 0 for all feasible directions.
Yes, in that case, δf(x*, d) is simply equal to ∇f(x*) d.
Corollary 3 (N&W, Thm. 2.2): If x* is an interior local minimum for f(x) where ∇f
exists, then ∇f(x*) = 0.
Assume that, e.g., ∂f/∂x_j(x*) ≠ 0. Then one of the directional derivatives (in the x_j- or −x_j-direction)
is negative.
Corollaries 1–3 state necessary conditions; they will not guarantee that x* is really a minimum
(Think of f(x) = x^3 at x = 0).
The second order conditions bring in the Hessian, and the first result is Thm. 2.3 in N&W:
If x* is an interior local minimum and ∇²f is continuous around x*, then ∇²f(x*) is positive
semidefinite (∇²f(x*) ≥ 0).
The argument is again by contradiction: Assume that d^t ∇²f(x*) d = a < 0 for some d ≠ 0.
Since ∇f(x*) d = 0 (Corollary 3), it follows from the Taylor Formula that
[f(x* + εd) − f(x*)] / ε^2 = (1/2) d^t ∇²f(x* + θεd) d → (1/2) a < 0,  ε → 0. (73)
Thus, there is an ε_0 such that
f(x* + εd) < f(x*) (74)
for all ε ∈ (0, ε_0), and x* can not be a minimum.
However, contrary to the first order conditions, the slightly stronger property that ∇²f(x*) is
positive definite, ∇²f(x*) > 0, and not only semidefinite, gives a sufficient condition for a strict
local minimum:
Assume that ∇²f is continuous around x*, ∇f(x*) = 0, and ∇²f(x*) > 0. Then x* is a strict
local minimum.
Since ∇²f is continuous and ∇²f(x*) > 0, it will even be positive definite in a neighborhood of
x*, say |x − x*| < δ (The eigenvalues are continuous functions of the matrix elements, which in
turn are continuous functions of x). Then, for 0 < |y| < δ,
f(x* + y) − f(x*) = ∇f(x*) y + (1/2) y^t ∇²f(x* + θy) y
                  = 0 + (1/2) y^t ∇²f(x* + θy) y > 0. (75)
Figure 2: For a convex set, all straight line segments connecting two points are contained in the set.
Thus, x* is a strict local minimum.
Simple counter-examples show that ∇²f(x*) ≥ 0 alone is not sufficient: Check f(x, y) = x^2 − y^3.
To sum up, the possible minima of f(x) are at points x_0 where δf(x_0, d) ≥ 0 for all feasible
directions. In particular, if ∇f(x) exists and is continuous, possible candidates are
interior points where ∇f(x) = 0,
points on the boundary where ∇f(x) d ≥ 0 for all feasible directions.
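A small numerical look at the counter-example above (a Python sketch of my own): at the origin, f(x, y) = x^2 − y^3 has zero gradient and a positive semidefinite Hessian, and yet the origin is not a local minimum:

```python
import numpy as np

# f(x, y) = x^2 - y^3: grad f(0,0) = 0 and the Hessian [[2, 0], [0, -6y]] is
# positive semidefinite at the origin, but f takes negative values along y > 0.
def f(v):
    x, y = v
    return x**2 - y**3

hessian_at_origin = np.array([[2.0, 0.0],
                              [0.0, 0.0]])
print("eigenvalues:", np.linalg.eigvalsh(hessian_at_origin))   # [0, 2] -> only semidefinite

for t in [1e-1, 1e-2, 1e-3]:
    print(f"f(0, {t}) = {f((0.0, t)):.3e}  <  f(0, 0) = 0")
```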
4 BASIC CONVEXITY
Convexity is one of the most important concepts in optimization. Although the results here are
all quite simple and obvious, they are nevertheless very powerful.
4.1 Convex Sets
A convex set Ω in R^n is a set having the following property:
If x, y ∈ Ω, then θx + (1 − θ)y ∈ Ω for all θ ∈ (0, 1).
The concept can be generalized to all kinds of sets (functions, matrices, stochastic variables, etc.),
where a combination of the form θx + (1 − θ)y makes sense.
It is convenient, but not of much practical use, to define the empty set as convex.
Note that a convex set has to be connected, and can not consist of isolated subsets.
Determine which of the following sets are convex:
The space R^2
{(x, y) ∈ R^2 ; x^2 + 2y^2 ≤ 2}
{(x, y) ∈ R^2 ; x^2 − 2y^2 ≤ 2}
{x ∈ R^n ; Ax ≤ b}, b ∈ R^m and A ∈ R^{m×n}
One basic theorem about convex sets is the following:
Theorem 1: If Ω_1, ..., Ω_N ⊂ R^n are convex sets, then
Ω_1 ∩ ··· ∩ Ω_N = ∩_{i=1}^N Ω_i (76)
is convex.
Proof: Choose two points x, y ∈ ∩_{i=1}^N Ω_i. Then θx + (1 − θ)y ∈ Ω_i for i = 1, ..., N, that is,
θx + (1 − θ)y ∈ ∩_{i=1}^N Ω_i.
Thus, intersections of convex sets are convex!
4.2 Convex Functions
A real-valued function f is convex on the convex set Ω if for all x_1, x_2 ∈ Ω,
f(θx_1 + (1 − θ)x_2) ≤ θ f(x_1) + (1 − θ) f(x_2),  θ ∈ (0, 1). (77)
Consider the graph of f and the connecting line segment from (x_1, f(x_1)) to (x_2, f(x_2)),
consisting of the following points in R^{n+1}:
( θx_1 + (1 − θ)x_2 , θ f(x_1) + (1 − θ) f(x_2) ),  θ ∈ (0, 1).
The function is convex if all such line segments lie on or above the graph. Note that a linear
function, say
f(x) = b^t x + a, (78)
is convex according to this definition, since in that particular case, Eqn. 77 will always be an
equality.
When the inequality in Eqn. 77 is strict, that is, we have "<" instead of "≤", then we say that
the function is strictly convex. A linear function is convex, but not strictly convex.
Note that a convex function may not be continuous: Let Ω = [0, ∞) and f be the function
f(x) = 1 for x = 0, and f(x) = 0 for x > 0. (79)
Show that f is convex. This example is a bit strange, and we shall only consider continuous
convex functions in the following.
Proposition 1: If f and g are convex, and α, β ≥ 0, then αf + βg is convex (on the common
convex domain where both f and g are defined).
Idea of proof: Show that αf + βg satisfies the definition in Eqn. 77.
What is the conclusion in Proposition 1 if at least one of the functions is strictly convex and α,
β > 0? Can Proposition 1 be generalized?
Proposition 2: If f is convex, then the set
C = {x ; f(x) ≤ c} (80)
is convex.
Figure 3: Simple examples of graphs of convex and strictly convex functions (should be used only as
mental images!).
Proof: Assume that x_1, x_2 ∈ C. Then
f(θx_1 + (1 − θ)x_2) ≤ θ f(x_1) + (1 − θ) f(x_2)
                     ≤ θc + (1 − θ)c = c. (81)
This proposition has an important corollary for sets defined by several inequalities:
Corollary 1: Assume that the functions f_1, f_2, ..., f_m are convex. Then the set
Ω = {x ; f_1(x) ≤ c_1, f_2(x) ≤ c_2, ..., f_m(x) ≤ c_m} (82)
is convex.
Try to show that the maximum of a collection of convex functions, g(x) = max_i f_i(x), is also
convex.
We recall that differentiable functions have tangent planes
T_{x_0}(x) = f(x_0) + ∇f(x_0)(x − x_0), (83)
and
f(x) − T_{x_0}(x) = o(|x − x_0|). (84)
Proposition 3: A differentiable function on the convex set Ω is convex if and only if its graph
lies above its tangent planes.
Proof: Let us start by assuming that f is convex and x_0, x ∈ Ω. Then
∇f(x_0)(x − x_0) = δf(x_0; x − x_0) = lim_{ε→0+} [f(x_0 + ε(x − x_0)) − f(x_0)] / ε
≤ lim_{ε→0+} {[(1 − ε) f(x_0) + ε f(x)] − f(x_0)} / ε (85)
= f(x) − f(x_0).
Thus,
f(x) ≥ f(x_0) + ∇f(x_0)(x − x_0) = T_{x_0}(x). (86)
Figure 4: A useful mental image of a convex function: Connecting line segments above, and tangent
planes below the graph!
For the opposite, assume that the graph of f lies above its tangent planes. Consider two arbitrary
points x_1 and x_2 in Ω and a point x_0 on the line segment between them, x_0 = θx_1 + (1 − θ)x_2.
Then
f(x_1) ≥ f(x_0) + ∇f(x_0)(x_1 − x_0),
f(x_2) ≥ f(x_0) + ∇f(x_0)(x_2 − x_0). (87)
Multiply the first equation by θ and the last by (1 − θ) and show that this implies that
θ f(x_1) + (1 − θ) f(x_2) ≥ f(x_0), (88)
which is exactly the property that shows that f is convex.
The rule to remember is therefore:
The graph of a (differentiable) convex function lies above all its tangent planes and below the line
segments between arbitrary points on the graph.
The following proposition assumes that the second order derivatives of f, that is, the Hessian
∇²f, exist in Ω. We leave out the proof, which is not difficult:
Proposition 4: A smooth function f defined on a convex set Ω is convex if and only if ∇²f is
positive semi-definite in Ω. Moreover, f will be strictly convex if ∇²f is positive definite.
The opposite of convex is concave. The definition should be obvious. Most functions occurring in
practice are either convex or concave locally, but not on their whole domain of definition.
All results above have counterparts for concave functions.
4.3 The Main Theorem Connecting Convexity and Optimization
The results about minimization of convex functions defined on convex sets are simple, but very
powerful:
Theorem 2: Let f be a convex function defined on the convex set Ω. If f has minima in Ω,
these are global minima and the set of minima,
M = { y ∈ Ω ; f(y) = min_{x∈Ω} f(x) } (89)
is convex.
Note 1: Let Ω = R and f(x) = e^x. In this case the convex function f(x) defined on the convex
set R has no minima.
Note 2: Note that M itself is convex: All minima are collected at one place. There are no isolated
local minima here and there!
Proof: Assume that x_0 is a minimum which is not a global minimum. We then know there is
a y ∈ Ω where f(y) < f(x_0). The line segment going from (x_0, f(x_0)) to (y, f(y)) is therefore
sloping downward. However, because f is convex,
f(θx_0 + (1 − θ)y) ≤ θ f(x_0) + (1 − θ) f(y) < f(x_0), (90)
for all θ ∈ [0, 1). Hence, x_0 can not be a local minimum without also being a global minimum!
Assume that f(x_0) = c. Then
M = { y ∈ Ω ; f(y) = min_{x∈Ω} f(x) } = { y ∈ Ω ; f(y) = c } (91)
  = { y ∈ Ω ; f(y) ≤ c },
is convex by Proposition 2.
Corollary 1: Assume that f is a convex function on the convex set Ω and assume that the
directional derivatives exist at x_0 ∈ Ω. Then x_0 belongs to the set of global minima of f(x) in Ω if
and only if δf(x_0, d) ≥ 0 for all feasible directions.
Proof: We already know that δf(x_0, d) would be nonnegative if x_0 is a (global) minimum, so
assume that x_0 is not a global minimum. Then f(y) < f(x_0) for some y ∈ Ω, and d = y − x_0 is
a feasible direction (why?). But this implies that
δf(x_0, y − x_0) = lim_{ε→0+} [f(x_0 + ε(y − x_0)) − f(x_0)] / ε
≤ lim_{ε→0+} [ε f(y) + (1 − ε) f(x_0) − f(x_0)] / ε
= f(y) − f(x_0) < 0. (92)
Corollary 2: Assume that f is a differentiable convex function on the convex set Ω and that
∇f(x_0) = 0. Then x_0 belongs to the set of global minima of f(x) in Ω.
Proof: Here δf(x_0, d) = ∇f(x_0) d = 0 (which is larger or equal to 0!).
Note that if f is convex on the convex set Ω, and δf(x, y − x) exists for all x, y ∈ Ω, then
inequality (92) may be written
f(y) ≥ f(x) + δf(x, y − x).
Life is easy when the functions are convex, and one usually puts quite some effort either into
formulating the problem so that it is convex, or into trying to prove that for the problem at hand!
4.4 JENSEN'S INEQUALITY AND APPLICATIONS
Jensen's Inequality is a classic result in mathematical analysis where convexity plays an essential
role. The inequality may be extended to a double-inequality which is equally simple to derive.
Figure 5: Think of the points as mass-particles and determine their center of gravity!
The inequality is among the few statements in mathematics where the proof is easier to remember
than the result itself!
Let φ be a convex function, φ : R → R. We first consider the discrete case where λ_1 ≤ ··· ≤ λ_n,
and {w_i}_{i=1}^n are positive numbers. Jensen's double-inequality then goes as follows:
φ(λ̄) ≤ φ̄(λ) ≤ (1 − θ) φ(λ_1) + θ φ(λ_n), (93)
where
λ̄ = Σ_{i=1}^n w_i λ_i / Σ_{i=1}^n w_i,
φ̄(λ) = Σ_{i=1}^n w_i φ(λ_i) / Σ_{i=1}^n w_i, (94)
θ = (λ̄ − λ_1) / (λ_n − λ_1).
The name "Jensen's double inequality" is not very common, but suitable since there are two
(non-trivial) inequalities involved.
The proof may be read directly out from Fig. 5, thinking in pure mechanical terms: The center
of gravity for the n mass points {(λ_i, φ(λ_i))}_{i=1}^n with weights {w_i}_{i=1}^n is located at (λ̄, φ̄(λ)).
Because of the convexity of φ, the ordinate φ̄(λ) has to be somewhere between φ(λ̄) and l(λ̄),
that is, the point corresponding to λ̄ on the line segment joining (λ_1, φ(λ_1)) and (λ_n, φ(λ_n)).
That is all!
It is the left part of the double inequality that traditionally is called Jensen's Inequality.
Also try to write the inequality in the case when w is a positive function of λ, and derive the
following inequality for a real stochastic variable:
exp(E X) ≤ E(exp(X)) (95)
(Hint: The mass density is now the probability density w(λ) for the variable, and recall that
E X = ∫_{−∞}^{∞} λ w(λ) dλ).
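A quick numerical check of the double inequality (93) (a Python sketch; the convex function φ(λ) = e^λ and the random data are my own choices):

```python
import numpy as np

# Check phi(lambda_bar) <= phi_bar <= (1-theta) phi(lambda_1) + theta phi(lambda_n)
# for the convex function phi(lambda) = exp(lambda).
rng = np.random.default_rng(3)
lam = np.sort(rng.uniform(-1.0, 2.0, size=6))    # lambda_1 <= ... <= lambda_n
w = rng.uniform(0.1, 1.0, size=6)                # positive weights

phi = np.exp
lam_bar = np.sum(w * lam) / np.sum(w)
phi_bar = np.sum(w * phi(lam)) / np.sum(w)
theta = (lam_bar - lam[0]) / (lam[-1] - lam[0])

lhs = phi(lam_bar)
rhs = (1.0 - theta) * phi(lam[0]) + theta * phi(lam[-1])
print(f"{lhs:.4f} <= {phi_bar:.4f} <= {rhs:.4f}")
assert lhs <= phi_bar <= rhs
```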
A lot of inequalities are derived from the left hand side of Jensen's double-inequality. However,
the Kantorovitch Inequality, discussed next, is an exception, since it is based on the right hand
part of the inequality.
4.4.1 Application 1: Kantorovitch Inequality
The Kantorovitch Inequality goes as follows:
If A is a positive definite matrix with eigenvalues λ_1 ≤ λ_2 ≤ ··· ≤ λ_n, then
|x|_A^2 |x|_{A^{−1}}^2 / |x|^4 ≤ (1/4) (λ_1 + λ_n)^2 / (λ_1 λ_n). (96)
Since the inequality is invariant with respect to the norm of x, we shall assume that x = Σ_{i=1}^n α_i e_i,
and set w_i = α_i^2 so that
Σ_{i=1}^n w_i = |x|^2 = 1. (97)
Since we are on the positive real axis, the function φ(λ) = 1/λ is convex, and
|x|_A^2 = x^t A x = Σ_{i=1}^n λ_i w_i = λ̄,
|x|_{A^{−1}}^2 = x^t A^{−1} x = Σ_{i=1}^n (1/λ_i) w_i = φ̄(λ). (98)
Thus, by applying the RHS of Jensen's double-inequality,
|x|_A^2 |x|_{A^{−1}}^2 = λ̄ φ̄(λ) ≤ λ̄ [ (1 − θ) (1/λ_1) + θ (1/λ_n) ] (99)
= λ̄ [ (1 − (λ̄ − λ_1)/(λ_n − λ_1)) (1/λ_1) + ((λ̄ − λ_1)/(λ_n − λ_1)) (1/λ_n) ].
The right hand side is a second order polynomial in λ̄ with a maximum value,
(1/4) (λ_1 + λ_n)^2 / (λ_1 λ_n), (100)
attained for λ̄ = (λ_1 + λ_n)/2 (Check it!). This proves the inequality.
Show that the inequality can not, in general, be improved by considering A equal to the 2 × 2
unit matrix.
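A numerical illustration of (96) (a Python sketch; the diagonal matrix is an arbitrary example with λ_1 = 1 and λ_n = 10):

```python
import numpy as np

# Check the Kantorovitch bound for many random x and a fixed positive definite A.
A = np.diag([1.0, 2.0, 5.0, 10.0])
A_inv = np.linalg.inv(A)
lam1, lamn = 1.0, 10.0
bound = 0.25 * (lam1 + lamn)**2 / (lam1 * lamn)

rng = np.random.default_rng(4)
worst = 0.0
for _ in range(10000):
    x = rng.standard_normal(4)
    ratio = (x @ A @ x) * (x @ A_inv @ x) / np.linalg.norm(x)**4
    worst = max(worst, ratio)
print(f"largest observed ratio = {worst:.4f}  <=  bound = {bound:.4f}")
```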
4.4.2 Application 2: The Convergence of the Steepest Descent Method
It will in general be reasonable to assume that f has the form
f(x) = f(x*) + ∇f(x*)(x − x*) + (1/2)(x − x*)^t ∇²f(x*)(x − x*) + ··· (101)
near a local minimum x*. The convergence can therefore be studied in terms of the Test Problem
min_x f(x), (102)
where
f(x) = b^t x + (1/2) x^t A x,  A > 0.
We know that the gradient direction g = (∇f)^t in this case is equal to b + Ax, and the Hessian
∇²f is equal to A. The problem has a unique solution for b + Ax = 0, that is, x* = −A^{−1} b.
At a certain point x_k, the steepest descent is along the direction −g_k = −(b + A x_k). We therefore
have to solve the one-dimensional sub-problem
α_k = arg min_α f(x_k − α g_k).
It is easy to see that the minimum is attained at a point
x_{k+1} = x_k − α_k g_k, (103)
where the level curves (contours) of f are parallel to g_k, that is,
∇f(x_{k+1}) g_k = 0, (104)
or g_{k+1}^t g_k = 0. This gives us the equation
[b + A(x_k − α_k g_k)]^t g_k = (g_k − α_k A g_k)^t g_k = 0, (105)
or
α_k = g_k^t g_k / (g_k^t A g_k) = |g_k|^2 / |g_k|_A^2. (106)
The algorithm, which at the same time is an iterative method for the system Ax = −b, goes as
follows:
Given x_1 and g_1 = b + A x_1.
for k = 1 until convergence do
   α_k = g_k^t g_k / (g_k^t A g_k)
   x_{k+1} = x_k − α_k g_k
   g_{k+1} = g_k − α_k A g_k
end
In order to get an estimate of the error on step k, we note that
A^{−1} g_k = A^{−1} (b + A x_k) = −x* + x_k. (107)
Hence,
|x_k − x*|_A^2 = (A^{−1} g_k)^t A (A^{−1} g_k) = |g_k|_{A^{−1}}^2, (108)
and
|x_{k+1} − x*|_A^2 / |x_k − x*|_A^2 = |g_{k+1}|_{A^{−1}}^2 / |g_k|_{A^{−1}}^2. (109)
Let us look at |g_{k+1}|_{A^{−1}}^2 on the right hand side:
|g_{k+1}|_{A^{−1}}^2 = g_{k+1}^t A^{−1} (g_k − α_k A g_k)
= g_{k+1}^t A^{−1} g_k − α_k g_{k+1}^t g_k = g_{k+1}^t A^{−1} g_k (110)
= (g_k − α_k A g_k)^t A^{−1} g_k = g_k^t A^{−1} g_k − (g_k^t g_k)^2 / (g_k^t A g_k)
= |g_k|_{A^{−1}}^2 − |g_k|^4 / |g_k|_A^2.
Thus,
|x_{k+1} − x*|_A^2 / |x_k − x*|_A^2 = |g_{k+1}|_{A^{−1}}^2 / |g_k|_{A^{−1}}^2 = 1 − |g_k|^4 / ( |g_k|_{A^{−1}}^2 |g_k|_A^2 )
≤ 1 − 4 λ_1 λ_n / (λ_1 + λ_n)^2 (111)
= ( (λ_n − λ_1) / (λ_1 + λ_n) )^2 = ( (κ − 1) / (κ + 1) )^2,
where the Kantorovitch Inequality was applied for the inequality in the middle. We recognize
κ = λ_n / λ_1 as the condition number of the Hessian A.
If the condition number of the Hessian is large, the convergence of the steepest descent method
may be very slow!
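A compact Python sketch of the algorithm above (my own implementation; the matrix, the vector and the starting point are arbitrary), which also checks the per-step ratio against the bound (111):

```python
import numpy as np

# Steepest descent for f(x) = b^t x + 0.5 x^t A x, with the error measured in the A-norm.
A = np.diag([1.0, 50.0])                 # condition number kappa = 50
b = np.array([1.0, -1.0])
x_star = -np.linalg.solve(A, b)
kappa = 50.0
bound = ((kappa - 1.0) / (kappa + 1.0))**2

def err_A(x):                            # |x - x*|_A^2
    d = x - x_star
    return d @ A @ d

x = np.array([3.0, 2.0])
g = b + A @ x
for k in range(1, 11):
    alpha = (g @ g) / (g @ A @ g)
    x_new = x - alpha * g
    print(f"k = {k}: ratio = {err_A(x_new) / err_A(x):.4f}  (bound {bound:.4f})")
    x, g = x_new, g - alpha * (A @ g)
```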
5 THE IMPLICIT FUNCTION THEOREM
The Implicit Function Theorem is a classical result in mathematical analysis. This means that
all mathematicians know it, but can't really recall where they learnt it. The theorem may be
stated in different ways, and it is not so simple to see the connection between the formulation in,
e.g., N&W (Theorem A1, p. 585) and Luenberger (p. 462–63). In this short note we first state the
theorem and try to explain why it is reasonable. Then we give a short proof based on the Taylor
Formula and Banach's Fixed-point Theorem.
An implicit function is a function defined in terms of an equation, say
x^2 + y^2 − 1 = 0. (112)
Given a general equation h(x, y) = 0, it is natural to ask whether it is possible to write this as
y = f(x). For Eqn. 112, it works well locally around a solution (x_0, y_0), except for the points
(1, 0) and (−1, 0). In more difficult situations it may not be so obvious, and then the Implicit
Function Theorem is valuable.
The Implicit Function Theorem tells us that if we have an equation h(x, y) = 0 and a solution
(x_0, y_0), h(x_0, y_0) = 0, then there exists (if the conditions of the theorem are valid) a neighborhood
X around x_0 such that we may write
y = f(x),
h(x, f(x)) = 0, for all x ∈ X. (113)
The theorem guarantees that f exists, but does not solve the equation for us, and does not say
in a simple way how large X is.
Consider the implicit function equation
x^2 − y^2 = 0 (114)
to see that we only find solutions in a neighborhood of a known solution, and that we, in this
particular case, will have problems at the origin.
We are going to present a somewhat simplified version of the theorem which, however, is general
enough to show the essentials.
Let
h(x, y) = 0 (115)
be an equation involving the m-dimensional vector y and the n-dimensional vector x. Assume
that h is m-dimensional, such that there is hope that a solution with respect to y exists. We thus
have m nonlinear scalar equations for the m unknown components of y.
Assume we know at least one solution (x_0, y_0) of Eqn. 115, and by moving the origin to (x_0, y_0),
we may assume that this solution is the origin, h(0, 0) = 0. Let the matrix B be the Jacobian of
h with respect to y at (0, 0):
B = ∂h/∂y (0) = [ ∂h_i/∂y_j ](0). (116)
The Implicit Function Theorem may then be stated as follows:
Assume that h is a differentiable function with continuous derivatives both in x and y. If the
matrix B is non-singular, there is a neighborhood X around x = 0 where we can write y = f(x)
for a differentiable function f such that
h(x, f(x)) = 0, x ∈ X. (117)
The theorem is not unreasonable: Consider the Taylor expansion of h around (0, 0):
h(x, y) = h(0, 0) + Ax + By + o(|x|, |y|)
        = Ax + By + o(|x|, |y|). (118)
The matrix A is the Jacobian of h with respect to x, and B is the matrix above. To the first
order, we thus have to solve the equation
Ax + By = 0, (119)
with respect to y, and if B is non-singular, this is simply
y = −B^{−1} A x. (120)
The full proof of the Implicit Function Theorem is technical, and it is perfectly OK to stop the
reading here!
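Before the proof, a small numerical illustration (a Python sketch of my own, using the circle example (112) with the known solution (x_0, y_0) = (0, 1)): the mapping T used in the proof below can be written y → y − B^{−1} h(x, y), and iterating it produces the implicit function y = f(x) for x close to x_0:

```python
import numpy as np

# h(x, y) = x^2 + y^2 - 1 with the known solution (x0, y0) = (0, 1).
def h(x, y):
    return x**2 + y**2 - 1.0

x0, y0 = 0.0, 1.0
B = 2.0 * y0                        # dh/dy at (x0, y0), non-singular

def f_implicit(x, iters=50):
    y = y0                          # start from the known solution
    for _ in range(iters):
        y = y - h(x, y) / B         # the contraction y -> y - B^{-1} h(x, y)
    return y

for x in [0.0, 0.1, 0.3]:
    y = f_implicit(x)
    print(f"x = {x}: y = {y:.6f}, exact sqrt(1 - x^2) = {np.sqrt(1.0 - x**2):.6f}, h = {h(x, y):.2e}")
```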
For the brave, we start by stating the Taylor Formula to first order for a vector valued function
y = g(x), x ∈ R^n, y ∈ R^m:
g(x) = g(x_0) + ∇g(x_θ)(x − x_0),
x_θ = Θ x_0 + (I − Θ) x. (121)
Note that since g has m components, ∇g(x_θ) is an m × n matrix (the Jacobian) with rows
∇g(x_θ) = [ ∇g_1(x_{θ_1}) ; ∇g_2(x_{θ_2}) ; ... ; ∇g_m(x_{θ_m}) ], (122)
and Θ is a matrix, Θ = diag{θ_1, ..., θ_m}. We shall assume that all gradients are continuous as
well, and hence
g(x) = g(x_0) + ∇g(x_0)(x − x_0) + (∇g(x_θ) − ∇g(x_0))(x − x_0)
     = g(x_0) + ∇g(x_0)(x − x_0) + a(x, x_0)(x − x_0), (123)
where a(x, x_0) → 0 when x → x_0.
Put
φ(x, y) = h(x, y) − Ax − By, (124)
where, as above, A = ∂h/∂x(0) and B = ∂h/∂y(0). From the Taylor Formula,
φ(x, y) = a(x, y) x + b(x, y) y, (125)
where both a and b tend to 0 when x, y → 0. Thus, for any positive ε, there are neighborhoods
B(x, r_x) = {x ; ‖x‖ < r_x},
B(y, r_y) = {y ; ‖y‖ < r_y}, (126)
such that
(i) ‖φ(x, y)‖ ≤ ε‖x‖ + ε‖y‖,  x ∈ B(x, r_x), y ∈ B(y, r_y),
(ii) ‖φ(x_1, y_1) − φ(x_2, y_2)‖ ≤ ε‖x_1 − x_2‖ + ε‖y_1 − y_2‖,
   x_1, x_2 ∈ B(x, r_x), y_1, y_2 ∈ B(y, r_y). (127)
We now define the non-linear mapping y → T(y) as
T(y) := −B^{−1} A x − B^{−1} φ(x, y), (128)
and will show that this mapping is a contraction on B(y, r_y) for all x ∈ B(x, r_x). This is the core
of the proof.
Choose ε so small that ε + ‖B^{−1}‖ε < 1. Then, find r_x and r_y such that (i) and (ii) hold, and
also ensure that r_x is so small that
r_x < ε r_y / ( ‖B^{−1}A‖ + ‖B^{−1}‖ε ). (129)
Let y ∈ B(y, r_y) and x ∈ B(x, r_x). Then,
‖T(y)‖ = ‖ −B^{−1} A x − B^{−1} φ(x, y) ‖
≤ ‖B^{−1}A‖ ‖x‖ + ‖B^{−1}‖ ( ε‖x‖ + ε‖y‖ )
≤ ( ‖B^{−1}A‖ + ‖B^{−1}‖ε ) r_x + ‖B^{−1}‖ε r_y (130)
≤ ε r_y + ‖B^{−1}‖ε r_y ≤ r_y.
Thus T(B(y, r_y)) ⊂ B(y, r_y). Moreover,
‖T(y_1) − T(y_2)‖ = ‖ B^{−1} ( φ(x, y_1) − φ(x, y_2) ) ‖
≤ ε ‖B^{−1}‖ ‖y_1 − y_2‖ (131)
< (1 − ε) ‖y_1 − y_2‖.
The Banach Fixed-point Theorem now guarantees a solution y_0 ∈ B(y, r_y) fulfilling
y_0 = T(y_0) = −B^{−1} A x − B^{−1} φ(x, y_0), (132)
or
A x + B y_0 + φ(x, y_0) = h(x, y_0) = 0 (133)
for all x ∈ B(x, r_x)!
This proves the existence of the function x → f(x) = y_0 in the theorem for all x ∈ B(x, r_x).
The continuity is simple:
y_2 − y_1 = −B^{−1} A (x_2 − x_1) − B^{−1} ( φ(x_2, y_2) − φ(x_1, y_1) ), (134)
giving
‖y_2 − y_1‖ ≤ ‖B^{−1}A‖ ‖x_2 − x_1‖ + ‖B^{−1}‖ ( ε‖x_2 − x_1‖ + ε‖y_2 − y_1‖ ), (135)
and hence
‖y_2 − y_1‖ ≤ ( ‖B^{−1}A‖ + ‖B^{−1}‖ε ) / ( 1 − ‖B^{−1}‖ε ) · ‖x_2 − x_1‖.
Differentiability of f in the origin follows from the definition and (ii) above. Proof of the dif-
ferentiability in other neighboring locations is simply to move the origin there and repeat the
proof.
Luenberger gives a more complete and precise version of the theorem. The smoothness of f
depends on the smoothness of h.
A final word: Remember the theorem by recalling the equation
Ax + By = 0, (136)
where A = ∂h/∂x(0) and B = ∂h/∂y(0).
6 REFERENCES
Luenberger, D. G.: Linear and Nonlinear Programming, 2nd ed., Addison-Wesley, 1984.