
Advanced Algorithms in Nonlinear Optimization

Philippe Toint
Department of Mathematics, University of Namur, Belgium
( philippe.toint@fundp.ac.be )

Belgian Francqui Chair, Leuven, April 2009

Outline

Nonlinear optimization: motivation, past and perspectives

Trust region methods for unconstrained problems

Derivative free optimization, filters and other topics

Convex constraints and interior-point methods

The use of problem structure for large-scale applications

Regularization methods and nonlinear step control

Conclusions

Philippe Toint (Namur)

April 2009

2 / 323

Acknowledgements
This course would not have been possible without
the Francqui Foundation and the Katholieke Universiteit Leuven,
Moritz Diehl, Dirk Roose and Stefan Vandewalle (the gentle
organizers),
Fabian Bastin, Stefania Bellavia, Cinzia Cirillo, Coralia Cartis, Andy
Conn, Nick Gould, Serge Gratton, Sven Leyffer, Vincent Malmedy,
Benedetta Morini, Melodie Mouffe, Annick Sartenaer, Katya
Scheinberg, Dimitri Tomanos, Melissa Weber-Mendonca (my patient
co-authors).
Ke Chen, Patrick Laloyaux (who supplied pictures)

My grateful thanks to them all.

Philippe Toint (Namur)

April 2009

3 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

What is optimization?

The best choice subject to constraints

best: criterion, objective function
choice: variables whose value may be chosen
constraints: restrictions on allowed values of the variables

April 2009

4 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

More formally
variables:           $x = (x_1, x_2, \ldots, x_n)$
objective function:  minimize/maximize $f(x)$
constraints:         $c(x) \geq 0$

Note: maximizing $f(x)$ is equivalent to minimizing $-f(x)$.

    $\min_x f(x)$   such that   $c(x) \geq 0$

(the general nonlinear optimization problem)
(+ conditions on x, f and c)
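To make the abstract formulation concrete, here is a minimal sketch (not from the slides) of posing and solving one small instance of this problem numerically; it assumes SciPy is available, and the particular objective and constraint are invented for illustration.

```python
# Minimal sketch: min f(x) subject to c(x) >= 0, using SciPy (assumed available).
# The objective and constraint below are illustrative only, not from the slides.
import numpy as np
from scipy.optimize import minimize

def f(x):
    # a simple nonlinear objective
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] - x[0] ** 2) ** 2

def c(x):
    # a single inequality constraint, interpreted as c(x) >= 0
    return np.array([1.5 - x[0] ** 2 - x[1] ** 2])

result = minimize(f, x0=np.array([0.0, 0.0]),
                  constraints=[{"type": "ineq", "fun": c}])
print(result.x, result.fun)
```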
Philippe Toint (Namur)

April 2009

5 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Nature optimizes

Philippe Toint (Namur)

April 2009

6 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

People optimize (daily)

Philippe Toint (Namur)

April 2009

7 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: PAL design (1)


Design of modern Progressive Adaptive Lenses:
vary optical power of lenses while minimizing astigmatism

[figure: a) power, b) astigmatism]

Loos, Greiner, Seidel (1997)


Philippe Toint (Namur)

April 2009

8 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: PAL design (2)


Achievements: Loos, Greiner, Seidel (1997)

uncorrected
long distance
Philippe Toint (Namur)

short distance
PAL
April 2009

9 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: PAL design (3)


Is this nonlinear ($\Rightarrow$ difficult)?
Assume the lens surface is $z = z(x, y)$. The optical power is
    $p(x,y) = \frac{N^3}{2}\left[\left(1 + \left(\frac{\partial z}{\partial x}\right)^2\right)\frac{\partial^2 z}{\partial y^2} + \left(1 + \left(\frac{\partial z}{\partial y}\right)^2\right)\frac{\partial^2 z}{\partial x^2} - 2\,\frac{\partial z}{\partial x}\frac{\partial z}{\partial y}\frac{\partial^2 z}{\partial x \partial y}\right]$
where
    $N = N(x,y) = \frac{1}{\sqrt{1 + \left[\frac{\partial z}{\partial x}\right]^2 + \left[\frac{\partial z}{\partial y}\right]^2}}.$

The surface astigmatism is then
    $a(x,y) = 2\sqrt{\,p(x,y)^2 - N^4\left(\frac{\partial^2 z}{\partial x^2}\frac{\partial^2 z}{\partial y^2} - \left(\frac{\partial^2 z}{\partial x \partial y}\right)^2\right)}$
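As a rough illustration of why this is nonlinear in the unknown surface, the sketch below evaluates the reconstructed formulas for $p(x,y)$ and $a(x,y)$ on a sampled surface $z(x,y)$ using finite differences; the surface chosen here is arbitrary, and the code is only a sanity check of the formulas above, not the PAL design procedure.

```python
# Sketch: evaluate optical power p(x,y) and astigmatism a(x,y) of a sampled
# surface z(x,y) with finite differences (illustrative surface, not a real lens).
import numpy as np

x = np.linspace(-1.0, 1.0, 101)
y = np.linspace(-1.0, 1.0, 101)
X, Y = np.meshgrid(x, y, indexing="ij")
Z = 0.1 * X**2 + 0.05 * Y**2 + 0.02 * X * Y     # arbitrary smooth surface

zx, zy = np.gradient(Z, x, y)                   # first derivatives
zxx, zxy = np.gradient(zx, x, y)                # second derivatives
_, zyy = np.gradient(zy, x, y)

N = 1.0 / np.sqrt(1.0 + zx**2 + zy**2)
p = 0.5 * N**3 * ((1.0 + zx**2) * zyy + (1.0 + zy**2) * zxx - 2.0 * zx * zy * zxy)
a = 2.0 * np.sqrt(np.maximum(p**2 - N**4 * (zxx * zyy - zxy**2), 0.0))

print(p.mean(), a.mean())
```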
April 2009

10 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: Food sterilization (1)


A common problem in the food processing industry:
keep a max of vitamins while killing a prescribed fraction of the bacteria
heating in steam/hot water autoclaves

Sachs (2003)

Philippe Toint (Namur)

April 2009

11 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: Food sterilization (2)


Model: coupled PDEs
Concentration of micro-organisms and other nutrients:
    $\frac{\partial C}{\partial t}(x,t) = -K[\theta(x,t)]\, C(x,t),$
where $\theta(x,t)$ is the temperature, and where
    $K[\theta] = K_1\, e^{-K_2\left(\frac{1}{\theta} - \frac{1}{\theta_{\rm ref}}\right)}$   (Arrhenius equation)

Evolution of temperature:
    $c(\theta)\,\frac{\partial \theta}{\partial t} = \nabla \cdot [\,k(\theta)\,\nabla\theta\,],$

(with suitable boundary conditions: coolant, initial temperature,. . . )


Philippe Toint (Namur)

April 2009

12 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: biological parameter estimation (1)


K-channel in the model of a neuron membrane:

Sansom (2001)

Doyle et al. (1998)


Philippe Toint (Namur)

April 2009

13 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: biological parameter estimation (2)


Where are these neurons?

in a Pacific spiny lobster!

Simmers, Meyrand and Moulin (1995)


Philippe Toint (Namur)

April 2009

14 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: biological parameter estimation (3)


After gathering experimental data (applying a current to the cell):
estimate the biological model parameters that best fit experiments
Model:
Activation: p independent gates
Deactivation: nh gates with different dynamics
nh + 2 coupled ODEs for the voltage, the activation level, the partial
inactivation levels
5-point BDF for 50,000 time steps
⇒ very nonlinear!

Philippe Toint (Namur)

April 2009

15 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: data assimilation for weather forecasting (1)

(Attempt to) predict. . .


tomorrow's weather
the ocean's average temperature
next month
future gravity field
future currents in the ionosphere
...

Philippe Toint (Namur)

April 2009

16 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: data assimilation for weather forecasting (2)


Data: temperature, wind, pressure, . . . everywhere and at all times!

May involve up to 250000000 variables!


Philippe Toint (Namur)

April 2009

17 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: data assimilation for weather forecasting (3)

The principle: minimize the deviation between the model and past observations
Known situation 2.5 days ago and background prediction
Record temperature for the past 2.5 days
Run the model to minimize the difference between model and observations
Predict temperature for the next day

[figure: temperature vs. days over the 2.5-day assimilation window and the forecast]

    $\min_{x_0}\ \tfrac12 \|x_0 - x_b\|^2_{B^{-1}} + \tfrac12 \sum_{i=0}^{N} \|H M(t_i, x_0) - b_i\|^2_{R_i^{-1}}.$
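A minimal sketch of the data-assimilation objective reconstructed above, assuming a toy linear model M and observation operator H; every matrix and vector below is invented for illustration.

```python
# Sketch: evaluate the 4D-Var-style cost
#   J(x0) = 1/2 ||x0-xb||^2_{B^-1} + 1/2 sum_i ||H M(t_i,x0) - b_i||^2_{R_i^-1}
# with a toy linear model; all data here are illustrative only.
import numpy as np

n, N = 4, 5
rng = np.random.default_rng(0)
B = np.eye(n)                                        # background error covariance
R = [np.eye(2) for _ in range(N + 1)]                # observation error covariances
H = rng.standard_normal((2, n))                      # observation operator
A = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # one model time step
xb = rng.standard_normal(n)                          # background state
b = [rng.standard_normal(2) for _ in range(N + 1)]   # observations

def M(i, x0):
    # propagate the state i steps forward with the toy linear model
    return np.linalg.matrix_power(A, i) @ x0

def J(x0):
    d0 = x0 - xb
    val = 0.5 * d0 @ np.linalg.solve(B, d0)
    for i in range(N + 1):
        r = H @ M(i, x0) - b[i]
        val += 0.5 * r @ np.linalg.solve(R[i], r)
    return val

print(J(xb))
```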

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: data assimilation for weather forecasting (4)


Analysis of the ocean's heat content:

CERFACS (2009)

Much better fit!


Philippe Toint (Namur)

April 2009

19 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: aeronautical structure design


minimize weight while maintaining structural integrity

mass reduction during optimization


SAMTECH (2009)
Philippe Toint (Namur)

April 2009

20 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: asteroid trajectory matching


find today's asteroid whose orbital parameters
best match one observed 50 years ago

Milani, Sansaturio et al. (2005)


Philippe Toint (Namur)

April 2009

21 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: discrete choice modelling (1)


Context: simulation of individual choices in Transportation (or other)
(mode, route, time of departure,. . . )
Random utility theory
An individual i assigns to alternative j the utility

$U_{ij}$ = [ parameters $\times$ explaining factors ] + [ random error ]

Illustration :

$U_{\rm bus}$ = distance $-\,1.2\,\cdot$ price of ticket $-\,2.1\,\cdot$ delay wrt car travel $+\ \varepsilon$
Philippe Toint (Namur)

April 2009

22 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: discrete choice modelling (2)

Probability that individual i chooses alternative j rather than
alternative k given by
    ${\rm prob}\,(U_{ij} \geq U_{ik} \text{ for all } k)$
Data: mobility surveys (MOBEL)

find the parameters in the utility function to


maximize likelihood of observed behaviours
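As a hedged illustration of this maximum-likelihood step, the sketch below writes the log-likelihood of a simple multinomial logit model, a common special case of random utility models (not necessarily the exact specification used on these slides); the data and utility specification are invented.

```python
# Sketch: log-likelihood of a multinomial logit model (illustrative data).
# U_ij = x_ij . beta + error;  P(i chooses j) = exp(V_ij) / sum_k exp(V_ik).
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_alt, n_feat = 50, 3, 2
X = rng.standard_normal((n_obs, n_alt, n_feat))   # explanatory factors
choices = rng.integers(0, n_alt, size=n_obs)      # observed choices

def log_likelihood(beta):
    V = X @ beta                                  # deterministic utilities
    V -= V.max(axis=1, keepdims=True)             # numerical stabilisation
    logp = V - np.log(np.exp(V).sum(axis=1, keepdims=True))
    return logp[np.arange(n_obs), choices].sum()

print(log_likelihood(np.zeros(n_feat)))
```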

Philippe Toint (Namur)

April 2009

23 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: discrete choice modelling (3)


[figure: estimated distributions (Normal, Lognormal, Spline) of the value of time — using standard statistics vs. using advanced optimization]

Estimation of the value of time lost in congested traffic

(with and without advanced optimization)

Philippe Toint (Namur)

April 2009

24 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: Poisson image denoising (1)


Consider a two-dimensional image with noise proportional to signal
    $z_{ij} = u_{ij} + n\, f(u_{ij})$
where $n$ is random Gaussian noise. How to recover the original $u_{ij}$?

use the pixel values as much as possible
while minimizing sharp transitions (gradients)

This leads to the optimization problem
    $\min_u\ \sum_{ij}\big(u_{ij} - z_{ij}\log(u_{ij})\big) + \beta\,\|\nabla u\|$   ($\beta > 0$ a regularization weight)
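A small sketch of the reconstructed denoising objective, evaluated on a toy image; the smoothed total-variation term and the weight beta below are standard choices but are assumptions here, not details given on the slide.

```python
# Sketch: Poisson-denoising objective  sum_ij (u_ij - z_ij log u_ij) + beta*TV(u),
# with a smoothed total-variation term (illustrative implementation).
import numpy as np

def objective(u, z, beta=0.1, eps=1e-8):
    fidelity = np.sum(u - z * np.log(u + eps))
    gx = np.diff(u, axis=0, append=u[-1:, :])     # forward differences
    gy = np.diff(u, axis=1, append=u[:, -1:])
    tv = np.sum(np.sqrt(gx**2 + gy**2 + eps))     # smoothed |grad u|
    return fidelity + beta * tv

rng = np.random.default_rng(2)
z = rng.poisson(10.0, size=(32, 32)).astype(float)   # noisy image
print(objective(z.copy(), z))
```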

April 2009

25 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: Poisson image denoising (2)


Some spectacular results: a 512 × 512 picture with 95% noise

Chan and Chen (2007)


Philippe Toint (Namur)

April 2009

26 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: shock simulation in video games


Optimize the realism of the motion of multiple rigid bodies in space
complementarity problem
    $\nabla_q \Phi[q(t)]\, v(t) \geq 0, \qquad \Phi(q(t)) \geq 0$
($q(t)$ = positions, $v(t) = \frac{dq}{dt}(t)$ = velocities)

system of inequalities and equalities


used in realtime for video animation
Anitescu and Potra (1996)
Philippe Toint (Namur)

April 2009

27 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: finance
1. risk management
2. portfolio analysis
3. FX markets

[figure: investment distribution for the BoJ 1991-2004 (Normal vs. Spline; standard vs. optimized)]

4. ...

Everybody loves an optimizer!

Philippe Toint (Namur)

April 2009

28 / 323

Nonlinear optimization: motivation, past and perspectives

History

Where does optimization come from?


Nous sommes comme des nains juchés sur des épaules de géants, de telle
sorte que nous puissions voir plus de choses et de plus éloignées que n'en
voyaient ces derniers. Et cela, non point parce que notre vue serait
puissante ou notre taille avantageuse, mais parce que nous sommes portés
et exhaussés par la haute stature des géants.
We are like dwarfs standing on the shoulders of giants, such that we can
see more things and further away than they could. And this, not because
our sight would be more powerful or our height more advantageous, but
because we are carried and raised up by the high stature of the giants.
Bernard de Chartres (1130-1160)

Philippe Toint (Namur)

April 2009

29 / 323

Nonlinear optimization: motivation, past and perspectives

Euclid (300 BC)

Philippe Toint (Namur)

History

Al-Khwarizmi (783-850)

April 2009

30 / 323

Nonlinear optimization: motivation, past and perspectives

Isaac Newton (1642-1727)

Philippe Toint (Namur)

History

Leonhard Euler (1707-1783)

April 2009

31 / 323

Nonlinear optimization: motivation, past and perspectives

J. de Lagrange (1735-1813)

Philippe Toint (Namur)

History

Carl Friedrich Gauss (1777-1855)

April 2009

32 / 323

Nonlinear optimization: motivation, past and perspectives

History

Augustin Cauchy (1789-1857) George Dantzig (1914-2005)

Philippe Toint (Namur)

April 2009

33 / 323

Nonlinear optimization: motivation, past and perspectives

Michael Powell

Philippe Toint (Namur)

History

Roger Fletcher

April 2009

34 / 323

Nonlinear optimization: motivation, past and perspectives

Basic concepts

Return to the mathematical problem

    $\min_x f(x)$   such that   $c(x) \geq 0$

Difficulties:
the objective function f(x) is typically complicated (nonlinear)
it is also often costly to compute
there may be many variables
the constraints c(x) may define a complicated geometry

Philippe Toint (Namur)

April 2009

35 / 323

Nonlinear optimization: motivation, past and perspectives

Basic concepts

An example unconstrained problem


minimize : f (, ) = 102 + 10 2 + 4 sin() 2 + 4
3

Two local minima: (2.20, 0.32) and (2.30, 0.34)


How to find them?
Philippe Toint (Namur)

April 2009

36 / 323

Nonlinear optimization: motivation, past and perspectives

Basic concepts

Trust-region methods
iterative algorithms
find local solutions only
Algorithm 1.1: The trust-region framework
Until an (approximate) solution is found:
Step 1: use a model of the nonlinear function(s)
within region where it can be trusted
Step 2: notion of sufficient decrease
Step 3: measure achieved and predicted reductions
Step 4: decrease the region radius if unsuccessful

Philippe Toint (Namur)

April 2009

37 / 323

Nonlinear optimization: motivation, past and perspectives

Illustration

minimize : f (, ) = 102 + 10 2 + 4 sin() 2 + 4

Two local minima: (2.20, 0.32) and (2.30, 0.34)

Philippe Toint (Namur)

April 2009

38 / 323

Nonlinear optimization: motivation, past and perspectives

x0 = (0.71, 3.27)

Illustration

and

f (x0 ) = 97.630

Contours of f

Contours of m0 around x0
(quadratic model)

Philippe Toint (Namur)

April 2009

39 / 323

Nonlinear optimization: motivation, past and perspectives

Illustration

 k | Δk |     sk       | f(xk + sk) | ∆f/∆mk | xk+1
---+----+--------------+------------+--------+---------
 0 |  1 | (0.05, 0.93) |   43.742   | 0.998  | x0 + s0
 1 |  2 | (0.62, 1.78) |    2.306   | 1.354  | x1 + s1
 2 |  4 | (3.21, 0.00) |    6.295   | 0.004  | x2
 3 |  2 | (1.90, 0.08) |   29.392   | 0.649  | x2 + s2
 4 |  2 | (0.32, 0.15) |   31.131   | 0.857  | x3 + s3
 5 |  4 | (0.03, 0.02) |   31.176   | 1.009  | x4 + s4
 6 |  8 | (0.02, 0.00) |   31.179   | 1.013  | x5 + s5

Philippe Toint (Namur)

April 2009

40 / 323

Nonlinear optimization: motivation, past and perspectives

Illustration

Path of iterates:

From another x0 :

Philippe Toint (Namur)

April 2009

41 / 323

Nonlinear optimization: motivation, past and perspectives

Illustration

And then. . .

Does it (always) work?

The answer tomorrow!


(and subsequent days for a (biased) survey of new optimization methods)

Thank you for your attention

Philippe Toint (Namur)

April 2009

42 / 323

Trust region methods for unconstrained problems

Philippe Toint (Namur)

April 2009

43 / 323

Trust region methods for unconstrained problems


Lesson 2:
Trust-region methods
for unconstrained problems

Philippe Toint (Namur)

April 2009

43 / 323

Trust region methods for unconstrained problems

The basic text for this course

A. R. Conn, N. I. M. Gould and Ph. L. Toint,


Trust-Region Methods,
Nr 01 in the MPS-SIAM Series on Optimization,
SIAM, Philadelphia, USA, 2000.
Philippe Toint (Namur)

April 2009

44 / 323

Trust region methods for unconstrained problems

Background material

2.1: Background material

Philippe Toint (Namur)

April 2009

45 / 323

Trust region methods for unconstrained problems

Background material

Scalar mean-value theorems


Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $f : S \to \mathbb{R}$ is
continuously differentiable throughout $S$. Then, if the segment
$x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $f(x + s) = f(x) + \langle \nabla_x f(x + \alpha s), s \rangle$
for some $\alpha \in [0, 1]$.

Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $f : S \to \mathbb{R}$ is twice
continuously differentiable throughout $S$. Then, if the segment
$x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $f(x + s) = f(x) + \langle \nabla_x f(x), s \rangle + \tfrac12 \langle s, \nabla_{xx} f(x + \alpha s)\, s \rangle$
for some $\alpha \in [0, 1]$.
Philippe Toint (Namur)

April 2009

46 / 323

Trust region methods for unconstrained problems

Background material

Vector mean-value theorem

Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $F : S \to \mathbb{R}^m$ is
continuously differentiable throughout $S$. Then, if the segment
$x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $F(x + s) = F(x) + \int_0^1 \nabla_x F(x + \alpha s)\, s \, d\alpha.$

Philippe Toint (Namur)

April 2009

47 / 323

Trust region methods for unconstrained problems

Background material

Taylors scalar approximation theorems (1)

Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $f : S \to \mathbb{R}$ is
continuously differentiable throughout $S$. Suppose further that
$\nabla_x f(x)$ is Lipschitz continuous at $x$, with Lipschitz constant
$\gamma(x)$ in some appropriate vector norm. Then, if the segment
$x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $|f(x + s) - m(x + s)| \leq \tfrac12 \gamma(x) \|s\|^2,$
where
    $m(x + s) = f(x) + \langle \nabla_x f(x), s \rangle.$

Philippe Toint (Namur)

April 2009

48 / 323

Trust region methods for unconstrained problems

Background material

Taylors scalar approximation theorems (2)

Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $f : S \to \mathbb{R}$ is twice
continuously differentiable throughout $S$. Suppose further that
$\nabla_{xx} f(x)$ is Lipschitz continuous at $x$, with Lipschitz constant
$\gamma(x)$ in some appropriate vector norm and its induced matrix
norm. Then, if the segment $x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $|f(x + s) - m(x + s)| \leq \tfrac16 \gamma(x) \|s\|^3,$
where
    $m(x + s) = f(x) + \langle \nabla_x f(x), s \rangle + \tfrac12 \langle s, \nabla_{xx} f(x)\, s \rangle.$

Philippe Toint (Namur)

April 2009

49 / 323

Trust region methods for unconstrained problems

Background material

Taylors vector approximation theorem

Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $F : S \to \mathbb{R}^m$ is
continuously differentiable throughout $S$. Suppose further that
$\nabla_x F(x)$ is Lipschitz continuous at $x$, with Lipschitz constant
$\gamma(x)$ in some appropriate vector norm and its induced matrix
norm. Then, if the segment $x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $\|F(x + s) - M(x + s)\| \leq \tfrac12 \gamma(x) \|s\|^2,$
where
    $M(x + s) = F(x) + \nabla_x F(x)\, s.$

Philippe Toint (Namur)

April 2009

50 / 323

Trust region methods for unconstrained problems

Background material

Newtons method

Solve
F (x) = 0
Idea: solve linear approximation
F (x) + J(x)s = 0
quadratic local convergence
. . . but not globally convergent
Yet the basis of everything that follows
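A minimal sketch of the Newton iteration described above, for a small made-up system F(x) = 0 with a hand-coded Jacobian.

```python
# Sketch: Newton's method for F(x) = 0 (toy 2x2 system, hand-coded Jacobian).
import numpy as np

def F(x):
    return np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])

def J(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])

x = np.array([2.0, 0.5])                      # starting point
for _ in range(20):
    s = np.linalg.solve(J(x), -F(x))          # solve the linearized system J(x) s = -F(x)
    x = x + s
    if np.linalg.norm(F(x)) < 1e-12:
        break
print(x)
```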

Philippe Toint (Namur)

April 2009

51 / 323

Trust region methods for unconstrained problems

Background material

Unconstrained optimality conditions


Suppose that $f \in C^1$, and that $x_*$ is a local minimizer of $f(x)$.
Then
    $\nabla_x f(x_*) = 0.$
Suppose that $f \in C^2$, and that $x_*$ is a local minimizer of $f(x)$.
Then the above holds and the objective function's Hessian at
$x_*$ is positive semi-definite, that is
    $\langle s, \nabla_{xx} f(x_*)\, s \rangle \geq 0$ for all $s \in \mathbb{R}^n$.

$\langle s, \nabla_{xx} f(x_*)\, s \rangle > 0$ for all $s \neq 0 \in \mathbb{R}^n$
$\Rightarrow$ strict local solution
Philippe Toint (Namur)

April 2009

52 / 323

Trust region methods for unconstrained problems

Background material

Constrained optimality conditions (1)


    minimize    $f(x)$
    subject to  $c_i(x) = 0$, for $i \in \mathcal{E}$,
    and         $c_i(x) \geq 0$, for $i \in \mathcal{I}$,

Active set:

[figure: the feasible region $\mathcal{C}$ bounded by $c_1(x) = 0$, $c_2(x) = 0$, $c_3(x) = 0$, with faces $\mathcal{F}_{\{1\}}, \mathcal{F}_{\{2\}}, \mathcal{F}_{\{3\}}, \mathcal{F}_{\{1,2\}}, \mathcal{F}_{\{1,3\}}, \mathcal{F}_{\{2,3\}}$ and points $x_1, x_2, x_3$]

    $\mathcal{A}(x_1) = \{1, 2\}$
    $\mathcal{A}(x_2) = \emptyset$
    $\mathcal{A}(x_3) = \{3\}$

Philippe Toint (Namur)
April 2009

53 / 323

Trust region methods for unconstrained problems

Background material

Constrained optimality conditions (2): first order


Ignore constraint qualification!
Suppose that $f, c \in C^1$, and that $x_*$ is a local solution. Then
there exists a vector of Lagrange multipliers $y_*$ such that
    $\nabla_x f(x_*) = \sum_{i \in \mathcal{E} \cup \mathcal{I}} [y_*]_i \nabla_x c_i(x_*)$
    $c_i(x_*) = 0$ for all $i \in \mathcal{E}$
    $c_i(x_*) \geq 0$ and $[y_*]_i \geq 0$ for all $i \in \mathcal{I}$
    and $c_i(x_*)\,[y_*]_i = 0$ for all $i \in \mathcal{I}$.

Lagrangian: $\ell(x, y) = f(x) - \sum_{i \in \mathcal{E} \cup \mathcal{I}} y_i c_i(x)$
Philippe Toint (Namur)

April 2009

54 / 323

Trust region methods for unconstrained problems

Background material

Constrained optimality conditions (3): second order


Suppose that $f, c \in C^2$, and that $x_*$ is a local minimizer of $f(x)$.
Then there exists a vector of Lagrange multipliers $y_*$ such that the first-order conditions hold and
    $\langle s, \nabla_{xx} \ell(x_*, y_*)\, s \rangle \geq 0$ for all $s \in \mathcal{N}_+$
where $\mathcal{N}_+$ is the set of vectors $s$ such that
    $\langle s, \nabla_x c_i(x_*) \rangle = 0$ for all $i \in \mathcal{E} \cup \{ j \in \mathcal{A}(x_*) \cap \mathcal{I} \mid [y_*]_j > 0 \}$
and
    $\langle s, \nabla_x c_i(x_*) \rangle \geq 0$ for all $i \in \{ j \in \mathcal{A}(x_*) \cap \mathcal{I} \mid [y_*]_j = 0 \}$

strict complementarity: $\langle s, \nabla_{xx} \ell(x_*, y_*)\, s \rangle > 0$ for all $s \in \mathcal{N}_+$ ($s \neq 0$)
$\Rightarrow$ strict local solution
Philippe Toint (Namur)

April 2009

55 / 323

Trust region methods for unconstrained problems

Background material

Optimality conditions (convex 1)

Assume now that $\mathcal{C}$ is convex

normal cone of $\mathcal{C}$ at $x \in \mathcal{C}$,
    $\mathcal{N}(x) \stackrel{\rm def}{=} \{ y \in \mathbb{R}^n \mid \langle y, u - x \rangle \leq 0,\ \forall u \in \mathcal{C} \}$

tangent cone of $\mathcal{C}$ at $x \in \mathcal{C}$
    $\mathcal{T}(x) \stackrel{\rm def}{=} \mathcal{N}(x)^0 = {\rm cl}\{ \lambda (u - x) \mid \lambda \geq 0 \text{ and } u \in \mathcal{C} \}$

Philippe Toint (Namur)

April 2009

56 / 323

Trust region methods for unconstrained problems

Background material

Optimality conditions (convex 2)


[figure]
The Moreau decomposition
Philippe Toint (Namur)

April 2009

57 / 323

Trust region methods for unconstrained problems

Background material

Optimality conditions (convex 2)

Suppose that $\mathcal{C} \neq \emptyset$ is convex, closed, that $f$ is continuously
differentiable in $\mathcal{C}$, and that $x_*$ is a first-order critical point for
the minimization of $f$ over $\mathcal{C}$. Then, provided that constraint
qualification holds,
    $-\nabla_x f(x_*) \in \mathcal{N}(x_*).$

Philippe Toint (Namur)

April 2009

58 / 323

Trust region methods for unconstrained problems

Background material

Conjugate gradients
Idea: minimize a convex quadratic on successive nested Krylov subspaces
Algorithm 2.1: Conjugate-gradients (CG)
Given $x_0$, set $g_0 = H x_0 + c$ and let $p_0 = -g_0$.
For $k = 0, 1, \ldots$, until convergence, perform the iteration
    $\alpha_k = \|g_k\|_2^2 / \langle p_k, H p_k \rangle$
    $x_{k+1} = x_k + \alpha_k p_k$
    $g_{k+1} = g_k + \alpha_k H p_k$
    $\beta_k = \|g_{k+1}\|_2^2 / \|g_k\|_2^2$
    $p_{k+1} = -g_{k+1} + \beta_k p_k$
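A direct transcription of Algorithm 2.1 into Python, minimizing the convex quadratic $\tfrac12\langle x, Hx\rangle + \langle c, x\rangle$; the stopping test on $\|g_k\|$ is an assumption, since the slide leaves "until convergence" unspecified.

```python
# Sketch of Algorithm 2.1 (conjugate gradients) for q(x) = 1/2 <x,Hx> + <c,x>.
import numpy as np

def conjugate_gradients(H, c, x0, tol=1e-10, maxit=200):
    x = x0.copy()
    g = H @ x + c                 # g0 = H x0 + c
    p = -g                        # p0 = -g0
    for _ in range(maxit):
        if np.linalg.norm(g) <= tol:
            break
        Hp = H @ p
        alpha = (g @ g) / (p @ Hp)
        x = x + alpha * p
        g_new = g + alpha * Hp
        beta = (g_new @ g_new) / (g @ g)
        p = -g_new + beta * p
        g = g_new
    return x

H = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([-1.0, -2.0])
print(conjugate_gradients(H, c, np.zeros(2)))   # solves H x = -c
```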

Philippe Toint (Namur)

April 2009

59 / 323

Trust region methods for unconstrained problems

Background material

Preconditioning
Idea: change the variables $\bar x = R x$ and define $M = R^T R$.
Algorithm 2.2: Preconditioned CG
Given $x_0$, set $g_0 = H x_0 + c$, and let $v_0 = M^{-1} g_0$ and $p_0 = -v_0$.
For $k = 0, 1, \ldots$, until convergence, perform the iteration
    $\alpha_k = \langle g_k, v_k \rangle / \langle p_k, H p_k \rangle$
    $x_{k+1} = x_k + \alpha_k p_k$
    $g_{k+1} = g_k + \alpha_k H p_k$
    $v_{k+1} = M^{-1} g_{k+1}$
    $\beta_k = \langle g_{k+1}, v_{k+1} \rangle / \langle g_k, v_k \rangle$
    $p_{k+1} = -v_{k+1} + \beta_k p_k$

Philippe Toint (Namur)

April 2009

60 / 323

Trust region methods for unconstrained problems

Background material

Lanczos method
Idea: compute an orthonormal basis of the successive nested Krylov
subspaces
$\Rightarrow$ makes $Q_k^T H Q_k$ tridiagonal
Algorithm 2.3: Lanczos
Given $r_0$, set $y_0 = r_0$, $q_{-1} = 0$.
For $k = 0, 1, \ldots$, perform the iteration
    $\gamma_k = \|y_k\|_2$
    $q_k = y_k / \gamma_k$
    $\delta_k = \langle q_k, H q_k \rangle$
    $y_{k+1} = H q_k - \delta_k q_k - \gamma_k q_{k-1}$
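A sketch of Algorithm 2.3; it returns the orthonormal basis Q and the tridiagonal matrix T = QᵀHQ. The symbols gamma and delta follow the reconstruction above and are assumptions about the Greek letters lost in extraction.

```python
# Sketch of Algorithm 2.3 (Lanczos): build Q with orthonormal columns spanning
# the Krylov subspace of (H, r0), so that Q^T H Q is tridiagonal.
import numpy as np

def lanczos(H, r0, m):
    n = r0.size
    Q = np.zeros((n, m))
    T = np.zeros((m, m))
    y = r0.copy()
    q_prev = np.zeros(n)
    gamma = np.linalg.norm(y)
    for k in range(m):
        q = y / gamma                       # q_k = y_k / gamma_k
        delta = q @ (H @ q)                 # delta_k = <q_k, H q_k>
        y = H @ q - delta * q - gamma * q_prev
        Q[:, k] = q
        T[k, k] = delta
        if k + 1 < m:
            T[k, k + 1] = T[k + 1, k] = np.linalg.norm(y)
        q_prev = q
        gamma = np.linalg.norm(y)
        if gamma < 1e-14:                   # Krylov space exhausted
            return Q[:, :k + 1], T[:k + 1, :k + 1]
    return Q, T

H = np.diag([1.0, 2.0, 3.0, 4.0])
Q, T = lanczos(H, np.ones(4), 3)
print(np.round(Q.T @ H @ Q - T, 10))        # should be (numerically) zero
```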

Philippe Toint (Namur)

April 2009

61 / 323

Trust region methods for unconstrained problems

Background material

Another view on the Conjugate-Gradients method

Conjugate Gradients = Lanczos + $LDL^T$ (Cholesky)

Lanczos + Cholesky
$\Rightarrow$ conjugate gradients in one of the Krylov subspaces

Philippe Toint (Namur)

April 2009

62 / 323

Trust region methods for unconstrained problems

The Trust-region algorithm

2.2: The trust-region algorithm

Philippe Toint (Namur)

April 2009

63 / 323

Trust region methods for unconstrained problems

The Trust-region algorithm

The trust-region idea

use a model of the objective function
define a trust-region where it is thought adequate
    $\mathcal{B}_k = \{ x \in \mathbb{R}^n \mid \|x - x_k\|_k \leq \Delta_k \}$
find a trial point by sufficiently decreasing the model in $\mathcal{B}_k$
compute the objective function at the trial point
compare achieved vs. predicted reductions
reduce $\Delta_k$ if unsatisfactory

Philippe Toint (Namur)

April 2009

64 / 323

Trust region methods for unconstrained problems

The Trust-region algorithm

The basic trust-region algorithm


Algorithm 2.4: Basic trust-region algorithm (BTR)
Step 0: Initialization. $x_0$ and $\Delta_0$ given, compute $f(x_0)$ and set $k = 0$.
Step 1: Model definition. Choose $\|\cdot\|_k$ and define a model $m_k$ in $\mathcal{B}_k$.
Step 2: Step calculation. Compute $s_k$ that sufficiently reduces the
model $m_k$ with $x_k + s_k \in \mathcal{B}_k$.
Step 3: Acceptance of the trial point. Compute $f(x_k + s_k)$ and define
    $\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)}.$
If $\rho_k \geq \eta_1$, then define $x_{k+1} = x_k + s_k$; otherwise define $x_{k+1} = x_k$.

Step 4: Trust-region radius update.
    $\Delta_{k+1} \in \begin{cases} [\Delta_k, \infty) & \text{if } \rho_k \geq \eta_2, \\ [\gamma_2 \Delta_k, \Delta_k] & \text{if } \rho_k \in [\eta_1, \eta_2), \\ [\gamma_1 \Delta_k, \gamma_2 \Delta_k] & \text{if } \rho_k < \eta_1. \end{cases}$

Increment $k$ by one and go to Step 1.
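A compact sketch of Algorithm 2.4 with a quadratic model and a Cauchy-point step; the constants eta1 = 0.01, eta2 = 0.9 and the simple radius updates are illustrative choices, since the algorithm itself only prescribes the intervals.

```python
# Sketch of Algorithm 2.4 (basic trust region, BTR) with a quadratic model and
# a Cauchy-point step; the specific constants below are illustrative choices.
import numpy as np

def cauchy_step(g, H, Delta):
    # minimize the quadratic model along -g inside the ball of radius Delta
    gHg = g @ (H @ g)
    t = Delta / np.linalg.norm(g)
    if gHg > 0:
        t = min(t, (g @ g) / gHg)
    return -t * g

def btr(f, grad, hess, x0, Delta0=1.0, tol=1e-8, maxit=500, eta1=0.01, eta2=0.9):
    x, Delta = x0.copy(), Delta0
    for _ in range(maxit):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        s = cauchy_step(g, H, Delta)                    # Step 2
        pred = -(g @ s + 0.5 * s @ (H @ s))             # predicted model decrease
        rho = (f(x) - f(x + s)) / pred                  # Step 3
        if rho >= eta1:
            x = x + s                                   # accept the trial point
        if rho >= eta2:
            Delta *= 2.0                                # very successful
        elif rho < eta1:
            Delta *= 0.5                                # unsuccessful
    return x

# illustrative use on a convex quadratic
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ (A @ x)
print(btr(f, lambda x: A @ x, lambda x: A, np.array([2.0, 1.0])))
```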


Philippe Toint (Namur)

April 2009

65 / 323

Trust region methods for unconstrained problems

Basic convergence theory

2.3: Basic convergence theory

Philippe Toint (Namur)

April 2009

66 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Assumptions
$f \in C^2$
$f(x) \geq \kappa_{\rm lbf}$
$\|\nabla_{xx} f(x)\| \leq \kappa_{\rm ufh}$
$m_k \in C^2(\mathcal{B}_k)$
$m_k(x_k) = f(x_k)$
$g_k \stackrel{\rm def}{=} \nabla_x m_k(x_k) = \nabla_x f(x_k)$
$\|\nabla_{xx} m_k(x)\| \leq \kappa_{\rm umh} - 1$ for all $x \in \mathcal{B}_k$
$\frac{1}{\kappa_{\rm une}} \|x\|_k \leq \|x\| \leq \kappa_{\rm une} \|x\|_k$

. . . but use $\|\cdot\|_k = \|\cdot\|_2$ in what follows!


Philippe Toint (Namur)

April 2009

67 / 323

Trust region methods for unconstrained problems

Basic convergence theory

The Cauchy step


Idea: minimize mk on the Cauchy arc
    $x_k^C(t) \stackrel{\rm def}{=} \{ x \mid x = x_k - t g_k,\ t \geq 0 \text{ and } x \in \mathcal{B}_k \}.$

[figure: the model contours, the Cauchy arc and the Cauchy point]


Philippe Toint (Namur)

April 2009

68 / 323

Trust region methods for unconstrained problems

Basic convergence theory

The Cauchy point for quadratic models


Three cases when minimizing the quadratic $m_k$ along the Cauchy arc:

[figures: the three cases]

    $m_k(x_k) - m_k(x_k^C) \geq \tfrac12 \|g_k\| \min\left[ \frac{\|g_k\|}{\beta_k},\ \Delta_k \right]$

Philippe Toint (Namur)

April 2009

69 / 323

Trust region methods for unconstrained problems

Basic convergence theory

The Cauchy point for general models


Three cases when minimizing the general $m_k$ along the Cauchy arc:

[figures: the three cases]

    $m_k(x_k) - m_k(x_k^{AC}) \geq \kappa_{\rm dcp} \|g_k\| \min\left[ \frac{\|g_k\|}{\beta_k},\ \Delta_k \right]$

Philippe Toint (Namur)

April 2009

70 / 323

Trust region methods for unconstrained problems

Basic convergence theory

The meaning of sufficient decrease


In both cases, we get:
Sufficient decrease condition:

    $m_k(x_k) - m_k(x_k + s_k) \geq \kappa_{\rm mdc} \|g_k\| \min\left[ \frac{\|g_k\|}{\beta_k},\ \Delta_k \right]$

Immediate consequence:
Suppose that $\nabla_x f(x_k) \neq 0$. Then $m_k(x_k + s_k) < m_k(x_k)$ and
$s_k \neq 0$.
$\Rightarrow$ $\rho_k$ is well defined!
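A quick numerical illustration (with an arbitrary quadratic model) that the Cauchy point achieves a decrease of this type; the constant 1/2 and the choice beta = 1 + ||H|| used in the check follow the usual quadratic-model bound and are assumptions about the exact constants on the slide.

```python
# Sketch: check a sufficient-decrease bound at the Cauchy point for an
# arbitrary quadratic model m(s) = <g,s> + 1/2 <s,Hs> inside ||s|| <= Delta.
import numpy as np

rng = np.random.default_rng(3)
n, Delta = 5, 0.7
g = rng.standard_normal(n)
H = rng.standard_normal((n, n)); H = 0.5 * (H + H.T)   # symmetric, maybe indefinite

m = lambda s: g @ s + 0.5 * s @ (H @ s)

# Cauchy point: minimize m along -g, restricted to the trust region (grid search)
ts = np.linspace(1e-6, Delta / np.linalg.norm(g), 1000)
achieved = max(-m(-t * g) for t in ts)

beta = 1.0 + np.linalg.norm(H, 2)
bound = 0.5 * np.linalg.norm(g) * min(np.linalg.norm(g) / beta, Delta)
print(achieved >= bound, achieved, bound)
```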

Philippe Toint (Namur)

April 2009

71 / 323

Trust region methods for unconstrained problems

Basic convergence theory

The exact minimizer is OK


Suppose that, for all k, sk ensures that
    $m_k(x_k) - m_k(x_k + s_k) \geq \kappa_{\rm amm}\,[\, m_k(x_k) - m_k(x_k^M)\,],$
Then sufficient decrease is obtained.

[figure: contours of the model and the model minimizer $x_k^M$]

Philippe Toint (Namur)

April 2009

72 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Taylor and minimum radius


For all $k$,
    $|f(x_k + s_k) - m_k(x_k + s_k)| \leq \kappa_{\rm ubh} \Delta_k^2,$

Suppose that $g_k \neq 0$ and that
    $\Delta_k \leq \frac{\kappa_{\rm mdc} \|g_k\| (1 - \eta_2)}{\kappa_{\rm ubh}}.$
Then iteration $k$ is very successful and $\Delta_{k+1} \geq \Delta_k$.

Suppose that $\|g_k\| \geq \kappa_{\rm lbg} > 0$ for all $k$. Then there is a constant
$\kappa_{\rm lbd} > 0$ such that, for all $k$,
    $\Delta_k \geq \kappa_{\rm lbd}.$
Philippe Toint (Namur)

April 2009

73 / 323

Trust region methods for unconstrained problems

Basic convergence theory

First-order convergence (1)

Suppose that there are only finitely many successful iterations.
Then $x_k = x_*$ for all sufficiently large $k$ and $x_*$ is first-order
critical.

Suppose that there are infinitely many successful iterations.
Then
    $\liminf_{k \to \infty} \|\nabla_x f(x_k)\| = 0.$

idea: infinite descent if not critical

Philippe Toint (Namur)

April 2009

74 / 323

Trust region methods for unconstrained problems

Basic convergence theory

First-order convergence (2)


Suppose that there are infinitely many successful iterations.
Then $\lim_{k \to \infty} \|\nabla_x f(x_k)\| = 0.$

[figure: $\|g_k\|$ versus $k$, with the subsequences $\{t_i\}$ and $\{\ell_i\}$ used in the proof]

Philippe Toint (Namur)
April 2009

75 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Convex models (1)


Suppose that $\lambda_{\min}[\nabla_{xx} m_k(x)] \geq \epsilon$ for all $x \in [x_k, x_k + s_k]$ and
for some $\epsilon > 0$. Then
    $\|s_k\| \leq \frac{2}{\epsilon} \|g_k\|.$
idea: $m_k$ curves upwards!

Suppose that $\{x_{k_i}\} \to x_*$ and $x_*$ is first-order critical, and that
there is a constant $\kappa_{\rm smh} > 0$ such that
    $\min_{x \in \mathcal{B}_k} \lambda_{\min}[\nabla_{xx} m_k(x)] \geq \kappa_{\rm smh}$
whenever $x_k$ is sufficiently close to $x_*$. Suppose finally that
$\nabla_{xx} f(x_*)$ is nonsingular. Then the complete sequence of iterates $\{x_k\}$ converges to $x_*$.
idea: steps too short to escape local basin
Philippe Toint (Namur)

April 2009

76 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Convex models (2)

But. . .
[figure]

Philippe Toint (Namur)

April 2009

77 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Asymptotically exact Hessians


Assume also that
    $\lim_{k\to\infty} \|\nabla_{xx} f(x_k) - \nabla_{xx} m_k(x_k)\| = 0$ whenever $\lim_{k\to\infty} \|g_k\| = 0$

Suppose that $\{x_{k_i}\} \to x_*$ and $x_*$ is first-order critical, that
$s_k \neq 0$ for all $k$ sufficiently large, and that $\nabla_{xx} f(x_*)$ is positive
definite. Then the complete sequence of iterates $\{x_k\}$ converges to $x_*$, all iterations are eventually very successful and
the trust-region radius $\Delta_k$ is bounded away from zero.
idea: sufficient decrease implies that
    $m_k(x_k) - m_k(x_k + s_k) \geq \kappa_{\rm mqd} \|s_k\|^2 > 0.$
Then $\rho_k \to 1$.
Philippe Toint (Namur)

April 2009

78 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Second order: the eigen point


Assume $0 > \lambda_k := \lambda_{\min}(H_k)$.
Then define the eigen direction $u_k$ such that
    $\langle u_k, g_k \rangle \leq 0$, $\|u_k\|_k = \Delta_k$, $\langle u_k, H_k u_k \rangle \leq \kappa_{\rm snc}\, \lambda_k \Delta_k^2$,
Minimize the model along $u_k$ to compute the eigen point:
    $m_k(x_k^E) = m_k(x_k + t_k^E u_k) = \min_{t \in (0,1]} m_k(x_k + t u_k)$

[figure: the eigen point]

Philippe Toint (Namur)

April 2009

79 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Model decrease at the eigen point


Suppose: $0 > \lambda_k(H_k)$, $u_k$ is an eigen direction and
    $\|\nabla_{xx} m_k(x) - \nabla_{xx} m_k(y)\| \leq \kappa_{\rm lch} \|x - y\|$
for all $x, y \in \mathcal{B}_k$. Then
    $m_k(x_k) - m_k(x_k^E) \geq \kappa_{\rm sod}\, |\lambda_k| \min[\lambda_k^2, \Delta_k^2].$
(quadratic or general model)

[figures: the model decrease along the eigen direction]

Philippe Toint (Namur)

April 2009

80 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Second order: convergence theorems


    $\limsup_{k\to\infty} \lambda_{\min}[\nabla_{xx} f(x_k)] \geq 0.$

Suppose that $x_*$ is an isolated limit point of the sequence of
iterates $\{x_k\}$. Then $x_*$ is a second-order critical point.

Assume also that, for $\gamma_3 > 1$,
    $\rho_k \geq \eta_2$ and $\Delta_k \leq \Delta_{\max}$ imply $\Delta_{k+1} \in [\gamma_3 \Delta_k, \gamma_4 \Delta_k]$
Let $x_*$ be any limit point of the sequence of iterates. Then $x_*$
is a second-order critical point.

Philippe Toint (Namur)

April 2009

81 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Different trust-region norms

[figures: the trust region $\|s\|_k \leq \Delta_k$ in different norms]

Philippe Toint (Namur)

April 2009

82 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Using norms for scaling


Idea: change the variables
    $S_k w = s$
Then
    $m_k^S(x_k + w) \approx f(x_k + S_k w) \stackrel{\rm def}{=} f^S(w),$
    $\mathcal{B}_k^S = \{ x_k + w \mid \|w\| \leq \Delta_k \},$
    $m_k^S(x_k) = f(x_k),$
    $g_k^S = \nabla_w f^S(0) = S_k^T \nabla_x f(x_k),$
    $H_k^S \approx \nabla_{ww} f^S(0) = S_k^T \nabla_{xx} f(x_k) S_k.$

Thus
    $m_k^S(x_k + w) = f(x_k) + \langle g_k^S, w \rangle + \tfrac12 \langle w, H_k^S w \rangle$
                   $= f(x_k) + \langle S_k^T \nabla_x f(x_k), w \rangle + \tfrac12 \langle w, S_k^T H_k S_k w \rangle$
                   $= f(x_k) + \langle \nabla_x f(x_k), S_k w \rangle + \tfrac12 \langle S_k w, H_k S_k w \rangle$
                   $= f(x_k) + \langle \nabla_x f(x_k), s \rangle + \tfrac12 \langle s, H_k s \rangle$
                   $= m_k(x_k + s)$

Philippe Toint (Namur)
April 2009

83 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Scaling: the geometry

[figures: the trust region before and after scaling]

Philippe Toint (Namur)

April 2009

84 / 323

Trust region methods for unconstrained problems

Solving the subproblem

2.4: Solving the subproblem

Philippe Toint (Namur)

April 2009

85 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The subproblem

Assume
Euclidean norm
quadratic model (possibly non-convex)
(drop the index k)

    $\min_{s \in \mathbb{R}^n}\ q(s) \equiv \langle g, s \rangle + \tfrac12 \langle s, H s \rangle \quad\text{subject to}\quad \|s\|_2 \leq \Delta$

Philippe Toint (Namur)

April 2009

86 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Possible approaches

exact minimization
truncated conjugate-gradients
CG + Lanczos (GLTR)
doglegs
eigenvalue based methods
(projection methods)

Philippe Toint (Namur)

April 2009

87 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The exact minimizer

Any global minimizer of $q(s)$ subject to $\|s\|_2 \leq \Delta$ satisfies the
equation
    $H(\lambda^M)\, s^M = -g,$
where $H(\lambda^M) \stackrel{\rm def}{=} H + \lambda^M I$ is positive semi-definite,
$\lambda^M \geq 0$ and
    $\lambda^M (\|s^M\|_2 - \Delta) = 0.$
If $H(\lambda^M)$ is positive definite, $s^M$ is unique.

Note: $\lambda^M$ is the Lagrange multiplier

Philippe Toint (Namur)

April 2009

88 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The exact minimizer: a geometrical view


[figures: contours of $q(s)$ and the exact minimizer $s^M$ for several radii $\Delta$]

Philippe Toint (Namur)

April 2009

89 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Finding the exact minimizer


Eigenvalue decomposition of $H$:
    $H = U^T \Lambda U$
where $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$. Characterization implies that
    $\lambda^M \geq -\lambda_1$
Suppose that $\lambda > -\lambda_1$ and define
    $s(\lambda) = -H(\lambda)^{-1} g = -U^T (\Lambda + \lambda I)^{-1} U g$
New formulation (one dimensional): $\|s(\lambda)\|_2 = \Delta$
    $\|s(\lambda)\|_2^2 = \|U^T (\Lambda + \lambda I)^{-1} U g\|_2^2 = \|(\Lambda + \lambda I)^{-1} U g\|_2^2 = \sum_{i=1}^{n} \frac{\gamma_i^2}{(\lambda_i + \lambda)^2}$
where $\gamma_i = [U g]_i$.

Philippe Toint (Namur)

April 2009

90 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The convex case


[figure: $\|s(\lambda)\|_2$ as a function of $\lambda$ — the solution curve, which meets the level $\Delta$ at $\lambda^M$]

Philippe Toint (Namur)

April 2009

91 / 323

Trust region methods for unconstrained problems

Solving the subproblem

A nonconvex case
[figure: $\|s(\lambda)\|_2$ as a function of $\lambda$ in a nonconvex case, with poles at the $-\lambda_i$]

Philippe Toint (Namur)

April 2009

92 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The hard case: $\gamma_1 = 0$

[figure: $\|s(\lambda)\|_2$ in the hard case — no root larger than 2; root near 2.2]

Philippe Toint (Namur)

April 2009

93 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Near the hard case: $\gamma_1 \approx 0$

[figure: $\|s(\lambda)\|_2$ for $\gamma_1 = 1,\ 0.1,\ 0.01,\ 0.00001$]

Philippe Toint (Namur)

April 2009

94 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The secular equation


Idea: consider the secular equation
    $\phi(\lambda) \stackrel{\rm def}{=} \frac{1}{\|s(\lambda)\|_2} - \frac{1}{\Delta} = 0$
Then

[figure: $1/\|s(\lambda)\|_2$ for $\gamma_1 = 1,\ 0.1,\ 0.01,\ 0.00001$ — nearly linear in $\lambda$]

$\Rightarrow$ apply Newton's method to $\phi(\lambda) = 0$ :  $\lambda^+ = \lambda - \phi(\lambda)/\phi'(\lambda)$

Philippe Toint (Namur)

April 2009

95 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The derivatives of ()

Suppose $g \neq 0$. Then
$\phi(\lambda)$ is strictly increasing ($\lambda > -\lambda_1$), and concave.
    $\phi'(\lambda) = -\frac{\langle s(\lambda), \nabla_\lambda s(\lambda) \rangle}{\|s(\lambda)\|_2^3}$
where
    $\nabla_\lambda s(\lambda) = -H(\lambda)^{-1} s(\lambda).$
Note: if $H(\lambda) = L L^T$ and $L w = s(\lambda)$, then
    $\langle s(\lambda), \nabla_\lambda s(\lambda) \rangle = -\langle s(\lambda), L^{-T} L^{-1} s(\lambda) \rangle = -\|w\|^2$

Philippe Toint (Namur)

April 2009

96 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Newtons method on the secular equation


Algorithm 2.5: Exact trust-region solver
Let $\lambda > -\lambda_1$ and $\Delta > 0$ be given.
1. Factorize $H(\lambda) = L L^T$.
2. Solve $L L^T s = -g$.
3. Solve $L w = s$.
4. Replace $\lambda$ by
    $\lambda + \left( \frac{\|s\|_2 - \Delta}{\Delta} \right) \frac{\|s\|_2^2}{\|w\|_2^2}.$

But . . . more complications due to
bracketing the root (initial + update)
termination rule
may be preconditioned
Moré (1978), Moré-Sorensen (1983), Dollar-Gould-Robinson (2009)
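A sketch of Algorithm 2.5 for the easy (boundary-root) case: Newton's method on the secular equation using Cholesky factorizations. Safeguarding, bracketing and the hard case are deliberately omitted, and the starting lambda below is an arbitrary choice.

```python
# Sketch of Algorithm 2.5: Newton's method on the secular equation
#   phi(lambda) = 1/||s(lambda)|| - 1/Delta,  with  (H + lambda I) s = -g.
# Easy case only: no bracketing, no hard-case handling.
import numpy as np
from scipy.linalg import cho_factor, cho_solve, solve_triangular

def exact_tr_step(H, g, Delta, lam=0.0, maxit=50, tol=1e-10):
    n = H.shape[0]
    s = np.zeros(n)
    for _ in range(maxit):
        cf = cho_factor(H + lam * np.eye(n), lower=True)   # H(lambda) = L L^T
        s = cho_solve(cf, -g)                              # solve L L^T s = -g
        ns = np.linalg.norm(s)
        if ns <= Delta and lam == 0.0:
            break                                          # interior solution
        if abs(ns - Delta) <= tol * Delta:
            break                                          # boundary solution found
        w = solve_triangular(cf[0], s, lower=True)         # solve L w = s
        lam += ((ns - Delta) / Delta) * (ns**2 / (w @ w))  # secular Newton step
        lam = max(lam, 0.0)                                # keep lambda >= 0
    return s, lam

H = np.array([[2.0, 0.0], [0.0, 5.0]])
g = np.array([1.0, 1.0])
print(exact_tr_step(H, g, Delta=0.3))
```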
Philippe Toint (Namur)

April 2009

97 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Approximate solution by truncated CG

Fact: CG never re-enters the $\ell_2$ trust region (so it can be truncated at the boundary)

[figures: truncated-CG paths and the trust-region boundary]

May be preconditioned
Steihaug (1983), T. (1981)
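A sketch of the Steihaug-Toint truncated conjugate-gradient solver described here: run CG on the model and stop at the trust-region boundary or at negative curvature. The boundary-intersection formula is standard, and the tolerances are arbitrary choices.

```python
# Sketch: Steihaug-Toint truncated CG for  min <g,s> + 1/2 <s,Hs>,  ||s|| <= Delta.
import numpy as np

def to_boundary(s, p, Delta):
    # largest tau >= 0 with ||s + tau p|| = Delta
    a, b, c = p @ p, 2.0 * s @ p, s @ s - Delta**2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def truncated_cg(H, g, Delta, tol=1e-10, maxit=100):
    s = np.zeros_like(g)
    r, p = g.copy(), -g.copy()
    for _ in range(maxit):
        if np.linalg.norm(r) <= tol:
            return s
        Hp = H @ p
        kappa = p @ Hp
        if kappa <= 0.0:                             # negative curvature: go to the boundary
            return s + to_boundary(s, p, Delta) * p
        alpha = (r @ r) / kappa
        if np.linalg.norm(s + alpha * p) >= Delta:   # step would leave the region
            return s + to_boundary(s, p, Delta) * p
        s = s + alpha * p
        r_new = r + alpha * Hp
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return s

H = np.array([[1.0, 0.0], [0.0, -2.0]])              # indefinite Hessian
g = np.array([1.0, 0.5])
print(truncated_cg(H, g, Delta=1.0))
```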

Philippe Toint (Namur)

April 2009

98 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Approximate solution by the GLTR


The Steihaug-Toint method might hit the boundary already at the steepest-descent step ⇒ sometimes slow

Idea: solve the subproblem on the nested Krylov subspaces


Algorithm 2.6: Two-phase GLTR algorithm
as long as interior: conjugate-gradients
on the boundary: Lanczos method + subproblem solution in
Krylov space

(smooth transition)
Gould-Lucidi-Roma-T. (1999)
Philippe Toint (Namur)

April 2009

99 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Doglegs
Idea: use steepest descent and the full Newton step (requires convexity?)

[figure: the dogleg curve through $s^C$ and $s^D$ towards $s^N$, the double-dogleg curve through $s^{DD}$, and the trust-region boundary]

Powell (1970), Dennis-Mei (1979)
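A sketch of the (single) dogleg step for a convex quadratic model: follow the steepest-descent segment to the Cauchy point, then the segment towards the Newton step, truncated at the trust-region boundary. The boundary-intersection helper mirrors the one used for truncated CG.

```python
# Sketch: dogleg step for min <g,s> + 1/2 <s,Hs>, ||s|| <= Delta, H positive definite.
import numpy as np

def dogleg(H, g, Delta):
    sN = np.linalg.solve(H, -g)                  # full Newton step
    if np.linalg.norm(sN) <= Delta:
        return sN
    sC = -((g @ g) / (g @ (H @ g))) * g          # unconstrained Cauchy point
    if np.linalg.norm(sC) >= Delta:
        return -(Delta / np.linalg.norm(g)) * g  # truncated steepest descent
    # intersection of the segment sC -> sN with the boundary ||s|| = Delta
    d = sN - sC
    a, b, c = d @ d, 2.0 * sC @ d, sC @ sC - Delta**2
    tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return sC + tau * d

H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, -2.0])
for Delta in (0.1, 0.5, 2.0):
    s = dogleg(H, g, Delta)
    print(Delta, np.round(s, 4), round(np.linalg.norm(s), 4))
```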


Philippe Toint (Namur)

April 2009

100 / 323

Trust region methods for unconstrained problems

Solving the subproblem

An eigenvalue approach
Rewrite
    $(H + \lambda M)\, s = -g$
as
    $\begin{pmatrix} H & g \end{pmatrix} \begin{pmatrix} s \\ 1 \end{pmatrix} = -\lambda M s$
or (introducing the parameter $\alpha$)
    $\begin{pmatrix} H & g \\ g^T & \alpha \end{pmatrix} \begin{pmatrix} s \\ 1 \end{pmatrix} = -\lambda(\alpha) \begin{pmatrix} M & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} s \\ 1 \end{pmatrix}$

choose $\alpha$ such that
    $\lambda \geq 0$, $H + \lambda M$ positive semi-definite, $\lambda (\|s\|_M - \Delta) = 0$

Rendl-Wolkowicz (1997), Rojas-Santos-Sorensen (1999)

Philippe Toint (Namur)


April 2009

101 / 323

Trust region methods for unconstrained problems

Bibliography

Bibliography for lesson 2 (1)


J. E. Dennis and H. H. W. Mei,
Two New Unconstrained Optimization Algorithms Which Use Function and Gradient Values,
Journal of Optimization Theory and Applications, 28(4):453-482, 1979.
H. S. Dollar, N. I. M. Gould and D. P. Robinson,
On solving trust-region and other regularised subproblems in optimization,
Rutherford-Appleton Laboratory, Chilton, UK, Report RAL-TR-2009-003, 2009.
S. M. Goldfeld, R. E. Quandt and H. F. Trotter,
Maximization by quadratic hill-climbing,
Econometrica, 34:541-551, 1966.
N. I. M. Gould, S. Lucidi, M. Roma and Ph. L. Toint,
Solving the trust-region subproblem using the Lanczos method,
SIAM Journal on Optimization, 9(2):504-525, 1999.
K. Levenberg,
A Method For The Solution Of Certain Problems In Least Squares,
Quarterly of Applied Mathematics, 2:164-168, 1944.
D. Marquardt,
An Algorithm For Least-Squares Estimation Of Nonlinear Parameters,
SIAM Journal on Applied Mathematics, 11:431-441, 1963.
J. J. Moré,
The Levenberg-Marquardt algorithm: implementation and theory,
Numerical Analysis, Dundee 1977 (A. Watson, ed.), Springer Verlag, Heidelberg, 1978.
J. J. Moré,
Recent developments in algorithms and software for trust region methods,
in Mathematical Programming: The State of the Art (A. Bachem, M. Grötschel and B. Korte, eds.), Springer Verlag,
Heidelberg, pp. 258-287, 1983.
J. J. Moré and D. C. Sorensen,
Computing A Trust Region Step,
SIAM Journal on Scientific and Statistical Computing, 4(3):553-572, 1983.
Philippe Toint (Namur)

April 2009

102 / 323

Trust region methods for unconstrained problems

Bibliography

Bibliography for lesson 2 (2)

D. D. Morrison,
Methods for nonlinear least squares problems and convergence proofs,
Proceedings of the Seminar on Tracking Programs and Orbit Determination, (J. Lorell and F. Yagi, eds.), Jet Propulsion
Laboratory, Pasadena, USA, pp. 1-9, 1960.
M. J. D. Powell,
A New Algorithm for Unconstrained Optimization,
in Nonlinear Programming (J. B. Rosen, O. L. Mangasarian and K. Ritter, eds.), Academic Press, London, pp. 31-65,
1970.
F. Rendl and H. Wolkowicz,
A Semidefinite Framework for Trust Region Subproblems with Applications to Large Scale Minimization,
Mathematical Programming, 77(2):273-299, 1997.
M. Rojas, S. A. Santos and D. C. Sorensen,
A new matrix-free algorithm for the large-scale trust-region subproblem,
CAAM, Rice University, TR99-19, 1999.
D. Winfield,
Function and functional optimization by interpolation in data tables,
Ph.D. Thesis, Harvard University, Cambridge, USA, 1969.

Philippe Toint (Namur)

April 2009

103 / 323

Derivative free optimization, filters and other topics

Lesson 3:
Derivative-free optimization,
infinite dimensions and filters

Philippe Toint (Namur)

April 2009

104 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

3.1: Derivative-free optimization

Philippe Toint (Namur)

April 2009

105 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

An application of trust-regions: unconstrained DFO


Consider the unconstrained problem
min f (x)
x

Gradient (and Hessian) of f (x) unavailable


physical measurement
object code
typically small-scale (but not always. . . )
Derivative free optimization (DFO)
f (x) typically very costly
Exploit each evaluation of f (x) to the utmost possible
considerable interest of practitioners
Philippe Toint (Namur)

April 2009

106 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Interpolation methods for DFO


Idea: Winfield (1973), Powell (1994)
Until convergence:
Use the available function values to build a polynomial
interpolation model mk :
    $m_k(y_i) = f(y_i)$   for all $y_i \in Y$;

Minimize the model in a trust region, yielding a new


potentially good point;
Compute a new function value.
Y = interpolation set { points yi at which f (yi ) is known }
Philippe Toint (Namur)

April 2009

107 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

A naive trust-region method for DFO: illustration

[figures: four snapshots of the naive interpolation-based trust-region method on a two-dimensional example]

Philippe Toint (Namur)

April 2009

108 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Interpolation methods for DFO (2)

To be considered:
poisedness of the interpolation set Y
choice of models (linear, quadratic, in between, beyond)
convergence theory
numerical performance

Philippe Toint (Namur)

April 2009

109 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Poisedness

Assume a quadratic model


    $m_k(x_k + s) = f_k + \langle g_k, s \rangle + \tfrac12 \langle s, H_k s \rangle$
Thus
    $p = 1 + n + \tfrac12 n(n+1) = \tfrac12 (n+1)(n+2)$
parameters to determine $\Rightarrow$ need $p$ function values ($|Y| = p$)
Not sufficient!
need geometric conditions for the points in Y . . .

Philippe Toint (Namur)

April 2009

110 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Poisedness: geometry with n = 2, p = 6

[figures: six data points in $\mathbb{R}^3$ and several quadratics interpolating them]
With these 6 data points . . . is this the correct interpolation? . . . or this? . . . or this?
The difference between two such interpolants is zero on a quadratic curve containing $Y$!

Philippe Toint (Namur)

April 2009

111 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Poisedness: geometry (2)


If $\{\phi_i(\cdot)\}_{i=1}^{p}$ = basis for quadratic polynomials
    $\sum_{i=1}^{p} \alpha_i \phi_i(y_j) = f(y_j), \qquad j = 1, \ldots, p$

Possible poisedness measure:
    $\delta(Y) = \det \begin{pmatrix} \phi_1(y_1) & \cdots & \phi_p(y_1) \\ \vdots & & \vdots \\ \phi_1(y_p) & \cdots & \phi_p(y_p) \end{pmatrix}$

$Y$ (well) poised $\Leftrightarrow$ $|\delta(Y)| \geq \kappa$
$\Rightarrow$ scale for the spread of the $y_i$'s
$\Rightarrow$ notion of geometry improvement
Philippe Toint (Namur)

April 2009

112 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Lagrange polynomials
Remarkable: replace $\bar y$ by $y_+$ in $Y$:
    $\frac{\delta(Y_+)}{\delta(Y)} = L(y_+, \bar y)$ is independent of the basis $\{\phi_i(\cdot)\}_{i=1}^{p}$
where, for $y \in Y$,
    $L(y, \bar y) = \begin{cases} 1 & \text{if } y = \bar y \\ 0 & \text{if } y \neq \bar y \end{cases}$
is the Lagrange fundamental polynomial

Note: for quadratic interpolation, $L(\cdot, \bar y)$ is a quadratic polynomial!
Powell (1994)
Philippe Toint (Namur)

April 2009

113 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Interpolation using Lagrange polynomials


Idea: use the Lagrange polynomials to define the (quadratic) interpolant
by
    $m_k(x_k + s) = \sum_{y \in Y_k} f(y)\, L_k(x_k + s, y)$

And then. . .
    $|f(x_k + s) - m_k(x_k + s)| \leq \kappa \sum_{y \in Y_k} \|x_k + s - y\|^2\, |L_k(x_k + s, y)|$
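A minimal sketch, for n = 2, of determining a quadratic interpolation model from p = 6 function values by solving the interpolation conditions in the monomial basis; the slides use Lagrange polynomials and careful geometry management instead, and the function and point set below are made up, so this only makes the interpolation conditions concrete.

```python
# Sketch: quadratic interpolation model in 2 variables from p = 6 points,
# by solving the interpolation conditions in the monomial basis.
import numpy as np

def quad_basis(y):
    x1, x2 = y
    return np.array([1.0, x1, x2, 0.5 * x1 * x1, x1 * x2, 0.5 * x2 * x2])

f = lambda y: (y[0] - 1.0) ** 2 + 4.0 * (y[1] + 0.5) ** 2     # function to model

Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]])            # interpolation set
Phi = np.array([quad_basis(y) for y in Y])                    # p x p matrix
alpha = np.linalg.solve(Phi, np.array([f(y) for y in Y]))     # model coefficients

# |det Phi| plays the role of the poisedness measure delta(Y) for this basis
print("det =", np.linalg.det(Phi))
model = lambda y: quad_basis(y) @ alpha
print(model(np.array([0.5, 0.5])), f(np.array([0.5, 0.5])))   # model reproduces f here
```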

Philippe Toint (Namur)

April 2009

114 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Interpolation using Lagrange polynomials: construction

[figures: the original function and the interpolation set, the six Lagrange polynomials of the set, and the final interpolating quadratic]

Philippe Toint (Namur)

April 2009

115 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Other algorithmic ingredients


include a new point in the interpolation set
need to drop an existing interpolation point?
select which one to drop: make Y as poised as possible

Note: model/function minimizer may produce bad geometry!!


geometry improvement procedure . . .
trust-region radius management
trust region = $\mathcal{B}_k = \{ x_k + s \mid \|s\| \leq \Delta_k \}$
standard: reduce $\Delta_k$ when no progress
DFO: more complicated! (Could reduce $\Delta_k$ too fast and prevent
convergence. . . )

verify that $Y$ is poised before reducing $\Delta_k$

Philippe Toint (Namur)

April 2009

116 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Improving the geometry in a ball

attempt to reuse past points that are close to xk


attempt to replace a distant point of Y
attempt to replace a close point of Y
good geometry for the current k improvement impossible
Philippe Toint (Namur)

April 2009

117 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Self-correction at unsuccessful iterations (1)

At iteration k, define the set of exchangeable far points:

    $\mathcal{F}_k = \{ y \in Y_k \mid \|y - x_k\| > \Delta_k \text{ and } L_k(x_k + s_k, y) \neq 0 \}$

and the set of exchangeable close points (for some $\Lambda > 1$):

    $\mathcal{C}_k = \{ y \in Y_k \setminus \{x_k\} \mid \|y - x_k\| \leq \Delta_k \text{ and } |L_k(x_k + s_k, y)| \geq \Lambda \}$

Philippe Toint (Namur)

April 2009

118 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Self-correction at unsuccessful iterations (2)


Remarkably,
Whenever
  iteration k is unsuccessful,
  $\mathcal{F}_k = \emptyset$,
  $\Delta_k$ is small w.r.t. $\|g_k\|$,
then $\mathcal{C}_k \neq \emptyset$.
(an improvement of the geometry by a factor $\Lambda$ is always possible at
unsuccessful iterations when $\Delta_k$ is small and all exchangeable far points
have been considered)
$\Rightarrow$ no need to reduce $\Delta_k$ forever!

Philippe Toint (Namur)

April 2009

119 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Trust-region algorithm for DFO (1)


Algorithm 3.1: TR for DFO
Step 0: Initialization. Given: $x_0$, $\Delta_0$, $Y_0$ ($\Rightarrow L_0(\cdot, y)$). Set $k = 0$.
Step 1: Criticality test [complicated and not discussed here]
Step 2: Solve the subproblem. Compute $s_k$ that sufficiently reduces $m_k(x_k + s)$
within the trust region,
Step 3: Evaluation. Compute $f(x_k + s_k)$ and
    $\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)}.$

Step 4: Define the next iterate and interpolation set.


the big question
Step 5: Update the Lagrange polynomials.

Philippe Toint (Namur)

April 2009

120 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Trust-region algorithm for DFO (2)


Algorithm 3.2: Step 4: Define xk+1 and Yk+1
Step 4a: Successful iteration. If $\rho_k \geq \eta_1$, accept
$x_k + s_k$, increase $\Delta_k$ and exchange $x_k + s_k$ with
    $y_* = \arg\max_{y \in Y_k} \|y - (x_k + s_k)\|^2\, |L_k(x_k + s_k, y)|$

Step 4b: Replace far point. If $\rho_k < \eta_1$ (+ other technical condition) and $\mathcal{F}_k \neq \emptyset$, reject
$x_k + s_k$, keep $\Delta_k$ and exchange $x_k + s_k$ with
    $y_* = \arg\max_{y \in \mathcal{F}_k} \|y - (x_k + s_k)\|^2\, |L_k(x_k + s_k, y)|$

Step 4c: Replace close point. If $\rho_k < \eta_1$ (+ other technical condition) and $\mathcal{C}_k \neq \emptyset$, reject
$x_k + s_k$, keep $\Delta_k$ and exchange $x_k + s_k$ with
    $y_* = \arg\max_{y \in \mathcal{C}_k} \|y - (x_k + s_k)\|^2\, |L_k(x_k + s_k, y)|$

Step 4d: Decrease the radius. Otherwise, reject $x_k + s_k$, keep $Y_k$, and reduce $\Delta_k$.
Philippe Toint (Namur)

April 2009

121 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Global convergence results


If the model is at least fully linear, then
    $\liminf_{k\to\infty} \|\nabla_x f(x_k)\| = \liminf_{k\to\infty} \|g_k\| = 0$

Scheinberg and T. (2009)

With a more costly algorithm:

If the model is at least fully linear, then
    $\lim_{k\to\infty} \|\nabla_x f(x_k)\| = \lim_{k\to\infty} \|g_k\| = 0$

If the model is at least fully quadratic, then the iterates converge to
2nd-order critical points
Philippe Toint (Namur)

April 2009

122 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

For an efficient numerical method. . .


Many more issues:
which Hessian approximation?
(full vs. diagonal or structured)
details of criticality tests difficult
details for numerically handling interpolation polynomials
(Lagrange, Newton),
reference shifts,
...
good codes around: NEWUOA, DFO ⇒ efficient solvers
Powell (2008 and previously), Conn, Scheinberg and T. (1998)
Conn, Scheinberg and Vicente (2008)

Philippe Toint (Namur)

April 2009

123 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

On the ever famous banana function. . .

[3D surface plot of the Rosenbrock "banana" function over the region indicated by the axis values; the original slides repeat this picture as an animation of the progress of the derivative-free algorithm]

Philippe Toint (Namur)

April 2009

124 / 323

Derivative free optimization, filters and other topics

Infinite dimensional problems

3.2: Infinite dimensional problems

Philippe Toint (Namur)

April 2009

125 / 323

Derivative free optimization, filters and other topics

Infinite dimensional problems

Why consider infinite dimensions?

Main motivation:

large-scale finite dimensional problems often result from discretized


continuous ones (surfaces, time-trajectories, optimal control, . . . )
behaviour on these problems dominated by infinite dimensional
properties
Need to investigate infinite dimensions to ensure consistency!
Two main cases: Hilbert and Banach spaces.

Philippe Toint (Namur)

April 2009

126 / 323

Derivative free optimization, filters and other topics

Infinite dimensional problems

Convergence in Hilbert spaces

The trust-region algorithm is well-defined and globally convergent in Hilbert spaces.

Riesz representation theorem ⇒ V′ ≃ V
Cauchy point results from one-dimensional minimization
(but x_k^M may not exist!)

$\beta_k \stackrel{\rm def}{=} 1 + \sup_{x \in B_k} \|\nabla_{xx} m_k(x)\|_{V,V'}, \qquad \lambda_{\min}[H] \stackrel{\rm def}{=} \inf_{d \in V,\, d \neq 0} \frac{\langle d, H d \rangle}{\langle d, d \rangle}$

Philippe Toint (Namur)

April 2009

127 / 323

Derivative free optimization, filters and other topics

Infinite dimensional problems

What happens in Banach spaces?

Problem: the dual space is different from the primal!

Need further assumptions:

∇_x f(x) ∈ V for all x ∈ V.
∇_x f is uniformly continuous from V to V.
For every x ∈ {x ∈ V | f(x) ≤ f(x_0)},
$\langle \nabla_x f(x), \nabla_x f(x) \rangle \ge \varphi(\|\nabla_x f(x)\|_{V'}) \, \|\nabla_x f(x)\|_{V},$
for some continuous monotonically increasing real function φ from [0, ∞] to itself, independent of x and such that φ(0) = 0
and φ(t) > 0 for t > 0.

Philippe Toint (Namur)

April 2009

128 / 323

Derivative free optimization, filters and other topics

Infinite dimensional problems

Convergence in Banach spaces, nevertheless


The last assumption implies
$\langle g_k, g_k \rangle \ge \varphi(\|g_k\|_{V'}) \, \|g_k\|_{V}$
. . . and sufficient decrease follows!
Is this realistic?
The additional assumptions always hold for V = L^p(Ω) and 2 ≤ p < ∞, whenever the gradients remain bounded in L^p(Ω).
Under these additional assumptions, the trust-region algorithm
is well-defined and globally convergent in Banach spaces.

Philippe Toint (Namur)

April 2009

129 / 323

Derivative free optimization, filters and other topics

Filter algorithms

3.3: Filter algorithms

Philippe Toint (Namur)

April 2009

130 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Monotonicity (1)

Global convergence theoretically ensured by


some global measure. . .
unconstrained : f (xk )
(constrained : some merit function at xk )

. . . with strong monotonic behaviour (Lyapunov function)


Also practically enforced by
algorithmic safeguards around Newton's method
(linesearches, trust regions)

Philippe Toint (Namur)

April 2009

131 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Monotonicity (2)
But, unfortunately,
classical safeguards limit efficiency!
Of interest: design less obstructive safeguards while
ensuring better numerical performance
(the Newton Liberation Front!)
continuing to guarantee global convergence properties
Is this possible?
Typically:
abandon strict monotonicity of usual measures
but insist on average behaviour instead
Philippe Toint (Namur)

April 2009

132 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Non-monotone trust-regions
Idea: f(x_{k+1}) < f(x_k) replaced by f(x_{k+1}) < f_{r(k)}
with
f_{r(k)} < f_{r(k−1)}
Further issues:
suitably define the reference iteration r (k)
adapt the trust-region algorithm: also compare achieved and
predicted reductions since reference iteration
T. (1997)

Philippe Toint (Namur)

April 2009

133 / 323
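As a rough illustration (a sketch, not the reference implementation), the non-monotone acceptance test amounts to a single ratio computation; sigma_hist below is assumed to hold the accumulated predicted decreases since the reference iteration, as defined in Algorithm 3.3 on the next slide.

def nonmonotone_ratio(f_ref, f_k, f_trial, pred_k, sigma_hist):
    """Non-monotone trust-region acceptance ratio (sketch).

    f_ref      : f(x_{r(k)}), objective value at the reference iteration
    f_k        : f(x_k)
    f_trial    : f(x_k + s_k)
    pred_k     : m_k(x_k) - m_k(x_k + s_k) > 0, current predicted decrease
    sigma_hist : sum of the predicted decreases over the successful
                 iterations since the reference iteration
    """
    rho_hist = (f_ref - f_trial) / (sigma_hist + pred_k)   # average behaviour
    rho_std = (f_k - f_trial) / pred_k                     # classical TR ratio
    return max(rho_hist, rho_std)

# The trial step is accepted when nonmonotone_ratio(...) >= eta1, exactly as in
# the monotone case, but f(x_k + s_k) is now allowed to exceed f(x_k).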

Derivative free optimization, filters and other topics

Filter algorithms

Non-monotone TR algorithm
Algorithm 3.3: Non monotone TR algorithm (NMTR)
Step 0: Initialization. Given: x_0, Δ_0, η_1, η_2, γ_1, γ_2. Compute f(x_0), set k = 0.
Step 1: Model definition. Choose the norm ‖·‖_k and define m_k in B_k.
Step 2: Step calculation. Compute s_k that sufficiently reduces m_k and x_k + s_k ∈ B_k.
Step 3: Acceptance of the trial point. Compute f(x_k + s_k). Define the reference iteration r(k) ≤ k and

$\sigma_k^h = \sum_{i = r(k),\, i \in S}^{k-1} \big[ m_i(x_i) - m_i(x_i + s_i) \big]$

(S denotes the set of successful iterations). Define

$\rho_k = \max\left[ \frac{f(x_{r(k)}) - f(x_k + s_k)}{\sigma_k^h + m_k(x_k) - m_k(x_k + s_k)},\; \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)} \right].$

If ρ_k ≥ η_1, then define x_{k+1} = x_k + s_k; otherwise define x_{k+1} = x_k.

Step 4: Trust-region radius update. Set

$\Delta_{k+1} \in \begin{cases} [\Delta_k, \infty) & \text{if } \rho_k \ge \eta_2, \\ [\gamma_2 \Delta_k, \Delta_k) & \text{if } \rho_k \in [\eta_1, \eta_2), \\ [\gamma_1 \Delta_k, \gamma_2 \Delta_k] & \text{if } \rho_k < \eta_1. \end{cases}$

Increment k by one and go to Step 1.
Philippe Toint (Namur)
April 2009

134 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Sufficient decrease for NMTR


k
X

$f(x_{p(k)}) - f(x_{k+1}) \ge \eta_1 \kappa_{\rm mdc} \sum_{j = p(k),\, j \in S}^{k} \|g_j\| \min\left[ \frac{\|g_j\|}{\beta_j},\, \Delta_j \right]$

with p(k) = r(k) when σ_k^h ≥ σ_k^c, or p(k) = k otherwise

[Figure: the sequence of objective values f(x_k), decreasing non-monotonically with k]

$f(x_0) - f(x_{k+1}) \ge \eta_1 \kappa_{\rm mdc} \sum_{t = 0,\, t \in S}^{k} \|g_t\| \min\left[ \frac{\|g_t\|}{\beta_t},\, \Delta_t \right].$

Philippe Toint (Namur)
April 2009

135 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Choosing the reference iteration (1)

Algorithm 3.4: Choosing r (k)


Step 3: Acceptance of the trial point.
Step 3a: update the iterate. Compute f(x_k + s_k) and set

$\rho_k = \max\left[ \frac{f_r - f(x_k + s_k)}{\sigma_r + m_k(x_k) - m_k(x_k + s_k)},\; \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)} \right].$

If ρ_k < η_1, then x_{k+1} = x_k and go to Step 4; otherwise x_{k+1} = x_k + s_k and
σ_c = σ_c + m_k(x_k) − m_k(x_{k+1}) and σ_r = σ_r + m_k(x_k) − m_k(x_{k+1}).
Step 3b: update the best value. If f(x_{k+1}) < f_min then set f_c = f_min = f(x_{k+1}),
σ_c = 0 and ℓ = 0 and go to Step 4; otherwise, ℓ ← ℓ + 1.
Step 3c: update the reference candidate. If f(x_{k+1}) > f_c, set f_c = f(x_{k+1}) and σ_c = 0.
Step 3d: possibly reset the reference value. If ℓ = m, set f_r = f_c and σ_r = σ_c.

Philippe Toint (Namur)

April 2009

136 / 323
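The counters used in Steps 3b to 3d fit in a tiny state record; the sketch below is one possible reading of these update rules (the attribute names are mine, and the memory length m is assumed to be given).

class NMTRReference:
    """Bookkeeping for the reference value f_r of Algorithm 3.4 (sketch)."""

    def __init__(self, f0, m):
        self.f_min = f0      # best objective value found so far
        self.f_c = f0        # candidate reference value
        self.f_r = f0        # current reference value
        self.sigma_c = 0.0   # predicted decrease accumulated since the candidate
        self.sigma_r = 0.0   # predicted decrease accumulated since the reference
        self.ell = 0         # iterations since the last best value
        self.m = m           # memory length

    def on_success(self, f_new, pred):
        """Update after an accepted step with value f_new and predicted decrease pred."""
        self.sigma_c += pred
        self.sigma_r += pred
        if f_new < self.f_min:               # Step 3b: new best value
            self.f_min = self.f_c = f_new
            self.sigma_c = 0.0
            self.ell = 0
            return
        self.ell += 1
        if f_new > self.f_c:                 # Step 3c: update the candidate
            self.f_c = f_new
            self.sigma_c = 0.0
        if self.ell == self.m:               # Step 3d: reset the reference
            self.f_r = self.f_c
            self.sigma_r = self.sigma_c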

Derivative free optimization, filters and other topics

Filter algorithms

Choosing the reference iteration (2): example with m = 2

[Figure: the sequence of objective values f(x_k) versus k, with markers indicating the reference iteration, each new best value, and the iterations where the reference is redefined (ℓ = m); the row of numbers below the plot gives the counter ℓ at each iteration:]

0 0 0 0 1 2 0 1 0 1 1 1 2 1 2 0 1 1 1 1 2 0 1 1 2 0 0 1 2 0
Philippe Toint (Namur)

April 2009

137 / 323

Derivative free optimization, filters and other topics

Filter algorithms

An unconstrained example
[Plot: convergence history (value versus iteration count) for the two variants]

Monotone and non-monotone TR (using LANCELOT B) on EXTROSNB


Philippe Toint (Namur)

April 2009

138 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Introducing the filter

A fruitful alternative: filter methods


Constrained optimization:
using the SQP step, at the same time:
reduce the objective function f(x)
reduce the constraint violation θ(x)
⇒ CONFLICT

Philippe Toint (Namur)

April 2009

139 / 323

Derivative free optimization, filters and other topics

Filter algorithms

The filter point of view


Fletcher and Leyffer replace the question:
What is a better point?
by:
What is a worse point?
Of course, y is worse than x when
f(x) ≤ f(y) and θ(x) ≤ θ(y)
(y is dominated by x)
When is x_k + s_k acceptable?
Fletcher and Leyffer (2002), Fletcher, Gould, Leyffer, T. and Wächter (2002)
Philippe Toint (Namur)

April 2009

140 / 323

Derivative free optimization, filters and other topics

Filter algorithms

The standard filter


Idea: accept non-dominated points
no monotonicity of merit function implied
[Figure: filter entries in the (θ(x), f(x)) plane; each entry blocks the region it dominates]

Philippe Toint (Namur)
April 2009

141 / 323
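In code, acceptability with respect to the filter is a simple domination check over the stored (θ, f) pairs; the sketch below also includes a small margin gamma, an assumption on my part that matches the envelope illustrated on the next slide.

def acceptable_to_filter(theta_new, f_new, filter_entries, gamma=1e-5):
    """Return True if (theta_new, f_new) is not dominated by any filter entry (sketch).

    filter_entries : iterable of (theta_i, f_i) pairs already in the filter
    gamma          : small margin enforcing a minimal improvement
    """
    for theta_i, f_i in filter_entries:
        dominated = (theta_new >= (1.0 - gamma) * theta_i and
                     f_new >= f_i - gamma * theta_i)
        if dominated:
            return False
    return True

A trial point is then accepted when it is acceptable to the filter, without requiring any monotone decrease of a merit function.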

Derivative free optimization, filters and other topics

Filter algorithms

Filling up the standard filter


Note: the filter area is bounded in the (θ, f) space!

[Figure: the envelope added around a new filter entry, with margins f(x_k) − γθ_k and (1 − γ)θ_k]

filter area (non)-monotonically decreasing
Philippe Toint (Namur)

April 2009

142 / 323

Derivative free optimization, filters and other topics

Filter algorithms

The (unconstrained) feasibility problem


Feasibility
Find x such that
c(x) ≥ 0 and e(x) = 0
for general smooth c and e.

Least-squares
Find x by solving
$\min_x \sum_i \theta_i(x)^2$
where the θ_i measure the violations of the individual constraints.

Philippe Toint (Namur)

April 2009

143 / 323

Derivative free optimization, filters and other topics

Filter algorithms

A multidimensional filter (1)


(Simple) idea: more dimensions in filter space
[Figure: filter entries in the (θ_1(x), θ_2(x)) plane]

(full dimension vs. grouping)


Philippe Toint (Namur)

April 2009

144 / 323

Derivative free optimization, filters and other topics

Filter algorithms

A multidimensional filter (2)

Additionally
possibly consider unsigned filter entries
use a trust-region algorithm when the trial point is unacceptable
convergence to a non-zero solution (⇒ internal restoration)

Sound convergence theory

Gould, Leyffer and T. (2005)

Philippe Toint (Namur)

April 2009

145 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience: FILTRANE

Fortran 95 package
large scale problems (CUTEr interface)
includes several variants of the method
signed/unsigned filters
Gauss-Newton, Newton or adaptive models
pure trust-region option
uses preconditioned conjugate-gradients
+ Lanczos for subproblem solution

part of the GALAHAD library


Gould, Orban and T. (2003), Gould and T. (2007)

Philippe Toint (Namur)

April 2009

146 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience (1)


[Performance profile: fraction of problems for which the solver is within a factor of the best; curves: Default and Pure trust region]

Filter vs. trust-region (CPU time)


Philippe Toint (Namur)

April 2009

147 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience (2)


[Performance profile p(·); curves: Filter and LANCELOT]

Filter vs. LANCELOT B (CPU time)


Philippe Toint (Namur)

April 2009

148 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience (3)


[Performance profile p(·); curves: Filter and Unfettered]

Filter vs. free Newton (CPU time)


Philippe Toint (Namur)

April 2009

149 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Filter for unconstrained optimization


Again simple idea: use the gradient components g_i instead of θ_i

[Figure: filter entries in the (g_1(x), g_2(x)) plane]

(full dimension vs. grouping)


Philippe Toint (Namur)

April 2009

150 / 323

Derivative free optimization, filters and other topics

Filter algorithms

A few complications. . .
But . . .

g (x) = 0 not sufficient for nonconvex problems!

When negative curvature found:


reset filter
set upper bound on acceptable f (x)
(or. . . add a dimension for f in the filter)
reasonable convergence theory

Philippe Toint (Namur)

April 2009

151 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience (1)


[Performance profile p(·); curves: Default, Pure trust region and LANB]

Filter vs. trust-region and LANCELOT B (iterations)


Philippe Toint (Namur)

April 2009

152 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience: HEART6


[Plot: log of residual versus iterations]

Filter vs. trust-region and LANCELOT B


Philippe Toint (Namur)

April 2009

153 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience: EXTROSNB


[Plot: log of residual versus iterations]

Filter vs. trust-region and LANCELOT B


Philippe Toint (Namur)

April 2009

154 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience: LOBSTERZ


[Plot: log of residual versus iterations]

Filter vs. trust-region


Philippe Toint (Namur)

April 2009

155 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Conclusions

derivative-free optimization possible and efficient


non-monotonicity definitely helpful
filter methods very efficient
Newton's behaviour unexplained
. . . more research needed?

Philippe Toint (Namur)

April 2009

156 / 323

Derivative free optimization, filters and other topics

Bibliography

Bibliography for lesson 3 (1)


1

A. R. Conn, K. Scheinberg and Ph. L. Toint,


A Derivative Free Optimization Algorithm in Practice,
Proceedings of the 7th AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, St.
Louis, Missouri, September 1998.
A. R. Conn, K. Scheinberg and L. N. Vicente,
Introduction to Derivative-free Optimization,
SIAM-MPS Series on Optimization, 2008.
R. Fletcher and S. Leyffer,
Nonlinear Programming without a penalty function,
Mathematical Programming A, 91(2):239-269, 2002.
R. Fletcher, N. I. M. Gould, S. Leyffer, Ph. L. Toint and A. Wächter,
Global Convergence of Trust-Region SQP-Filter Algorithms for Nonlinear Programming,
SIAM Journal on Optimization, 13(3):635-659, 2002.
N. I. M. Gould and S. Leyffer and Ph. L. Toint,
A Multidimensional Filter Algorithm for Nonlinear Equations and Nonlinear Least-Squares,
SIAM Journal on Optimization, 15(1):17-38, 2005.
N. I. M. Gould, D. Orban and Ph. L. Toint,
GALAHAD, a library of thread-safe Fortran 90 packages for large-scale nonlinear optimization, ACM Transactions on
Mathematical Software, 29(4):353-372, 2003.
N. I. M. Gould,C. Sainvitu and Ph. L. Toint,
A Filter-Trust-Region Method for Unconstrained Optimization,
SIAM Journal on Optimization, 16(2):341-357, 2005.
N. I. M. Gould and Ph. L. Toint,
FILTRANE, a Fortran 95 filter-trust-region package for solving systems of nonlinear equalities, nonlinear inequalities
and nonlinear least-squares problems,
ACM Transactions on Mathematical Software, 33(1):3-25, 2007.
M. J. D. Powell,
A direct search optimization method that models the objective by quadratic interpolation,
Presentation at the 5th Stockholm Optimization Days, Stockholm, 1994.
Philippe Toint (Namur)

April 2009

157 / 323

Derivative free optimization, filters and other topics

Bibliography

Bibliography for lesson 3 (2)

10 M. J. D. Powell,
Developments of NEWUOA for minimization without derivatives,
IMA Journal of Numerical Analysis, 28(4):649-664, 2008.
11 K. Scheinberg and Ph. L. Toint,
Self-correcting geometry in model-based algorithms for derivative-free unconstrained optimization,
Report TR09/06, FUNDP, Namur, 2009.
12 Ph. L. Toint,
A non-monotone trust-region algorithm for nonlinear optimization subject to convex constraints,
Mathematical Programming, 77(1):69-94, 1997.
13 D. Winfield,
Function Minimization by Interpolation in a Data Table,
Journal of the IMA, 12:339-347, 1973.

Philippe Toint (Namur)

April 2009

158 / 323

Convex constraints and interior-point methods

Lesson 4:
Optimization with
convex constraints

Philippe Toint (Namur)

April 2009

159 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

4.1: Projection algorithms

Philippe Toint (Namur)

April 2009

160 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Projections on simple convex domains (1)


[Figure: projections P_C(y) of several points y onto a box; points already in C satisfy y = P_C(y)]

$[P_C(y)]_i \stackrel{\rm def}{=} \begin{cases} [x_\ell]_i & \text{if } [y]_i \le [x_\ell]_i, \\ [y]_i & \text{if } [x_\ell]_i < [y]_i < [x_u]_i, \\ [x_u]_i & \text{if } [x_u]_i \le [y]_i. \end{cases}$

Philippe Toint (Namur)

April 2009

161 / 323
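For simple bounds, the componentwise formula above is just a clip, which makes the projection trivial to compute; a minimal sketch:

import numpy as np

def project_onto_box(y, x_lower, x_upper):
    """Componentwise projection of y onto the box [x_lower, x_upper]."""
    return np.minimum(np.maximum(y, x_lower), x_upper)

# Example: projecting (2, -3, 0.5) onto the unit box [0, 1]^3 gives (1, 0, 0.5).
# project_onto_box(np.array([2.0, -3.0, 0.5]), np.zeros(3), np.ones(3))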

Convex constraints and interior-point methods

Projections and the projected gradient path

Projections on simple convex domains (2)

[Figure: projection onto a general convex set; points of C are their own projection, y = P_C(y)]

Philippe Toint (Namur)

April 2009

162 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Projections on simple convex domains (2)


. . . but also the ordered simplex . . .

[3D figure: the ordered simplex in the variables x1, x2, x3]

Idea: use those simple projections!


Philippe Toint (Namur)

April 2009

163 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

The projected gradient path


Define the projected gradient path = the Cauchy arc

$p(t, x) = P_C[x - t \nabla_x f(x)]$

[Figure: the piecewise-linear projected gradient path; beyond some value t_m of the parameter the path no longer moves, p(t, x) = p(t_m, x)]

Philippe Toint (Namur)

April 2009

164 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Two projections

[Figures: the projection $P_{T(x)}[-\nabla_x f(x)]$ onto the tangent cone and the projection $P_C[x - \nabla_x f(x)]$ onto C]

Philippe Toint (Namur)

April 2009

165 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Measuring criticality

Measure the gain in the linearized objective function per step of length Δ:

$\chi(x, \Delta) \stackrel{\rm def}{=} \left| \min_{x + d \in \mathcal{F},\, \|d\| \le \Delta} \langle \nabla_x f(x), d \rangle \right|, \qquad \pi(x, \Delta) = \frac{\chi(x, \Delta)}{\Delta}$

(the length of the projected gradient step, $\|P_{\mathcal{F}}(x - t\,g(x)) - x\|$, is also used)

Philippe Toint (Namur)

April 2009

166 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

The criticality measure






$\chi(x) \stackrel{\rm def}{=} \chi(x, 1) = \left| \min_{x + d \in \mathcal{F},\, \|d\| \le 1} \langle \nabla_x f(x), d \rangle \right|$

= the feasible reduction in the linearized objective for unit steps

reduces to ‖∇_x f(x)‖_2 in the unconstrained case

Philippe Toint (Namur)

April 2009

167 / 323
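For bound constraints and the infinity norm on the step d, the measure χ(x) separates over the components and can be evaluated in closed form; this is a sketch under those two assumptions (the slides use a generic norm), not the general definition.

import numpy as np

def chi_box(x, g, lower, upper):
    """chi(x) = | min { <g, d> : lower <= x + d <= upper, ||d||_inf <= 1 } | (sketch)."""
    lo = np.maximum(-1.0, lower - x)   # smallest feasible d_i
    hi = np.minimum(1.0, upper - x)    # largest feasible d_i
    # each component independently picks the end of its interval minimizing g_i * d_i
    d = np.where(g > 0, lo, hi)
    return abs(np.dot(g, d))

# Far from the bounds this reduces to the l1 norm of g, the infinity-norm
# analogue of the unconstrained remark on the slide above.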

Convex constraints and interior-point methods

Projections and the projected gradient path

The projected gradient path and χ

[Figure: the projected gradient path from x_k and the point x_k + d_k that achieves the criticality measure χ_k]

Philippe Toint (Namur)

April 2009

168 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

The generalized Cauchy point


Approximately minimize m_k(·) on the PG path:
Find

$x_k^{GC} \stackrel{\rm def}{=} P_{\mathcal{F}}[x_k - t_k^{GC} g_k] = x_k + s_k^{GC} \qquad (t_k^{GC} > 0)$

such that

$m_k(x_k^{GC}) \le f(x_k) + \kappa_{\rm ubs} \langle g_k, s_k^{GC} \rangle$   (below linear approximation)

and either

$m_k(x_k^{GC}) \ge f(x_k) + \kappa_{\rm lbs} \langle g_k, s_k^{GC} \rangle$   (above linear approximation)

or

$\|P_{T(x_k^{GC})}[-g_k]\| \le \kappa_{\rm epp}\, |\langle g_k, s_k^{GC} \rangle|$   (close to the path's end)

or

$\|s_k^{GC}\| \ge \kappa_{\rm frd} \Delta_k$   (close to the TR boundary)

Philippe Toint (Namur)
April 2009

169 / 323
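A point satisfying conditions of this type is usually located by a piecewise (backtracking or expanding) search on the parameter t of the projected gradient path; the sketch below implements only the backtracking half and only the "below the linear approximation" test, with hypothetical constants, so it is an illustration rather than the full search of the slides.

import numpy as np

def generalized_cauchy_point(x, g, model, project, Delta,
                             kappa_ubs=0.1, t0=1.0, shrink=0.5, max_iter=50):
    """Backtracking search along the projected gradient path (sketch).

    model(p)   : value of the quadratic model m_k at the point p
    project(p) : projection P_F onto the feasible set
    """
    m0 = model(x)
    t = t0
    for _ in range(max_iter):
        p = project(x - t * g)
        s = p - x
        within_tr = np.linalg.norm(s) <= Delta
        below_linear = model(p) <= m0 + kappa_ubs * np.dot(g, s)
        if within_tr and below_linear:
            return p, s, t
        t *= shrink                        # step too ambitious: backtrack
    return x, np.zeros_like(x), 0.0        # fallback (not reached in theory)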

Convex constraints and interior-point methods

Projections and the projected gradient path

Searching for the GCP (1)


[Plot: the model m_k along the projected gradient path, as a function of the path parameter]
mk (0 + s) = 3.57s1 1.5s2 s3 + s1 s2 + 3s2 + s2 s3 2s3 such that s 1.5 and 2.8

Philippe Toint (Namur)

April 2009

170 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Searching for the GCP (2)


[Plot: the model m_k along the projected gradient path, as a function of the path parameter]
mk (0 + s) = 3.57s1 1.5s2 s3 + s1 s2 + 3s2 + s2 s3 2s3 such that s 1.5 and 1.8

Philippe Toint (Namur)

April 2009

171 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Useful properties
Piecewise search for xkGC well-defined and finite
1

(, ), (, ) and (, ) are continuous

(x, ) is non-decreasing

(x, ) is non-decreasing

(x, ) is non-increasing

(xk ) (xk , kskGC k) + 2kPT (x GC ) [gk ]k

hgk , skGC i = (xk , kskGC k) 0

(xk , t) t kPT (x(t)) [x f (xk )]k

|(x) (y )| Lkx y k
if x f (x) is continuous on a bounded level set

Philippe Toint (Namur)

April 2009

172 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

Cauchy decrease along the projected gradient path


The Cauchy condition: minimize mk long the projected gradient path



k
mk (xk ) mk (xk + sk ) CR k min
, k , 1
1 + kHk k
Idea: Linesearch conditions imply
mk (xk ) mk (xkGC ) ubs |hgk , skGC i| = ubs (xk , kskGC k)
but need
kPT (P[xk tj gk ]) [gk ]k epp

|hgk , sk (tj )i|


k

def

Now define k = min[1, k ] k . Then




k
mk (xk ) mk (xk ) dcp k min
, k
k
GC

Philippe Toint (Namur)

April 2009

173 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

How far can we turn the handle?

As above. . .
All limit points are first-order critical, i.e.
$\lim_{k \to \infty} \chi_k = 0$

But . . .
does the active set settle ?
(needed for 2nd-order convergence or rate)

Philippe Toint (Namur)

April 2009

174 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

Active constraints identification (1)


Require further assumptions: let L = { limit points of {xk } }
x L , {x ci (x )}iA(x ) are linearly independent
x L , x f (x ) ri{N (x )}
k, A(xkGC ) A(xk + sk )
For each connected component of limit points L(x ) L ,
there exists a set A {1, . . . , m} for which
A(x ) = A for all x L(x ).
Idea: connectivity + uniqueness of Lagrange multipliers
each L(x ) belongs to a single facet of C
Philippe Toint (Namur)

April 2009

175 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

Active constraints identification (2)


There exists a (0, 1) such that
dist(x , L0 )
for every x L and each compact connected component of
limit points L0 such that A(L0 ) 6= A(x ).
Idea: continuity + compactness well separated
There exist (0, 14 ), (0, 1), and k1 0 such that, for
k k1 , there is a Lk such that
xk V(Lk , ) = {x IRn | dist(x, Lk ) }
and
A(x) A(Lk ) for all x V(Lk , ).
Idea: partition the complete sequence into convergent subsequences
each xk near a unique Lk
Philippe Toint (Namur)

April 2009

176 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

Active constraints identification (3)


There exists k2 k1 such that, if for some k k2 ,
j A(Lk ) and j 6 A(xkGC ),
then, for some  (0, 1) independent of k and j,
k  .
Idea: complicated (uses criticality measures for incomplete constraint sets)
incomplete local A(xk ) implies not critical
(more technical arguments here)
There exists an active set A , such that
x L

A(x ) = A

and, for all k sufficiently large,


A(xk ) = A(xkGC ) = A
Philippe Toint (Namur)

April 2009

177 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

Further convergence results

. . . and now it works in T (xk ) ( now continuous for large k ) with


∇_xx m_k replaced by ∇_xx m_k^ℓ ≈ ∇_xx ℓ(x_k, y_k)
convergence to isolated critical points
(generalized) eigen-points for the Lagrangian
(needs consistent multiplier estimates!)
convergence to second-order points
fast asymptotic rate of convergence

Philippe Toint (Namur)

April 2009

178 / 323

Convex constraints and interior-point methods

Barriers and interior points

4.2: Barrier methods

Philippe Toint (Namur)

April 2009

179 / 323

Convex constraints and interior-point methods

Barriers and interior points

A simple case
Consider C = {x ∈ IR^n | x ≥ 0} and build

$f(x) - \mu \langle e, \log(x) \rangle = f(x) - \mu \sum_{i=1}^{n} \log(x_i)$

Under acceptable assumptions, the minimizers

$x(\mu) = \arg\min_x \big[ f(x) - \mu \langle e, \log(x) \rangle \big]$

converge to the solution of the problem

$\min_{x \in C} f(x)$

when μ ↘ 0.
Philippe Toint (Namur)

April 2009

180 / 323

Convex constraints and interior-point methods

Barriers and interior points

How it works. . .
Example: $\min_{x_1, x_2 \ge 0} \; 120\,[x_1^2 (x_1 - 1) - x_2 + 1]^2 + 10(4 + x_1)^2 - 150$

[Contour plots over the box shown by the axis values: the original objective function, then the objective plus the log barrier for μ = 50, 25, 10, 5 and 2; as μ decreases, the barrier minimizer moves towards the constrained minimizer]

Philippe Toint (Namur)

April 2009

181 / 323

Convex constraints and interior-point methods

Barriers and interior points

Other barriers: reciprocals


$b^{R(\gamma)}(x, \mu) = \mu \sum_{i=1}^{n} \frac{1}{[x]_i^{\gamma}}$

[Contour plots: μ = 2, log + R(·), R(1) and R(2)]


Philippe Toint (Namur)

April 2009

182 / 323

Convex constraints and interior-point methods

Barriers and interior points

The barrier function


$f(x) + b(x, \mu) \;=\; f(x) - \mu \langle e, \log(x) \rangle \quad \text{(log-barrier case)}$

Assume:
b(x, μ) is defined for all x ∈ ri{C} and all μ > 0, and is C²(ri{C}) w.r.t. x.
For all μ > 0 and all ε > 0, there exists κ_bbh(ε, μ) ≥ 1 such that ‖∇_xx b(x, μ)‖ ≤ κ_bbh(ε, μ) for all x ∈ C with dist(x, ∂C) ≥ ε.
lim_{p→∞} b(y_p, μ) = +∞ for all μ > 0 and all sequences {y_p}_{p=0}^∞ such that y_p ∈ ri{C} and lim_{p→∞} dist(y_p, ∂C) = 0.

Philippe Toint (Namur)

April 2009

183 / 323

Convex constraints and interior-point methods

Barriers and interior points

An elementary barrier algorithm


Algorithm 4.1: A simple barrier algorithm
Step 0: Initialization. Given: x_0 ∈ ri{C}, μ_0 > 0. Set k = 0.
Step 1: Inner minimization. (Approximately) solve the problem
min_x f(x) + b(x, μ_k)
by applying an unconstrained (inner) algorithm, starting from
a suitable starting point x_{k,0} ∈ ri{C}.
Let x_{k+1} be the corresponding (approximate) solution.
Step 2: Update the barrier parameter. Choose μ_{k+1} > 0 such that
lim_{k→∞} μ_k = 0.
Increment k by one and return to Step 1.


Philippe Toint (Namur)

April 2009

184 / 323
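The outer loop of Algorithm 4.1 is easy to prototype once an inner solver is available; in the sketch below, inner_minimize is a hypothetical unconstrained solver that keeps its iterates in the interior, and the geometric decrease of mu is just one common choice.

import numpy as np

def barrier_method(f, x0, inner_minimize, mu0=1.0, mu_factor=0.1, n_outer=8):
    """Elementary barrier loop for min f(x) subject to x >= 0 (sketch).

    inner_minimize(phi, x_start) is assumed to return an (approximate)
    minimizer of phi while keeping its iterates strictly positive.
    """
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(n_outer):
        def phi(y, mu=mu):                 # barrier function for the current mu
            return f(y) - mu * np.sum(np.log(y))
        x = inner_minimize(phi, x)         # approximate inner minimization
        mu *= mu_factor                    # drive the barrier parameter to 0
    return x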

Convex constraints and interior-point methods

Barriers and interior points

A first inner primal algorithm


Algorithm 4.2: Inner primal 1
Step 0: Initialization. Given: xk,0 ri{C}, k,0 , 1 , 2 , 1 , 2 , k
Compute (x0 , k ), set j = 0.

(0, 1).

Step 1: Model definition. Define mk,j of (xk,j + s, k ) in Bk,j of the form


f
b
mk,j (xk,j + s) = mk,j
(xk,j + s)+mk,j
(xk,j + s),
Step 2: Step calculation. Compute sk,j that sufficiently reduces mk,j and
such that xk,j + sk,j Bk,j .
Step 3: Acceptance of the trial point. If xk,j + sk,j 6 C or if dist(xk,j + sk,j , C) <
k dist(xk,j , C), set k,j = , xk,j+1 = xk,j and go to Step 4.
Otherwise compute (xk,j + sk,j , k ) and
(xk,j , k ) (xk,j + sk,j , k )
.
k,j =
mk,j (xk,j ) mk,j (xk,j + sk,j )
Then if k,j 1 , define xk,j+1 = xk,j + sk,j ; otherwise define xk,j+1 = xk,j .
Step 4: Trust-region radius update. Set
8
< [k,j , )
[2 k,j , k,j ]
k,j+1
:
[1 k,j , 2 k,j ]
Increment
Philippe Toint (Namur)

if k,j 2 ,
if k,j [1 , 2 ),
if k,j < 1 .

j by one and go to Step 1.


April 2009

185 / 323

Convex constraints and interior-point methods

Barriers and interior points

Models and assumptions


Use separate models for f and b!
f
b
mk,j (xk,j + s) = mk,j
(xk,j + s) + mk,j
(xk,j + s),

Assume:
k,  > 0, bbmh (, k ) 1 k, j 0,
b
kxx mk,j
(x, k )k bbmh (, k )

x Bk,j C such that dist(x, C) .


k, j 0 x Bk,j ri{C},
f
kxx mk,j
(x)k umh

Philippe Toint (Namur)

April 2009

186 / 323

Convex constraints and interior-point methods

Barriers and interior points

(Inner) convergence properties


There exists mdb (k) (0, 1) such that
dist(xk,j , C) mdb (k)
for all j. Moreover, for all j and all x such that kx xk,j k
(1 k )dist(xj , C), we have that
kxx b(x, )k bbh (k mdb (k), k )
and

b
kxx mk,j
(xk,j , )k bbmh (k mdb (k), k )

If k,j (1 k )mdb (k), then


|(xk,j + sk,j , k ) mk,j (xk,j + sk,j )| ubh (k)2k,j

. . . and all the nice convergence properties follow!


Philippe Toint (Namur)

April 2009

187 / 323

Convex constraints and interior-point methods

Barriers and interior points

Constrained Cauchy and eigen-points (1)


Idea: restrict the step, not the trust region!

xk,j

k dist(xk,j , C)
@
@
@
R
@

k,j

xk,j + sk,j
ri{C}

But . . . what of sufficient decrease ???


Philippe Toint (Namur)

April 2009

188 / 323

Convex constraints and interior-point methods

Barriers and interior points

Constrained Cauchy and eigen-points (2)

Redefine the Cauchy arc:


def

CC
xk,j
(t) = {x | x = xk,j tgk,j , t 0, tkgk,j k (1 k )dk,j and x Bk },

kgk,j k
mk,j (xk,j )mk,j (xk,j ) kgk,j k min
, k,j , (1 k )dk,j
k,j
CC

1
2

. . . etc, etc, etc . . .

Philippe Toint (Namur)

April 2009

189 / 323

Convex constraints and interior-point methods

Barriers and interior points

A second inner primal algorithm


Algorithm 4.3: Inner primal 2
Step 0: Initialization. Given: xk,0 ri{C}, k,0 , 1 , 2 , 1 , 2 , k (0, 1).
Compute (xk,0 , k ), set j = 0.
Step 1: Model definition.

f
b
Define mk,j (xk,j + s) = mk,j
(xk,j + s) + mk,j
(xk,j + s)

Step 2: Step calculation. Define dk,j = dist(xk,j , C). Compute sk,j such that
xk,j + sk,j Bk,j C and dist(xk,j + sk,j , C) k dk,j
and such that it sufficiently reduces mk,j
Compute (xk,j + sk,j , k ) and
(xk,j , k ) (xk,j + sk,j , k )
k,j =
.
mk,j (xk,j ) mk,j (xk,j + sk,j )
1 , define xk,j+1 = xk,j + sk,j ; otherwise define xk,j+1 = xk,j .

Step 3: Acceptance of the trial point.

Then if k,j

Step 4: Trust-region radius update. Set


8
< [k,j , )
[2 k,j , k,j ]
k,j+1
:
[1 k,j , 2 k,j ]
Increment j by one and go to Step 1.
Philippe Toint (Namur)

if k,j 2 ,
if k,j [1 , 2 ),
if k,j < 1 .

April 2009

190 / 323

Convex constraints and interior-point methods

Barriers and interior points

The log barrier and its derivatives


Return to:
$\min_{x \ge 0} f(x)$

The log barrier

$b(x, \mu) = -\mu \langle e, \log(x) \rangle$

giving the log-barrier function f(x) − μ⟨e, log(x)⟩.
Using the notation X = diag(x_1, . . . , x_n), we obtain that

$\nabla_x b(x, \mu) = -\mu X^{-1} e \qquad \text{and} \qquad \nabla_{xx} b(x, \mu) = \mu X^{-2}$

Philippe Toint (Namur)

April 2009

191 / 323
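These derivatives translate directly into code; a short sketch for the nonnegativity constraint x ≥ 0 (the Hessian being diagonal, only its diagonal is returned):

import numpy as np

def log_barrier(x, mu):
    """Value, gradient and Hessian diagonal of b(x, mu) = -mu * sum(log x_i)."""
    value = -mu * np.sum(np.log(x))
    grad = -mu / x                  # -mu * X^{-1} e
    hess_diag = mu / x ** 2         # diagonal of mu * X^{-2}
    return value, grad, hess_diag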

Convex constraints and interior-point methods

Barriers and interior points

The primal log-barrier algorithm


Algorithm 4.4: Primal log-barrier algorithm
Step 0: Initialization. Given: x_0 > 0, μ_0 > 0, and the forcing functions ε^D(·) and ε^E(·). Set k = 0.
Step 1: Inner minimization. Choose a value in (0, 1). Approximately minimize the log-barrier function f(x) − μ_k⟨e, log(x)⟩ starting from x_k and using an inner algorithm in which

$m_{k,j}^{b}(x_{k,j} + s) = -\mu_k \langle e, \log(x_{k,j}) \rangle - \mu_k \langle X_{k,j}^{-1} e, s \rangle + \tfrac{1}{2} \mu_k \langle s, X_{k,j}^{-2} s \rangle$

Stop this algorithm as soon as an iterate x_{k,j} = x_{k+1} is found for which

$\|\nabla_x f(x_{k+1}) - \mu_k X_{k+1}^{-1} e\| \le \epsilon^D(\mu_k), \qquad \lambda_{\min}\big[\nabla_{xx} f(x_{k+1}) + \mu_k X_{k+1}^{-2}\big] \ge -\epsilon^E(\mu_k)$

and x_{k+1} > 0.

Step 2: Update the barrier parameter. Choose μ_{k+1} > 0 such that lim_{k→∞} μ_k = 0. Increment k by one and return to Step 1.
Philippe Toint (Namur)

April 2009

192 / 323

Convex constraints and interior-point methods

Barriers and interior points

Convergence of the primal log-barrier algorithm (1)


OK for first order! . . . but existence of limit points not guaranteed
Define
A subsequence {xkj } is consistently active w.r.t. the bounds if,
for each i = 1, . . . , n, either
lim [xkj ]i = 0 or lim inf [xkj ]i > 0.

(Each bound constraint is asymptotically active or inactive for the


complete subsequence.)
def

A{xkj } = {i {1, . . . , n} | lim [xkj ]i = 0}.


j

Note: finite number of such subsequences a partition of {xk }


Philippe Toint (Namur)

April 2009

193 / 323

Convex constraints and interior-point methods

Barriers and interior points

Convergence of the primal log-barrier algorithm (2)


Finally,
Under appropriate assumptions,
lim inf [x f (xk )]i 0,
k

(i = 1, . . . , n).

Furthermore, for every consistently active subsequence {xk` },


lim [x f (xk` )]i = 0,

(i 6 A{xk` })

and
lim inf hu, [xx f (xk` )]ui 0
`

for each u | [u]i = 0 whenever i A{xk` }.

Philippe Toint (Namur)

April 2009

194 / 323

Convex constraints and interior-point methods

Barriers and interior points

The primal-dual framework (1)


In practice, as x_k ↘ 0, ∇_xx m_{k,j}(x_{k,j}) + μ_k X_{k,j}^{-2} causes slow progress.

Idea: replace this by

$\nabla_{xx} m_{k,j}(x_{k,j}) + X_{k,j}^{-1} Z_{k,j}$

where Z_{k,j} is a bounded positive diagonal.

Alternatively: KKT conditions for the original problem:

$\nabla_x m(x) - z = 0, \quad XZe = 0, \quad x \ge 0, \quad z \ge 0,$

Perturb:

$\nabla_x m(x) - z = 0, \quad XZe = \mu e, \quad x \ge 0, \quad z \ge 0.$

Philippe Toint (Namur)

April 2009

195 / 323

Convex constraints and interior-point methods

Barriers and interior points

The primal-dual framework (2)


Now write Newton's method for the perturbed problem:

$\nabla_{xx} m_{k,j}(x_{k,j}) \Delta x_{k,j} - \Delta z_{k,j} = -g_{k,j} + z_{k,j},$
$X_{k,j} \Delta z_{k,j} + Z_{k,j} \Delta x_{k,j} = \mu_k e - X_{k,j} Z_{k,j} e,$
$x_{k,j} + \Delta x_{k,j} \ge 0, \quad z_{k,j} + \Delta z_{k,j} \ge 0.$

Substituting the 2nd equation into the 1st:

$\big[ \nabla_{xx} m_{k,j}(x_{k,j}) + X_{k,j}^{-1} Z_{k,j} \big] \Delta x_{k,j} = -\big( g_{k,j} - \mu_k X_{k,j}^{-1} e \big)$

But g_{k,j} − μ_k X_{k,j}^{-1} e is exactly the gradient of the log-barrier function at x_{k,j}. Hence

$\big[ \nabla_{xx} m_{k,j}(x_{k,j}) + X_{k,j}^{-1} Z_{k,j} \big] \Delta x_{k,j} = -\nabla_x \big( f(x) - \mu_k \langle e, \log(x) \rangle \big) \big|_{x = x_{k,j}}$

Philippe Toint (Namur)

April 2009

196 / 323
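For the bound-constrained case, the condensed primal-dual system above can be formed and solved directly; the dense sketch below uses H for ∇_xx m and adds a standard fraction-to-the-boundary safeguard, which is my addition and not spelled out on this slide.

import numpy as np

def primal_dual_step(H, g, x, z, mu, tau=0.995):
    """One primal-dual Newton step for min f(x) subject to x >= 0 (dense sketch)."""
    # condensed system: (H + X^{-1} Z) dx = -(g - mu X^{-1} e)
    M = H + np.diag(z / x)
    dx = np.linalg.solve(M, -(g - mu / x))
    # recover dz from  X dz + Z dx = mu e - X Z e
    dz = (mu - x * z - z * dx) / x
    # fraction-to-the-boundary rule keeping x > 0 and z > 0
    alpha = 1.0
    for v, dv in ((x, dx), (z, dz)):
        neg = dv < 0
        if np.any(neg):
            alpha = min(alpha, tau * np.min(-v[neg] / dv[neg]))
    return x + alpha * dx, z + alpha * dz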

Convex constraints and interior-point methods

Barriers and interior points

The primal-dual inner algorithm (1)


Algorithm 4.5: Inner primal-dual algorithm
Step 0: Initialization. Given: xk,0 ri{C}, zk,0 > 0 , k,0 , 1 , 2 , 1 ,2 , k .
Compute f (xk,0 ), set j = 0.
Step 1: Model definition. In Bk,j , define
h
i
1
f
1
mk,j (xk,j + s) = mk,j
(xk,j + s) k he, log(xk,j )i + hXk,j
e, si 21 hs, Xk,j
Zk,j si
Step 2: Step calculation. Define dk,j = dist(xk,j , C). Compute a step sk,j such
that xk,j + sk,j Bk,j , dist(xk,j + sk,j , C) k dk,j , and
(

"

mk,j (xk,j )mk,j (xk,j +sk,j ) max kgk,j k min

kgk,j k
k,j

Step 3: Acceptance of the trial point.


k,j =

#
, k,j , (1 k )dk,j

)
h
i
2
2
2 2
, k,j min k,j , k,j , (1k ) dk,j

Compute log (xk,j + sk,j , k ) and

log (xk,j , k ) log (xk,j + sk,j , k )


.
mk,j (xk,j ) mk,j (xk,j + sk,j )

If k,j 1 , then xk,j+1 = xk,j + sk,j , else xk,j+1 = xk,j .


Philippe Toint (Namur)

April 2009

197 / 323

Convex constraints and interior-point methods

Barriers and interior points

The primal-dual inner algorithm (2)

Algorithm 4.6: Inner primal-dual algorithm (2)

Step 4: Trust-region radius update. Set


8
< [k,j , )
[2 k,j , k,j ]
k,j+1
:
[1 k,j , 2 k,j ]

if k,j 2 ,
if k,j [1 , 2 ),
if k,j < 1 .

Step 5: Update the dual variables. Set zk,j+1 > 0. Increment j by one,go to
Step 1.

Philippe Toint (Namur)

April 2009

198 / 323

Convex constraints and interior-point methods

Barriers and interior points

The primal-dual outer algorithm

Algorithm 4.7: Outer primal-dual algorithm


Step 0: Initialization. Given: x0 > 0, z0 > 0, 0 > 0 and the forcing functions
D (), E (), C (). Set k = 0.
Step 1: Inner minimization. Choose k (0, 1). Approximately minimize log (x, k )
from xk using the primal-dual inner algorithm. Stop as soon as an iterate
(xk,j , zk,j ) = (xk+1 , zk+1 ) is found for which
kx f (xk+1 ) zk+1 k D (k ),
kXk+1 Zk+1 k I k C (k ),
and

1
min [xx f (xk+1 ) + Xk+1
Zk+1 ] E (k )

xk+1 > 0 and zk+1 > 0.


Step 3: Update the barrier parameter. Choose k+1 > 0 such that limk k = 0.
Increment k by one and return to Step 1.
1
Note: choosing zk,j = k Xk,j
e primal algorithm!
Philippe Toint (Namur)

April 2009

199 / 323

Convex constraints and interior-point methods

Barriers and interior points

Updating the dual variables


How to compute zk,j+1 in practice? Newton equations give
1
1
z k,j+1 = k Xk,j
e Xk,j
Zk,j sk,j .

. . . but what about zk,j+1 0?


Define

"
I = zul min

1
e, zk,j , k Xk,j+1
e

, zuu max

1
e, zk,j , 1
k e, k Xk,j+1 e

and choose

zk,j+1 =
Philippe Toint (Namur)

PI [z k,j+1 ]
zk,j

if xk,j+1 = xk,j + sk,j


if xk,j+1 = xk,j ,
April 2009

200 / 323

Convex constraints and interior-point methods

Barriers and interior points

Properties of the dual variables


Then zk,j+1 > 0 and

[zk,j ]i uzi max

1
[xk,j ]i


,1 .

If, furthermore,
lim ksk,j k = 0 when

lim kgk,j k = 0

then



1
lim zk,j k Xk,j
e = 0 if

lim kgk,j k = 0.

asymptotically exact barrier Hessian for fixed


Philippe Toint (Namur)

April 2009

201 / 323

Convex constraints and interior-point methods

Barriers and interior points

Scaling of the inner iterations


In practice, scaling is crucial!
Ideally,
k kk,j = k kxx mk,j (xk,j ) =

q
1
h, [Hk,j + Xk,j
Zk,j ]i

Under the usual assumptions, k kk,j is uniformly equivalent to


the Euclidean norm for fixed k.

xk,j
xk,j + sk,j
ri{C}
all usual convergence properties for fixed k
Philippe Toint (Namur)

April 2009

202 / 323

Convex constraints and interior-point methods

Barriers and interior points

Scaling of the outer iterations (1)

Scaled tests:
kx f (xk+1 ) zk+1 k[k+1] D (k )
kXk+1 Zk+1 k I k2 C (k ),

min

12
Mk+1
(xx f (xk+1 )

with

21
1
Xk+1
Zk+1 )Mk+1

E (k ),

def

1
Mk+1 = Hk+1 + Xk+1
Zk+1

But this matrix is unbounded when k % !

Philippe Toint (Namur)

April 2009

203 / 323

Convex constraints and interior-point methods

Barriers and interior points

Scaling of the outer iterations (2)


Fortunately,
Under the usual assumptions, the convergence properties are
preserved if
D (k )
lim

k
k
and

C (k ) k
lim
= 0.
k mini [xk+1 ]i

Moreover
If exact derivatives are used, the  (k ) can be chosen to ensure
componentwise near quadratic rate of convergence.
This is quite remarkable!
Philippe Toint (Namur)

April 2009

204 / 323

Convex constraints and interior-point methods

Barriers and interior points

Barriers for general convex constraints


Now,
log (x, ) = f (x) he, log(c(x))i
The primal-dual model becomes
f
b
mk,j (xk,j + sk,j ) = mk,j
(xk,j + sk,j ) + mk,j
(xk,j + sk,j ),

with
b
mk,j
(xk,j + sk,j )

k he, log(c(xk,j ))i k hC 1 (xk,j )e, A(xk,j )sk,j i


+ 12 hA(xk,j )sk,j , [C 1 (xk,j )Yk,j ]A(xk,j )sk,j i
12

m
X
[yk,j ]i hsk,j , xx ci (xk,j )sk,j i
i=1

Quite a mouthful. . . but otherwise everything is OK!

Philippe Toint (Namur)

April 2009

205 / 323

Convex constraints and interior-point methods

Barriers and interior points

The outer primal-dual algorithm for convex constraints


xx `(xk,j , yk,j ) = xx f (xk,j )

m
X
[yk,j ]i xx ci (xk,j )

def

Gk,j = A (xk,j )C

(xk,j )Yk,j A(xk,j )

i=1

Algorithm 4.8: Primal-dual algorithm for convex constraints


Step 0: Initialization Given: x0 | c(x0 ) > 0, y0 > 0, 0 > 0, C (), D () and E ().
Set k = 0.
Step 1: Inner minimization Choose k (0, 1). Approximately minimize
log (x, k ) = f (x) k he, log(c(x))i
from xk . Stop as soon as (xk,j , yk,j ) = (xk+1 , yk+1 ) is found such that
kx f (xk+1 ) AT (xk+1 )yk+1 k D (k ),
kC (xk+1 )Yk+1 e k I k C (k ),
min [xx `(xk+1 , yk+1 ) + Gk+1 ] E (k )
and
(c(xk+1 ), yk+1 ) 0.
Step 3: Update the barrier parameter. Choose k+1 > 0 such that
limk k = 0. Increment k by one and return to Step 1.
Philippe Toint (Namur)

April 2009

206 / 323

Convex constraints and interior-point methods

Bibliography

Bibliography for lesson 4 (1)


1

J. V. Burke,
On the identification of active constraints,
SIAM Journal on Numerical Analysis, 25(5):1197-1211, 1988.
J. V. Burke,
On the identification of active constraints II: the nonconvex case,
SIAM Journal on Numerical Analysis, 27(4):1081-1102, 1990.
J. V. Burke and J. J. Moré,
Exposing Constraints,
SIAM Journal on Optimization, 4(3):573-595, 1994.
J. V. Burke, J. J. Moré and G. Toraldo,
Convergence properties of trust region methods for linear and convex constraints,
Mathematical Programming A, 47(3):305-336, 1990.
A. R. Conn, N. I. M. Gould, D. Orban and Ph. L. Toint,
A primal-dual trust-region algorithm for minimizing a non-convex function subject to bound and linear equality
constraints,
Mathematical Programming, 87(2):215-249, 2000.
N. I. M. Gould, D. Orban, A. Sartenaer and Ph. L. Toint,
Superlinear Convergence of Primal-Dual Interior Point Algorithms for Nonlinear Programming,
SIAM Journal on Optimization, 11(4):974-1002, 2001.
A. R. Conn, N. I. M. Gould and Ph. L. Toint,
Global convergence of a class of trust region algorithms for optimization with simple bounds,
SIAM Journal on Numerical Analysis, 25(182):433-460, 1988.
A. R. Conn, N. I. M. Gould and Ph. L. Toint,
LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A),
Springer Verlag, Heidelberg, 1992.
A. R. Conn, N. I. M. Gould, A. Sartenaer and Ph. L. Toint,
Global convergence of a class of trust region algorithms for optimization using inexact projections on convex
constraints,
SIAM Journal on Optimization, 3(1):164-221, 1993.
Philippe Toint (Namur)

April 2009

207 / 323

Convex constraints and interior-point methods

Bibliography

Bibliography for lesson 4 (2)

10 A. V. Fiacco and G. P. McCormick,


The Sequential Unconstrained Minimization Technique for Nonlinear Programming: a Primal-Dual Method,
Management Science, 0(2):360-366, 1964.
11 A. V. Fiacco and G. P. McCormick,
Nonlinear Programming: Sequential Unconstrained Minimization Techniques,
Wiley and Sons, Chichester (UK), 1968.
12 Ph. L. Toint,
Global convergence of a class of trust region methods for nonconvex minimization in Hilbert space,
IMA Journal of Numerical Analysis, 8(2):231-252, 1988.

Philippe Toint (Namur)

April 2009

208 / 323

The use of problem structure for large-scale applications

Lesson 5:
Sparsity, partial separability
and multilevel methods:
exploiting problem structure

Philippe Toint (Namur)

April 2009

209 / 323

The use of problem structure for large-scale applications

Outline

Sparsity and partial separability

Multilevel problems

Bibliography

Philippe Toint (Namur)

April 2009

210 / 323

The use of problem structure for large-scale applications

Sparsity

5.1: Sparsity and


partial separability

Philippe Toint (Namur)

April 2009

211 / 323

The use of problem structure for large-scale applications

Sparsity

Sparsity
A matrix is sparse when the proportion and/or distribution of
its zero entries allows its efficient numerical usage
An (oriented) graph is associated with every sparse (non)symmetric matrix

Philippe Toint (Namur)

April 2009

212 / 323

The use of problem structure for large-scale applications

Sparsity

Main benefits of sparsity


Sparsity and optimization: Hessian (and) Jacobian matrices

very important time/space savings in solving Newton's equations (unconstrained or KKT)
1. factorizations (reduced fill-in)
2. iterative methods (fast matrix-vector products)

sometimes important in approximation schemes
1. derivative-free methods (makes the number of function evaluations linear in the number of variables)
2. finite-difference approximations
3. quasi-Newton methods

a path for parallel computations

exploiting sparsity = an active algorithmic industry!
Philippe Toint (Namur)

April 2009

213 / 323

The use of problem structure for large-scale applications

Sparsity

The Curtis-Powell-Reid algorithm for estimating sparse


Jacobians

Finite differences for a Jacobian column:


$J e_i \approx \frac{c(x + h e_i) - c(x)}{h}$

Question: How many finite differences for estimating a 5 × 5 Jacobian with the structure:

[Figure: 5 × 5 sparsity pattern of the Jacobian]

Philippe Toint (Namur)

April 2009

214 / 323


The use of problem structure for large-scale applications

Sparsity

The Curtis-Powell-Reid algorithm for estimating sparse


Jacobians
Finite differences for a Jacobian column:

$J e_i \approx \frac{c(x + h e_i) - c(x)}{h}$

Question: How many finite differences for estimating a 5 × 5 Jacobian with the structure shown above?

$J(e_1 + e_4) \approx \frac{c(x + h e_1 + h e_4) - c(x)}{h}, \quad J(e_2 + e_3) \approx \frac{c(x + h e_2 + h e_3) - c(x)}{h}, \quad J e_5 \approx \frac{c(x + h e_5) - c(x)}{h}$

Answer: 3 finite-differences!

Curtis, Powell and Reid (1974), Steihaug et al.

Philippe Toint (Namur)


April 2009

214 / 323

The use of problem structure for large-scale applications

Sparsity

The CPR algorithm for estimating sparse Jacobians


Algorithm 5.1: CPR algorithm

Build the column groups.
Place the columns in as few groups as possible such that two columns in the same group have their nonzero entries in different rows.

Estimate the finite differences. For each group:
1. Build a difference vector $h = \sum_{j \in \text{group}} h_j e_j$
2. Compute $v = c(x + h) - c(x)$

Reconstruct the Jacobian.
$J_{ij} \approx \frac{v_i}{h_j}$ for all $j$ such that $j \in$ group (row i has its group nonzero in column j only)

Philippe Toint (Namur)

April 2009

215 / 323
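A compact prototype of the whole CPR scheme, with a greedy grouping of the columns (a simple version of the colouring heuristics of the next slides) followed by one finite difference per group; the sparsity pattern is assumed to be given as a boolean matrix and c to be any smooth mapping.

import numpy as np

def cpr_jacobian(c, x, pattern, h=1e-6):
    """Estimate a sparse Jacobian by the Curtis-Powell-Reid scheme (sketch).

    pattern : boolean (m, n) array, True wherever J may be nonzero
    """
    m, n = pattern.shape

    # greedy grouping: two columns share a group only if their nonzero rows are disjoint
    groups, group_rows = [], []
    for j in range(n):
        rows = pattern[:, j]
        for grp, used in zip(groups, group_rows):
            if not np.any(rows & used):
                grp.append(j)
                used |= rows
                break
        else:
            groups.append([j])
            group_rows.append(rows.copy())

    # one difference per group, then scatter the entries back
    J = np.zeros((m, n))
    c0 = np.asarray(c(x))
    for grp in groups:
        d = np.zeros(n)
        d[grp] = h                              # perturb all columns of the group at once
        v = (np.asarray(c(x + d)) - c0) / h
        for j in grp:
            rows = pattern[:, j]                # these rows can only come from column j
            J[rows, j] = v[rows]
    return J

The number of evaluations of c is 1 + (number of groups); for the 5 × 5 example above this gives 3 differences, as on the slide.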

The use of problem structure for large-scale applications

Sparsity

A graph colouring interpretation

Consider the intersection graph for the columns
(two columns are adjacent iff they have a nonzero entry in a common row):

minimize the number of colours,
such that adjacent nodes have different colours
⇒ can build column groups using heuristic algorithms for graph colouring
Coleman and Moré (1983)
Philippe Toint (Namur)

April 2009

216 / 323

The use of problem structure for large-scale applications

Sparsity

Estimating sparse Hessians (1)


Question: How many finite differences for estimating an 8 × 8 symmetric
Hessian with the structure shown on the slide (8 × 8 sparsity pattern)?

Philippe Toint (Namur)

April 2009

217 / 323


The use of problem structure for large-scale applications

Sparsity

Estimating sparse Hessians (1)

Exploiting symmetry in CPR (⇒ a direct method)

Powell and T (1979), Coleman and Moré (1984)

Philippe Toint (Namur)

April 2009

217 / 323


The use of problem structure for large-scale applications

Sparsity

Estimating sparse Hessians (2)

Question: Can we do better?
(same 8 × 8 symmetric sparsity pattern as before)

Apply CPR on the lower triangular part of the Hessian.

But what about the conflicts with the upper triangular part?

Philippe Toint (Namur)

April 2009

218 / 323

The use of problem structure for large-scale applications

Sparsity

Estimating sparse Hessians (2)

⇒ A more efficient substitution method. . .

Powell and T (1979), Coleman and Moré (1984) for a graph interpretation
Philippe Toint (Namur)

April 2009

218 / 323


The use of problem structure for large-scale applications

Sparsity

Optimized version for PDE stencils

Example: the 5-points Laplacian operator in 2D


(non-symmetric and symmetric)

Philippe Toint (Namur)

April 2009

219 / 323


The use of problem structure for large-scale applications

Sparsity

Partial separability
A more geometric concept:

Griewank and T. (1982)

f (x) is partially separable iff


f(x) = Σ_{i=1}^p f_i(U_i x), where the matrices U_i are of low rank

if U_i = disjoint columns of the identity matrix ⇒ (totally) separable

common case: U_i = overlapping columns of the identity matrix ⇒

f(x) = Σ_{i=1}^p f_i(x_{S_i})

Vocabulary:
element functions, element variables, internal variables ui = Ui x
Philippe Toint (Namur)

April 2009

220 / 323

The use of problem structure for large-scale applications

Sparsity

Sources and examples of partially separable functions


Example 1:
f(x_1, . . . , x_5) = f_1(x_1, x_2) + f_2(x_2, x_3, x_4) + f_3(x_4, x_5)
Example 2:
f(x_1, . . . , x_5) = f_1(3x_1 + x_2) + f_2(2x_2 + x_3 − 2x_4, x_4 + 3x_5)
with internal variables u_1 = 3x_1 + x_2, u_2 = 2x_2 + x_3 − 2x_4, u_3 = x_4 + 3x_5
Sources:
(nearly) all discretized problems
most problems in econometric modelling,
. . . and a lot more because. . .

Philippe Toint (Namur)

April 2009

221 / 323

The use of problem structure for large-scale applications

Sparsity

Properties of partially separable functions


If f(x) has a sparse Hessian matrix and is sufficiently smooth,
then it is partially separable
(but not conversely, e.g. f(x_1, . . . , x_n) = Σ_{i=1}^n f_i(x_i) + f_{n+1}(x_1 + · · · + x_n))

If f(x) = Σ_{i=1}^p f_i(U_i x) = Σ_{i=1}^p f_i(u_i), then

∇_x f(x) = Σ_{i=1}^p U_i^T ∇ f_i(u_i)

∇_xx f(x) = Σ_{i=1}^p U_i^T ∇_xx f_i(u_i) U_i

(easy to compute, sparsity determined by the U_i)

Philippe Toint (Namur)

April 2009

222 / 323
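These assembly formulas translate directly into code. The sketch below is my own illustration (not from the course material): it evaluates a partially separable f, its gradient and its Hessian from element functions f_i and low-rank matrices U_i.

import numpy as np

def assemble(elements, x):
    """elements: list of (U_i, f_i) pairs, where U_i is (n_i, n) and
    f_i(u) returns (value, gradient, Hessian) in the internal variables u = U_i x."""
    n = x.size
    f, g, H = 0.0, np.zeros(n), np.zeros((n, n))
    for U, fi in elements:
        u = U @ x                       # internal variables of element i
        fi_val, fi_grad, fi_hess = fi(u)
        f += fi_val
        g += U.T @ fi_grad              # grad f = sum_i U_i^T grad f_i(u_i)
        H += U.T @ fi_hess @ U          # Hess f = sum_i U_i^T Hess f_i(u_i) U_i
    return f, g, H

# Example: f(x) = (x1 - x2)^2 + (x2 - x3)^2 on IR^3 (two elements, u_i = x_i - x_{i+1})
def sq(u):                              # element u -> u^2 with its derivatives
    return float(u[0] ** 2), np.array([2 * u[0]]), np.array([[2.0]])

elems = [(np.array([[1.0, -1.0, 0.0]]), sq), (np.array([[0.0, 1.0, -1.0]]), sq)]
print(assemble(elems, np.array([1.0, 2.0, 4.0])))   # f = 1 + 4 = 5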

The use of problem structure for large-scale applications

Sparsity

The three points Laplacian operator

On a regular geometric grid, the matrix tridiag(−1, 2, −1) decomposes as a sum of
overlapping 2 × 2 element blocks of the form

[  1  −1 ]
[ −1   1 ]

placed on consecutive index pairs (i, i+1), plus boundary contributions.

Note: sum of rank-one submatrices (u_i = x_{i+1} − x_i)!

Philippe Toint (Namur)

April 2009

223 / 323

The use of problem structure for large-scale applications

Sparsity

Using the partially separable structure

Very useful for:


quasi-Newton Hessian matrix = sum of elementwise quasi-Newton
low rank submatrices (partitioned updating),
elementwise models in DFO (number of function evaluations only
dependent on the maximum number of internal variables!),
optimally efficient finite-difference approximations,
(structured trust-regions),
expressing large-scale models.
LANCELOT based on an extension of this concept

Philippe Toint (Namur)

April 2009

224 / 323

The use of problem structure for large-scale applications

Sparsity

Exploitation of the computational tree

Idea: use the computational tree for f(x) for solving Newton's equations
use chain-rule at the top of the computational tree
multiplicative decompositions (and partially separable)
often available from the problem modelling itself
Substantial computational gains
unpublished (?) by T. Coleman (2008)

Philippe Toint (Namur)

April 2009

225 / 323

The use of problem structure for large-scale applications

Multilevel problems

5.3: Multilevel problems

Philippe Toint (Namur)

April 2009

226 / 323

The use of problem structure for large-scale applications

Multilevel problems

Multilevel Optimization: The Problem

min_{x ∈ IR^n} f(x)

f : IR^n → IR nonlinear, C^2 and bounded below

No convexity assumption
Results from the discretization of some infinite-dimensional problem,
on a relatively fine grid for instance (n large)

Iterative search of a first-order critical point x* (s.t. ∇f(x*) = 0)

Philippe Toint (Namur)

April 2009

227 / 323

The use of problem structure for large-scale applications

Multilevel problems

Hierarchy of problem descriptions


Assume now that a hierarchy of problem descriptions is available, linked by
known operators
Finest problem description
Restriction R

P Prolongation

Fine problem description


Restriction R

P Prolongation
...

Restriction R

P Prolongation

Coarse problem description


Restriction R

P Prolongation

Coarsest problem description


Philippe Toint (Namur)

April 2009

228 / 323

The use of problem structure for large-scale applications

Multilevel problems

Grid transfer operators


Restriction    R_i : IR^{n_i} → IR^{n_{i−1}}

Prolongation   P_i : IR^{n_{i−1}} → IR^{n_i}

with R_i = P_i^T

Philippe Toint (Namur)

April 2009

229 / 323
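As a concrete illustration (my own, for a 1D grid; the course does not fix a particular choice of operators), linear interpolation gives a prolongation and its transpose a restriction:

import numpy as np

def prolongation_1d(n_coarse):
    """Linear interpolation from a coarse 1D grid with n_coarse interior points
    to the fine grid with n_fine = 2 * n_coarse + 1 interior points."""
    n_fine = 2 * n_coarse + 1
    P = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        i = 2 * j + 1          # fine index of the j-th coarse point
        P[i, j] = 1.0          # coarse value copied
        P[i - 1, j] += 0.5     # neighbours obtained by interpolation
        P[i + 1, j] += 0.5
    return P

P = prolongation_1d(3)         # 3 coarse points -> 7 fine points
R = P.T                        # restriction taken as the transpose, as on the slide
                               # (some multigrid codes add a scaling factor here)
x_coarse = np.array([1.0, 2.0, 3.0])
print(P @ x_coarse)            # prolongated vector on the fine grid
print(R @ (P @ x_coarse))      # restricted back to the coarse grid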

The use of problem structure for large-scale applications

Multilevel problems

Sources for such problems

Parameter estimation in
discretized ODEs
discretized PDEs
Optimal control problems
Optimal surface design (shape optimization)
Data assimilation in weather forecast (different levels of physics in the
models)

Philippe Toint (Namur)

April 2009

230 / 323

The use of problem structure for large-scale applications

Multilevel problems

The minimum surface problem


min_v ∫_0^1 ∫_0^1 [ 1 + (∂_x v)^2 + (∂_y v)^2 ]^{1/2} dx dy

with the boundary conditions:

v = f(x) on y = 0, 0 ≤ x ≤ 1
v = 0    on x = 0, 0 ≤ y ≤ 1
v = f(x) on y = 1, 0 ≤ x ≤ 1
v = 0    on x = 1, 0 ≤ y ≤ 1

where f(x) = x(1 − x)

Discretization using a finite element basis

Philippe Toint (Namur)

April 2009

231 / 323

The use of problem structure for large-scale applications

Multilevel problems

The solution at different levels

(plots of the converged solutions on the successive grids, from the coarsest,
n = 3^2 = 9, through n = 7^2 = 49, 15^2 = 225, 31^2 = 961 and 63^2 = 3969,
to the finest, n = 127^2 = 16129)

Philippe Toint (Namur)
April 2009

232 / 323

The use of problem structure for large-scale applications

Multilevel problems

The main issue


Hierarchy of problem descriptions

globalization technique

&

Efficiency Robustness

Illustration within a trust-region framework

(Unconstrained case)

Philippe Toint (Namur)

April 2009

233 / 323

The use of problem structure for large-scale applications

Multilevel problems

Past and recent developments


Line-search
Fisher (1998), Frese-Bouman-Sauer (1999), Nash (2000)
Lewis-Nash (2000, 2002)
Oh-Milstein-Bouman-Webb (2003)
Wen-Goldfarb (2007, 2008)
Gratton-T (2007)

Trust-region
Gratton-Sartenaer-T (2006, 2008)
Gratton-Mouffe-T-Weber Mendonca (2009)
Gratton-Mouffe-Sartenaer-T-Tomanos (2009)
T-Tomanos-Weber Mendonca (2009)
Gross-Krause (2008)

Philippe Toint (Namur)

April 2009

234 / 323

The use of problem structure for large-scale applications

Multilevel problems

On the side of multigrid methods


Consider the linear system (discrete Poisson equation, for instance):

A x = b    (exact solution x)

and the residual equation

A e = r

where e = x − x̄ (error of the approximation x̄) and r = b − A x̄ (residual).

Expansion of e in Fourier modes shows high (oscillatory) and low (smooth)
frequency components:

(plot of two Fourier modes, k = 3 and k = 24)

Philippe Toint (Namur)

April 2009

235 / 323

The use of problem structure for large-scale applications

Multilevel problems

Relaxation methods
Basic iterative methods:
correct the i th component of xk in the order 1, 2, . . . , n
to annihilate the i th component of rk
Jacobi

[x_{k+1}]_i = (1 / a_ii) ( − Σ_{j=1, j≠i}^n a_ij [x_k]_j + [b]_i )

Gauss-Seidel

[x_{k+1}]_i = (1 / a_ii) ( − Σ_{j=1}^{i−1} a_ij [x_{k+1}]_j − Σ_{j=i+1}^n a_ij [x_k]_j + [b]_i )

Solve the equations of the linear system one by one

Philippe Toint (Namur)

April 2009

236 / 323
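A minimal sketch of the Gauss-Seidel sweep described by the formula above (illustrative only):

import numpy as np

def gauss_seidel(A, b, x, sweeps=10):
    """Perform `sweeps` Gauss-Seidel relaxation sweeps on A x = b, updating x in place."""
    n = len(b)
    for _ in range(sweeps):
        for i in range(n):
            # use already-updated components for j < i, old ones for j > i
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
    return x

# Example: 1D Laplacian tridiag(-1, 2, -1); the sweeps damp the oscillatory error quickly
n = 32
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.zeros(n)
x = np.random.rand(n)          # random initial error
gauss_seidel(A, b, x, sweeps=10)
print(np.linalg.norm(x))       # high-frequency components are strongly reduced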

The use of problem structure for large-scale applications

Multilevel problems

Smoothing effect
Very effective methods at smoothing, i.e., eliminating the
high-frequency (oscillatory) components of the error:

0.2

0.02

1.5

0.15

0.015

0.1

0.01

0.5

0.05

0.005

0
40

30
30

20
20
10

10
0

error of
initial guess

0
40
30

40
30

20
20

10

10
0

error after 10
GS iterations

30

40
30

20
20

10

10
0

error after 100


GS iterations

But they leave the low-frequency (smooth) components relatively


unchanged
Philippe Toint (Namur)

April 2009

237 / 323

The use of problem structure for large-scale applications

Multilevel problems

Multigrid in linear algebra

Assume now (two levels):

A fine grid (f) description      A^f e^f = r^f

A coarse grid (c) description    A^c e^c = r^c

Linked by transfer operators     A^c = R A^f P,  e^c = R e^f,  r^c = R r^f

Philippe Toint (Namur)

April 2009

238 / 323

The use of problem structure for large-scale applications

Multilevel problems

Coarse grid principle

Smooth error modes on a fine grid


look less smooth on a coarse grid
When relaxation begins to stall at the finer level:
Move to the coarser grid where the smooth error modes appear more
oscillatory
Apply a relaxation at the coarser level:
more efficient
substantially less expensive

Philippe Toint (Namur)

April 2009

239 / 323

The use of problem structure for large-scale applications

Multilevel problems

Two-grid correction scheme

Annihilate oscillatory error level by level:

smooth on the fine grid, restrict (R) the remaining smooth error to the coarse grid,
where it appears oscillatory, smooth there (or recur to a coarser level),
then prolongate (P) the correction back to the fine grid and smooth again.

Note: P and R are not orthogonal projectors!

A very efficient method for some linear systems
(when A(smooth modes) ≈ smooth modes)

Philippe Toint (Namur)

April 2009

240 / 323
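A compact sketch of one two-grid correction cycle; the smoother (e.g. the Gauss-Seidel sketch above) is passed in, and the coarse operator is built as on the slides. This is an illustration of the idea, not the course's code.

import numpy as np

def two_grid_cycle(A, b, x, P, smooth, pre=3, post=3):
    """One two-grid correction cycle for A x = b.

    P      : prolongation matrix; the restriction is taken as R = P.T (as on the slides)
    smooth : relaxation routine called as smooth(A, b, x, sweeps)
    """
    R = P.T
    x = smooth(A, b, x, pre)              # pre-smoothing kills the oscillatory error
    r = b - A @ x                         # fine-grid residual
    Ac = R @ A @ P                        # coarse operator A^c = R A^f P
    ec = np.linalg.solve(Ac, R @ r)       # solve the coarse residual equation (or recur)
    x = x + P @ ec                        # prolongate the coarse correction
    x = smooth(A, b, x, post)             # post-smoothing
    return x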

The use of problem structure for large-scale applications

Multilevel problems

Does it work?
Smoothing on fine grid only:
(error plots after 0, 10 and 100 iterations)

Two-grid correction scheme:
(error plots after 0, 10 and 100 iterations)

Philippe Toint (Namur)
April 2009

241 / 323

The use of problem structure for large-scale applications

Multilevel problems

V-cycle

(schematic of a V-cycle: smoothing at each level on the way down to the coarsest grid and on the way back up)
Philippe Toint (Namur)

April 2009

242 / 323

The use of problem structure for large-scale applications

Multilevel problems

W-cycle

(schematic of a W-cycle: as the V-cycle, but visiting the coarser levels more than once per cycle)

Philippe Toint (Namur)

April 2009

243 / 323

The use of problem structure for large-scale applications

Multilevel problems

Mesh Refinement

Solve the problem on the coarsest level


Good starting point for the next fine level

Do the same on each level


Good starting point for the finest level

Finally solve the problem on the finest level

Philippe Toint (Namur)

April 2009

244 / 323

The use of problem structure for large-scale applications

Multilevel problems

Full Multigrid Scheme

Combination of Mesh Refinement and V-cycles

Philippe Toint (Namur)

April 2009

245 / 323

The use of problem structure for large-scale applications

Multilevel problems

Return to optimization
Hierarchy of problem descriptions

Trust-region technique

&

Efficiency Robustness

Multilevel optimization method

Note: Multilevel Moré-Sorensen algorithm: (H_k + λI) s = −g_k


T-Tomanos-Weber Mendonca, 2009
Philippe Toint (Namur)

April 2009

246 / 323

The use of problem structure for large-scale applications

Multilevel problems

The framework
Assume that we have:
A hierarchy of problem descriptions of f :

{fi }ri=0

with

fr (x) = f (x)

Transfer operators, for i = 1, . . . , r :


Ri : IRni IRni1

(the restriction)

Pi : IRni1 IRni

(the prolongation)

Terminology: a particular i is referred to as a level


Philippe Toint (Namur)

April 2009

247 / 323

The use of problem structure for large-scale applications

Multilevel problems

The idea

min fr (x) = f (x)

xIRn

at xk :

minimize Taylor's model of f_r around x_k


in the trust region of radius k

or (whenever suitable and desirable)


at xk :

compute fr (xk ) (possibly Hk )

trial step sk

Restriction R

P Prolongation

use f_{r−1} to construct a coarse local model of f_r


and minimize it within the trust region of radius k

If more than two levels are available (r > 1), do this recursively
Philippe Toint (Namur)

April 2009

248 / 323

The use of problem structure for large-scale applications

Multilevel problems

Example of recursion with 5 levels (r = 4)


k

Level 4
R4
Level 3

P4
0

R3
Level 2

P3
0

R2
Level 1

P2
0

R1
Level 0

P1
0

Notation:

R2

P2
0

R1

P 1 R1
0

P1
0

i: level index (0 i r )

k: index of the current iteration within level i

Philippe Toint (Namur)

April 2009

249 / 323

The use of problem structure for large-scale applications

Multilevel problems

Construction of the coarse local models


If f_i ≠ 0 for i = 0, . . . , r − 1
Impose first-order coherence via a correction term:
g_low = R g_up
Impose second-order coherence (*) via two correction terms:
g_low = R g_up    and    H_low = R H_up P
(*) Not needed to derive first-order global convergence

If f_i = 0 for i = 0, . . . , r − 1
Galerkin model: restricted version of the quadratic model at the upper level
Philippe Toint (Namur)

April 2009

250 / 323
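One way to read the first-order coherence requirement: the coarse model is the coarse function plus a linear correction chosen so that its gradient at the starting point equals the restricted upper-level gradient. A hedged sketch of that construction (my own formulation of the idea, with assumed names):

import numpy as np

def coarse_model(f_low, grad_f_low, R, g_up, x_low0):
    """Build a first-order coherent coarse model h_low around x_low0.

    f_low, grad_f_low : the coarse-level description and its gradient
    R                 : restriction operator (matrix)
    g_up              : gradient of the upper-level model at the current point
    """
    g_restricted = R @ g_up
    v = g_restricted - grad_f_low(x_low0)        # linear correction term
    def h_low(x):
        return f_low(x) + v @ (x - x_low0)
    def grad_h_low(x):
        return grad_f_low(x) + v                 # equals R g_up at x = x_low0
    return h_low, grad_h_low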

The use of problem structure for large-scale applications

Multilevel problems

Preserving the trust-region constraint (1)

(figure: the lower-level iterates must remain inside the upper-level trust region)

∆_low^+ ≤ min( ∆_low , ∆_up − ||x_{low,k} − x_{low,0}|| )

Note: motivation to switch to the ∞-norm

Gratton, Sartenaer, T (2008)

Philippe Toint (Namur)

April 2009

251 / 323

The use of problem structure for large-scale applications

Multilevel problems

Preserving the trust-region constraint (2)

In infinity norm:

∆ ≤ min( ∆_low^+ , ∆_up − ||x_{low,k} − x_{low,0}||_∞ )

Gratton, Mouffe, T, Weber Mendonca (2008)

Philippe Toint (Namur)

April 2009

252 / 323

The use of problem structure for large-scale applications

Multilevel problems

Use the coarse model whenever suitable

When ||g_low|| := ||R g_up|| ≥ κ ||g_up||    (coarsening condition)

and

when ||g_low|| := ||R g_up|| > ε_low

and

when i > 0

Philippe Toint (Namur)

April 2009

253 / 323

The use of problem structure for large-scale applications

Multilevel problems

Use the coarse model whenever desirable

Taylor model (Taylor step)

Coarse model (recursive step)

smoothing

coarsening

&

Alternate for efficiency (multigrid)

Be as flexible as possible

Leave the choice even when the coarse model is suitable

Philippe Toint (Namur)

April 2009

254 / 323

The use of problem structure for large-scale applications

Multilevel problems

Recursive multilevel trust-region algorithm (RMTR)


At iteration k

(until convergence):

Choose either a Taylor or (if suitable) a coarse local model
(first-order coherent):
Taylor model: compute a Taylor step
Coarse local model: apply the algorithm recursively
Evaluate the change in the objective function
If achieved decrease ≥ fraction of predicted decrease, then
accept the trial point
possibly enlarge the trust region
else
keep the current point
shrink the trust region
Impose current trust region ⊆ upper-level trust region
Philippe Toint (Namur)

April 2009

255 / 323

The use of problem structure for large-scale applications

Multilevel problems

Global convergence
Based on the trust-region technology
Uses the sufficient decrease argument (imposed in Taylors iterations)
Plus the coarsening condition (||R g_up|| ≥ κ ||g_up||)
Main result

lim_{k→∞} ||g_{r,k}|| = 0

Gratton, Sartenaer, T (2008)


Philippe Toint (Namur)

April 2009

256 / 323

The use of problem structure for large-scale applications

In more details

Intermediate results
At iteration (i, k) we associate the set:

def

R(i, k) = {(j, `) | iteration (j, `) occurs within iteration (i, k)}

Level 4
R4
Level 3

P4
0

R3
Level 2

P3
0

R2
Level 1

P2
0

R1
Level 0
Philippe Toint (Namur)

P1
0

R2

P2
0

R1

P 1 R1
0

P1
0

April 2009

257 / 323

The use of problem structure for large-scale applications

In more details

Let

V(i, k) := { (j, ℓ) ∈ R(i, k) | the model decrease at iteration (j, ℓ) is at least a fixed multiple of ||g_{i,k}|| ∆_{j,ℓ} }    (sufficient decrease)

Then, at a non-critical point and if the trust region is small enough:

V(i, k) = R(i, k)

Back to classical trust-region arguments

Philippe Toint (Namur)

April 2009

258 / 323

The use of problem structure for large-scale applications

In more details

Premature termination
For a recursive iteration (i, k):
A minimization sequence at level i 1 initiated at iteration (i, k)
denotes all successive iterations at level i 1
until a return is made to level i
k

Level 4
R4
Level 3

R3
Level 2

P3
0

R2
Level 1

P2
0

R1
Level 0
Philippe Toint (Namur)

P1
0

R2

P2
0

R1

P 1 R1
0

P1
0

April 2009

259 / 323

The use of problem structure for large-scale applications

In more details

Properties of RMTR
Each minimization sequence contains at least one successful iteration
Premature termination in that case does not affect the convergence
results at the upper level
Which allows
Stop a minimization sequence after a preset number of successful
iterations
Use fixed lower-iterations patterns like the V or W cycles in multigrid
methods
Philippe Toint (Namur)

April 2009

260 / 323

The use of problem structure for large-scale applications

In more details

A practical RMTR algorithm: Taylor iterations

At the coarsest level


Solve using the exact Moré-Sorensen method
(small dimension)
At finer levels
Smooth using a smoothing technique from multigrid
(to reduce the high frequency residual/gradient components)

Philippe Toint (Namur)

April 2009

261 / 323

The use of problem structure for large-scale applications

In more details

SCM Smoothing

Adaptation of the Gauss-Seidel smoothing technique to optimization:


Sequential Coordinate Minimization (SCM smoothing)

Successive one-dimensional minimizations of the model


along the coordinate axes when positive curvature

Cost: 1 SCM smoothing cycle 1 matrix-vector product

Philippe Toint (Namur)

April 2009

262 / 323
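A sketch of one SCM smoothing cycle on a quadratic model m(s) = g^T s + ½ s^T H s (my own illustration; the trust-region and negative-curvature safeguards of the actual algorithm, discussed on the next slides, are omitted here):

import numpy as np

def scm_cycle(H, g, s):
    """One Sequential Coordinate Minimization cycle on m(s) = g.s + 0.5 s.H.s."""
    n = len(g)
    for i in range(n):
        grad_i = g[i] + H[i, :] @ s          # i-th component of the model gradient
        curv_i = H[i, i]                     # curvature along coordinate axis i
        if curv_i > 0:                       # minimize only along positive-curvature axes
            s[i] -= grad_i / curv_i          # exact one-dimensional minimizer along e_i
    return s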

The use of problem structure for large-scale applications

In more details

Three issues

How to impose sufficient decrease in the model ?

How to impose the trust-region constraint ?

What to do if a negative curvature is encountered ?

Philippe Toint (Namur)

April 2009

263 / 323

The use of problem structure for large-scale applications

In more details

Start the first SCM smoothing cycle


by minimizing along the largest gradient component
(enough to ensure sufficient decrease)
Perform (at most) p SCM smoothing cycles
while inside the trust region (reasonable cost)
Terminate
when an approximate minimizer is found (Stop)
when the trust-region boundary is passed (Stop at the boundary)
when a direction of negative curvature is encountered
(move to the boundary and Stop)
Philippe Toint (Namur)

April 2009

264 / 323

The use of problem structure for large-scale applications

In more details

Convergence to weak minimizers


SCM smoothing limits its exploration of the model's curvature to the
coordinate axes ⇒ only guarantees asymptotic positive curvature:
along the coordinate axes at the finest level (i = r)
along the prolongation of the coordinate axes at levels
i = 1, . . . , r − 1
along the prolongation of the coarsest subspace (i = 0)

Weak minimizers
Gratton, Sartenaer, T (2006)
Philippe Toint (Namur)

April 2009

265 / 323

The use of problem structure for large-scale applications

Numerical results

Some numerical flavor


Gratton, Mouffe, Sartenaer, T, Tomanos (2009)

All on Finest (AF)


Standard Newton trust-region algorithm (TCG)
Applied at the finest level
Multilevel on Finest (MF)
Algorithm RMTR
Applied at the finest level
Mesh Refinement (MR)
Standard Newton trust-region algorithm (TCG)
Applied successively from coarsest to finest level()
Full Multilevel (FM)
Algorithm RMTR
Applied successively from coarsest to finest level()
()

Starting point at level i + 1 obtained by prolongating the solution at level i


Philippe Toint (Namur)

April 2009

266 / 323

The use of problem structure for large-scale applications

Numerical results

Test problem characteristics


Problem name   nr          r   Type             Description
DNT            511         8   1-D, quadratic   Dirichlet-to-Neumann transfer problem
P2D            1.046.529   9   2-D, quadratic   Poisson model problem
P3D            250.047     5   3-D, quadratic   Poisson model problem
DEPT           1.046.529   9   2-D, quadratic   Elastic-plastic torsion problem
DPJB*          1.046.529   9   2-D, quadratic   Journal bearing problem
DODC           65.025      7   2-D, convex      Optimal design problem
MINS-SB        1.046.529   9   2-D, convex      Minimum surface problem
MINS-OB        65.025      7   2-D, convex      Minimum surface problem
MINS-DMSA      65.025      7   2-D, convex      Minimum surface problem
IGNISC         65.025      7   2-D, convex      Combustion problem
DSSC           1.046.529   9   2-D, convex      Combustion problem
BRATU          1.046.529   9   2-D, convex      Combustion problem
MINS-BC*       65.025      7   2-D, convex      Minimum surface problem
MEMBR*         393.984     9   2-D, convex      Membrane problem
NCCS           103.050     7   2-D, nonconvex   Optimal control problem
NCCO           103.050     7   2-D, nonconvex   Optimal control problem
MOREBV         1.046.529   9   2-D, nonconvex   Boundary value problem

*: with bound constraints


Philippe Toint (Namur)

April 2009

267 / 323

The use of problem structure for large-scale applications

Numerical results

Performance profiles (CPU time)


(performance profile plot comparing FM, MR, MF and AF in terms of CPU time)

Philippe Toint (Namur)

April 2009

268 / 323

The use of problem structure for large-scale applications

Numerical results

Zoom on efficiency (CPU time)

(zoomed performance profile comparing FM, MR, MF and AF)

Philippe Toint (Namur)
April 2009

269 / 323

The use of problem structure for large-scale applications

Numerical results

CPU times
Problem name
DNT
P2D
P3D
DEPT
DPJB
DODC
MINS-SB
MINS-OB
MINS-DMSA
IGNISC
DSSC
BRATU
MINS-BC
MEMBR
NCCS
NCCO
MOREBV

Best
Philippe Toint (Namur)

AF

5.2
1122.8
626.1
1364.4
3600.0
894.8
3600.0
1445.6
1196.8
2330.4
3183.8
2314.1
2706.4
1082.0
3600.0
3600.0
3600.0

MF
24.4

MR

72.8

569.7

47.5

18.3

4.6

69.5

95.4

1390.0

247.7

58.6
70.4
73.4
398.3

184.2
3600.0
116.7
289.6
488.2

1051.6
236.8

122.3
91.7

3600.0

161.8

524.6

335.2
3600.0
3600.0

292.4
279.5
3589.6

704.9

3600.0

FM
6.7

26.0
28.8
8.6
83.6
36
153.9
27.5
18.2
398.2
12.1
10.1
140.0
154.0
331.9
224.2
41.7

Second best
April 2009

270 / 323

The use of problem structure for large-scale applications

Numerical results

A glimpse at the solution process

Philippe Toint (Namur)

April 2009

271 / 323

The use of problem structure for large-scale applications

Bibliography

Bibliography for lesson 5 (1)


1 T. F. Coleman and J. J. Moré,
Estimation of sparse Jacobian matrices and graph coloring problems,
SIAM Journal on Numerical Analysis, 20:187-209, 1983.
2 T. F. Coleman and J. J. Moré,
Estimation of sparse Hessian matrices and graph coloring problems,
Mathematical Programming, 28:243-270, 1984.
3 A. R. Conn, N. I. M. Gould and Ph. L. Toint,
LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A),
Springer Verlag, Springer Series in Computational Mathematics 17, Heidelberg, 1992.
4 A. Curtis, M. J. D. Powell and J. Reid,
On The Estimation of Sparse Jacobian Matrices,
IMA Journal, 13:117-119, 1974.
5 M. Fisher,
Minimization Algorithms for Variational Data Assimilation,
in Recent Developments in Numerical Methods for Atmospheric Modelling, European Center for Medium-Range
Weather Forecasts, Reading, UK, pp. 364-385, 1998.
6 T. Frese, Ch. Bouman and K. Sauer,
Multiscale Bayesian Methods for Discrete Tomography,
in Discrete Tomography: Foundations, Algorithms and Applications (G. Herman and A. Kuba, eds.), Birkhauser,
Boston, pp. 237-261, 1999.
7 D. Goldfarb and Ph. L. Toint,
Optimal Estimation of Jacobian and Hessian Matrices That Arise in Finite Difference Calculations,
Mathematics of Computation, 43(167):69-88, 1984.
8 S. Gratton, M. Mouffe, Ph. L. Toint and M. Weber-Mendonca,
A recursive trust-region method in infinity norm for bound-constrained nonlinear optimization,
IMA Journal of Numerical Analysis, 28(4):827-861, 2008.
9 S. Gratton, M. Mouffe, A. Sartenaer, Ph. L. Toint and D. Tomanos,
Numerical Experience with a Recursive Trust-Region Method for Multilevel Nonlinear Optimization,
Optimization Methods and Software, to appear, 2009.
Philippe Toint (Namur)

April 2009

272 / 323

The use of problem structure for large-scale applications

Bibliography

Bibliography for lesson 5 (2)


10 S. Gratton, A. Sartenaer and Ph. L. Toint,
Recursive Trust-Region Methods for Multiscale Nonlinear Optimization, SIAM Journal on Optimization,
19(1):414-444, 2008.
11 S. Gratton, A. Sartenaer and Ph. L. Toint,
Second-order convergence properties of trust-region methods using incomplete curvature information, with an
application to multigrid optimization,
Journal of Computational and Applied Mathematics, 24(6):676-692, 2006.
12 S. Gratton and Ph. L. Toint,
Multi-Secant Equations, Approximate Invariant Subspaces and Multigrid Optimization,
FUNDP, Namur, Report 07/11, 2007.
13 A. Griewank and Ph. L. Toint,
On the unconstrained optimization of partially separable functions,
in Nonlinear Optimization 1981 (M. J. D. Powell, ed.), Academic Press, pp. 302-312, 1982.
14 S. G. Nash,
A Multigrid Approach to Discretized Optimization Problems,
Optimization Methods and Software, 14:99-116, 2000.
15 M. Lewis and S. G. Nash,
Model problems for the multigrid optimization of systems governed by differential equations,
SIAM Journal on Scientific Computing, 26(6):1811-1837, 2005.
16 S. Oh, A. Milstein, Ch. Bouman and K. Webb,
Multigrid algorithms for optimization and inverse problems,
in Computational Imaging(Ch. Bouman and R. Stevenson, eds.), DDM, Proceedings of the SPIE, 5016:59-70, 2003.
17 M. J. D. Powell and Ph. L. Toint,
On The Estimation of Sparse Hessian Matrices,
SIAM Journal on Numerical Analysis, 16(6):1060-1074, 1979.
18 Ph. L. Toint, D. Tomanos and M. Weber-Mendonca,
A multilevel algorithm for solving the trust-region subproblem,
Optimization Methods and Software, 24(2):299-311, 2009.
19 Ch. Gross and R. Krause,
On The Convergence of Recursive Trust-Region Methods for Multiscale Nonlinear Optimization and Applications to
Nonlinear Mechanics,
University of Bonn, Germany, 2008.
Philippe Toint (Namur)

April 2009

273 / 323

The use of problem structure for large-scale applications

Bibliography

Bibliography for lesson 5 (3)

20 Z. Wen and D. Goldfarb,


A Linesearch Multigrid Method for Large-Scale Convex Optimization,
Department of Industrial Engineering and Operations Research, Columbia University, New York, July 2007.

Philippe Toint (Namur)

April 2009

274 / 323

Regularization methods and nonlinear step control

Lesson 6:
Cubic and quadratic
regularization methods:
a path towards
nonlinear step control
Philippe Toint (Namur)

April 2009

275 / 323

Regularization methods and nonlinear step control

Outline

Regularization for unconstrained problems
1 cubic
2 quadratic

Nonlinear step control

Cubic regularization for constrained problems

Conclusions

Bibliography

Philippe Toint (Namur)

April 2009

276 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Regularization techniques
for unconstrained Problems

Philippe Toint (Namur)

April 2009

277 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The problem

We return to the unconstrained nonlinear programming problem:

minimize f(x)
for x ∈ IR^n and f : IR^n → IR smooth.

Important special case: the nonlinear least-squares problem
minimize f(x) = ½ ||F(x)||^2
for x ∈ IR^n and F : IR^n → IR^m smooth.

Philippe Toint (Namur)

April 2009

278 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Unconstrained optimization: a mature area?

minimize_{x ∈ IR^n} f(x)   where f ∈ C^1 (maybe C^2)

Currently two main competing (but similar) methodologies

Linesearch methods
compute a descent direction s_k from x_k
set x_{k+1} = x_k + α_k s_k to improve f

Trust-region methods
compute a step s_k from x_k to improve a model m_k of f
within the trust region ||s_k|| ≤ ∆_k
set x_{k+1} = x_k + s_k if m_k and f agree at x_k + s_k
otherwise set x_{k+1} = x_k and reduce the radius ∆_k

Philippe Toint (Namur)

April 2009

279 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

A useful theoretical observation


Consider trust-region method where
model = true objective function
Then
model and objective always agree
trust-region radius goes to infinity
a linesearch method
Nice consequence:

A unique convergence theory!


(Shultz/Schnabel/Byrd, 1985; T., 1988)

Philippe Toint (Namur)

April 2009

280 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The keys to convergence theory for trust regions


The Cauchy condition:

m_k(x_k) − m_k(x_k + s_k) ≥ κ_TR ||g_k|| min( ||g_k|| / (1 + ||H_k||), ∆_k )

The bound on the stepsize:

||s_k|| ≤ ∆_k

And we derive:
Global convergence to first/second-order critical points
Is there anything more to say?
Philippe Toint (Namur)

April 2009

281 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Is there anything more to say?


Observe the following: if
f has gradient g and globally Lipschitz continuous Hessian H with constant 2L,
then Taylor, Cauchy-Schwarz and Lipschitz imply

f(x + s) = f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩ + ∫_0^1 (1 − τ) ⟨s, [H(x + τs) − H(x)] s⟩ dτ
         ≤ f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩ + ⅓ L ||s||_2^3   =:  m(s)

⇒ reducing m from s = 0 improves f since m(0) = f(x).

Philippe Toint (Namur)

April 2009

282 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The cubic regularization


Change from

min_s  f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩   s.t. ||s|| ≤ ∆

to

min_s  f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩ + ⅓ σ ||s||^3

σ is the (adaptive) regularization parameter

(ideas from Griewank, Weiser/Deuflhard/Erdmann, Nesterov/Polyak, Cartis/Gould/T)

Philippe Toint (Namur)

April 2009

283 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Cubic regularization highlights


f(x + s) ≤ m(s) ≡ f(x) + s^T g(x) + ½ s^T H(x)s + ⅓ L ||s||_2^3

Nesterov and Polyak minimize m globally


N.B. m may be non-convex!
efficient scheme to do so if H has sparse factors

global (ultimately rapid) convergence to a 2nd-order critical point of f


better worst-case function-evaluation complexity than previously
known
Obvious questions:
can we avoid the global Lipschitz requirement?
can we approximately minimize m and retain good worst-case
function-evaluation complexity?
does this work well in practice?
Philippe Toint (Namur)

April 2009

284 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Cubic overestimation
Assume
f C2
f , g and H at xk are fk , gk and Hk
symmetric approximation Bk to Hk
Bk and Hk bounded at points of interest
Use
cubic overestimating model at xk
m_k(s) ≡ f_k + s^T g_k + ½ s^T B_k s + ⅓ σ_k ||s||_2^3
σ_k is the iteration-dependent regularisation weight
easily generalized for regularisation in the M_k-norm ||s||_{M_k} = (s^T M_k s)^{1/2}
where Mk is uniformly positive definite

Philippe Toint (Namur)

April 2009

285 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Adaptive Regularization with Cubic (ARC)


Algorithm 6.1: The ARC Algorithm
Step 0: Initialization: x_0 and σ_0 > 0 given. Set k = 0.
Step 1: Step computation: compute s_k for which m_k(s_k) ≤ m_k(s_k^C),
        where the Cauchy point is s_k^C = −α_k^C g_k with α_k^C = arg min_{α ∈ IR_+} m_k(−α g_k).
Step 2: Step acceptance: compute ρ_k = [ f(x_k) − f(x_k + s_k) ] / [ f(x_k) − m_k(s_k) ]
        and set x_{k+1} = x_k + s_k if ρ_k > 0.1, x_{k+1} = x_k otherwise.
Step 3: Update the regularization parameter:
        σ_{k+1} ∈ (0, σ_k]            (e.g. = ½ σ_k)   if ρ_k > 0.9         (very successful)
        σ_{k+1} ∈ [σ_k, γ_1 σ_k]      (e.g. = σ_k)     if 0.1 ≤ ρ_k ≤ 0.9   (successful)
        σ_{k+1} ∈ [γ_1 σ_k, γ_2 σ_k]  (e.g. = 2 σ_k)   if ρ_k < 0.1         (unsuccessful)
Philippe Toint (Namur)

April 2009

286 / 323
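A minimal runnable sketch of this loop, using only the Cauchy point as the step (so the requirement m_k(s_k) ≤ m_k(s_k^C) holds trivially); the concrete σ updates follow the slide, everything else is my own illustration.

import numpy as np

def cauchy_step(g, B, sigma):
    """Exact minimizer of m(-alpha g) over alpha > 0 for the cubic model."""
    gnorm = np.linalg.norm(g)
    a, b, c = sigma * gnorm ** 3, g @ B @ g, gnorm ** 2
    alpha = (-b + np.sqrt(b ** 2 + 4 * a * c)) / (2 * a)   # positive root of m'(alpha) = 0
    return -alpha * g

def arc(f, grad, hess, x, sigma=1.0, tol=1e-6, max_iter=500):
    """Adaptive Regularization with Cubics (sketch of Algorithm 6.1, Cauchy steps only)."""
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        s = cauchy_step(g, B, sigma)               # any s with m(s) <= m(s^C) would do
        model_decrease = -(g @ s + 0.5 * s @ B @ s
                           + sigma / 3.0 * np.linalg.norm(s) ** 3)
        rho = (f(x) - f(x + s)) / max(model_decrease, 1e-16)
        if rho > 0.1:                              # accept the trial point
            x = x + s
        sigma = 0.5 * sigma if rho > 0.9 else (2.0 * sigma if rho < 0.1 else sigma)
    return x

# tiny usage example
f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
hess = lambda x: np.diag([2.0, 20.0])
print(arc(f, grad, hess, np.zeros(2)))             # should approach (1, -2)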

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Local convergence theory for cubic regularization (1)


The Cauchy condition:

m_k(x_k) − m_k(x_k + s_k) ≥ κ_CR ||g_k|| min( ||g_k|| / (1 + ||H_k||), (||g_k|| / σ_k)^{1/2} )

The bound on the stepsize:

||s_k|| ≤ 3 max( ||H_k|| / σ_k, (||g_k|| / σ_k)^{1/2} )

(Cartis/Gould/T)

Philippe Toint (Namur)

April 2009

287 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Local convergence theory for cubic regularization (2)


And therefore. . .

lim_{k→∞} ||g_k|| = 0        (first-order global convergence)

Under stronger assumptions one can show that,
if s_k minimizes m_k over a subspace with orthogonal basis Q_k,

lim Q_k^T H_k Q_k ⪰ 0        (second-order global convergence)


Philippe Toint (Namur)

April 2009

288 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Fast convergence
For fast asymptotic convergence ⇒ need to improve on the Cauchy point:
minimize over Krylov subspaces
g stopping-rule: ||∇_s m_k(s_k)|| ≤ min(1, ||g_k||^{1/2}) ||g_k||
s stopping-rule: ||∇_s m_k(s_k)|| ≤ min(1, ||s_k||) ||g_k||
If B_k satisfies the Dennis-Moré condition
||(B_k − H_k) s_k|| / ||s_k|| → 0 whenever ||g_k|| → 0
and x_k → x* with positive definite H(x*)
⇒ Q-superlinear convergence of x_k under the g- and s-rules
If additionally H(x) is locally Lipschitz around x* and ||(B_k − H_k) s_k|| = O(||s_k||^2)
⇒ Q-quadratic convergence of x_k under the s-rule

Philippe Toint (Namur)

April 2009

289 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Function-evaluation complexity
How many function evaluations (iterations) are needed to ensure that ||g_k|| ≤ ε?

So long as, for very successful iterations, σ_{k+1} ≥ γ_3 σ_k for some γ_3 < 1,
⇒ the basic ARC algorithm requires at most
⌈ κ_C / ε^2 ⌉ function evaluations
for some κ_C independent of ε        (c.f. steepest descent)

If H is globally Lipschitz, the s-rule is applied and additionally s_k is
the global (line) minimizer of m_k(α s_k) as a function of α,
⇒ the ARC algorithm requires at most
⌈ κ_S / ε^{3/2} ⌉ function evaluations
for some κ_S independent of ε        (c.f. Nesterov & Polyak)
Philippe Toint (Namur)


April 2009

290 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Minimizing the model


m(s) ≡ f + s^T g + ½ s^T B s + ⅓ σ ||s||_2^3

Derivatives (with λ = σ ||s||_2):

∇_s m(s) = g + B s + λ s

∇_ss m(s) = B + λ I + λ (s/||s||)(s/||s||)^T

Optimality: any global minimizer s* of m satisfies

(B + λ* I) s* = −g with λ* = σ ||s*||_2
and B + λ* I positive semi-definite
Philippe Toint (Namur)

April 2009

291 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The (adapted) secular equation


Require

(B + λI) s = −g    and    λ = σ ||s||_2

Define s(λ) by
(B + λI) s(λ) = −g
and find the scalar λ as the root of a secular equation, e.g.

||s(λ)||_2 − λ/σ = 0    or    1/||s(λ)||_2 − σ/λ = 0

(or another equivalent rootfinding reformulation)

values and derivatives of s(λ) satisfy linear systems with the symmetric
positive definite matrix B + λI
⇒ need to be able to factorize B + λI

Philippe Toint (Namur)

April 2009

292 / 323
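A sketch of one convenient way to solve such a secular equation: Newton's method on φ(λ) = 1/||s(λ)|| − σ/λ, factorizing B + λI at each trial λ. This is an illustration under the assumption that B + λI stays positive definite along the iteration; the safeguards of a production solver are omitted.

import numpy as np

def cubic_subproblem(B, g, sigma, lam=1.0, tol=1e-10, max_iter=50):
    """Solve (B + lam I) s = -g with lam = sigma * ||s|| by Newton's method on
    phi(lam) = 1/||s(lam)|| - sigma/lam."""
    n = len(g)
    for _ in range(max_iter):
        L = np.linalg.cholesky(B + lam * np.eye(n))   # factorize B + lam I
        s = np.linalg.solve(B + lam * np.eye(n), -g)  # s(lam)
        ns = np.linalg.norm(s)
        phi = 1.0 / ns - sigma / lam
        if abs(phi) <= tol:
            break
        # d||s||/dlam = - s^T (B + lam I)^{-1} s / ||s||, computed via the factor L
        w = np.linalg.solve(L, s)                     # L w = s
        dns = -(w @ w) / ns
        dphi = -dns / ns ** 2 + sigma / lam ** 2
        lam -= phi / dphi                             # Newton step on phi
        lam = max(lam, 1e-12)                         # keep lam positive
    return s, lam

# small positive definite example
B = np.diag([1.0, 10.0])
g = np.array([1.0, 1.0])
print(cubic_subproblem(B, g, sigma=1.0))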

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Plots of secular functions against λ

Example: g = (0.25, 1)^T, H = diag(1, 1) and σ = 2
(plots of the alternative secular functions against λ)

Philippe Toint (Namur)

April 2009

293 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Large problems approximate solutions


Seek instead the global minimizer of m(s) in a j-dimensional (j ≪ n) subspace S ⊆ IR^n
g ∈ S ⇒ ARC algorithm globally convergent
Q orthogonal basis for S

s = Q u where u = arg min_{u ∈ IR^j} f + u^T (Q^T g) + ½ u^T (Q^T B Q) u + ⅓ σ ||u||_2^3

⇒ use the secular equation to find u
if S is the Krylov space generated by {B^i g}_{i=0}^{j−1}
⇒ Q^T B Q = T, tridiagonal
⇒ can factor T + λI to solve the secular equation even if j is large
using the g- or s-stopping rule ⇒ fast asymptotic convergence for ARC
using the s-stopping rule ⇒ good function-evaluation complexity for ARC
Philippe Toint (Namur)

April 2009

294 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The main features of adaptive cubic regularization

And the result is. . .


longer steps on ill-conditioned problems
similar (very satisfactory) convergence analysis
best function-evaluation complexity for nonconvex problems
excellent performance and reliability

Philippe Toint (Namur)

April 2009

295 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Numerical experience small problems using Matlab


Performance Profile: iteration count, 131 CUTEr problems
(y-axis: fraction of problems for which a method is within a factor of the best;
curves: ACO with the g stopping rule (3 failures), ACO with the s stopping rule (3 failures),
trust-region (8 failures))

Philippe Toint (Namur)

April 2009

296 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The quadratic regularization for NLS (ARQ)


Consider the Gauss-Newton method for nonlinear least-squares problems.
Change from

min_s  ½ ||c(x)||^2 + ⟨s, J(x)^T c(x)⟩ + ½ ⟨s, J(x)^T J(x) s⟩   s.t. ||s|| ≤ ∆

to

min_s  ||c(x) + J(x) s|| + ½ σ ||s||^2

σ is the (adaptive) regularization parameter

(idea by Nesterov)
Philippe Toint (Namur)

April 2009

297 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Quadratic regularization: reformulation


Note that

min_s  ||c(x) + J(x) s|| + ½ σ ||s||^2
⇔
min_{μ,s}  μ + ½ σ ||s||^2   such that   ||c(x) + J(x) s||^2 = μ^2

an exact penalty function for the problem of minimizing ||s|| subject to c(x) + J(x) s = 0.
Iterative techniques. . . as for the cubic case (Cartis, Gould, T.):
solve the problem in nested Krylov subspaces
Lanczos factorization of tridiagonal matrices
different scalar secular equation (solution by Newton's method)
Philippe Toint (Namur)

April 2009

298 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The keys to convergence theory for quadratic regularization

The Cauchy condition:

m(x_k) − m(x_k + s_k) ≥ κ_QR (||J_k^T c_k|| / ||c_k||) min( ||J_k^T c_k|| / (1 + ||J_k^T J_k||), ||J_k^T c_k|| / (σ_k ||c_k||) )

The bound on the stepsize:

||s_k|| ≤ ||J_k^T c_k|| / (2 σ_k ||c_k||)

Philippe Toint (Namur)

April 2009

299 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Convergence theory for the quadratic regularization


Convergence results:
Global convergence to first-order critical points

Quadratic convergence to roots


Valid for
general values of m and n,
exact/approximate subproblem solution
(Bellavia/Cartis/Gould/Morini/T.)

Philippe Toint (Namur)

April 2009

300 / 323

Regularization methods and nonlinear step control

Nonlinear stepsize control

6.2: A unifying concept:


nonlinear stepsize control

Philippe Toint (Namur)

April 2009

301 / 323

Regularization methods and nonlinear step control

Nonlinear stepsize control

Towards a unified global convergence theory

Objectives:
recover a unified global convergence theory
possibly open the door for new algorithms
Idea:
cast all three methods into a generalized TR framework
allow this TR to be updated nonlinearly

Philippe Toint (Namur)

April 2009

302 / 323

Regularization methods and nonlinear step control

Nonlinear stepsize control

Towards a unified global convergence theory (2)


Given
three continuous first-order criticality measures of x,
an adaptive stepsize parameter,
define a generalized radius ∆(·, ·) such that
∆(·, ·) is C^1, strictly increasing and concave in its first argument,
∆(0, ·) = 0,
∆(·, ·) is non-increasing in its second argument,
...

Philippe Toint (Namur)

April 2009

303 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

6.3: Cubic regularization


for constrained problems

Philippe Toint (Namur)

April 2009

304 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

The constrained case


Can we apply regularization to the constrained case?
Consider the constrained nonlinear programming problem:
minimize f (x)
x F
for x IRn and f : IRn IR smooth, and where
F is convex.
Main ideas:
exploit (cheap) projections on convex sets
define using the generalized Cauchy point idea
prove global convergence + function-evaluation complexity
Philippe Toint (Namur)

April 2009

305 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Constrained step computation (1)

min_s  f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩ + ⅓ σ ||s||^3    subject to   x + s ∈ F

σ is the (adaptive) regularization parameter

Criticality measure: (as before)

χ(x) := | min_{x+d ∈ F, ||d|| ≤ 1} ⟨∇_x f(x), d⟩ |

Philippe Toint (Namur)

April 2009

306 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

The generalized Cauchy point for ARC


Cauchy step: Goldstein-like piecewise linear search on m_k along the
gradient path projected onto F
Find

x_k^GC = P_F[x_k − t_k^GC g_k] = x_k + s_k^GC    (t_k^GC > 0)

such that

m_k(x_k^GC) ≤ f(x_k) + κ_ubs ⟨g_k, s_k^GC⟩        (below linear approximation)

and either

m_k(x_k^GC) ≥ f(x_k) + κ_lbs ⟨g_k, s_k^GC⟩        (above linear approximation)

or

||P_{T(x_k^GC)}[g_k]|| ≤ κ_epp |⟨g_k, s_k^GC⟩|    (close to the path's end)

no trust-region condition!
Philippe Toint (Namur)

April 2009

307 / 323
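The search itself is a simple bracketing loop on the step length t. A hedged sketch follows: the projection operator, the model and the κ constants are assumed inputs, and the tangent-cone test of the last condition is not implemented, only the Goldstein-like bracketing.

import numpy as np

def generalized_cauchy_point(x, g, model, project, kappa_ubs=0.1, kappa_lbs=0.9,
                             t0=1.0, max_expand=30):
    """Goldstein-like piecewise search for x_GC = P_F[x - t g] on a cubic model.
    model(s) returns m_k(x_k + s); project(y) returns P_F[y]."""
    f0 = model(np.zeros_like(x))                  # m_k(x_k) = f(x_k)
    t, t_lo, t_hi = t0, 0.0, np.inf
    for _ in range(max_expand):
        s = project(x - t * g) - x                # projected-gradient step
        m = model(s)
        slope = g @ s                             # (negative) linear decrease predicted along s
        if m > f0 + kappa_ubs * slope:            # not below the linear approximation: shrink t
            t_hi = t
        elif m < f0 + kappa_lbs * slope:          # too far below it: the path can be followed further
            t_lo = t
        else:
            break                                 # both Goldstein-like conditions hold
        t = 2.0 * t if t_hi == np.inf else 0.5 * (t_lo + t_hi)
    return x + s, t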

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Searching for the ARC-GCP


(figure: the piecewise linear search for the generalized Cauchy point of a
three-dimensional cubic-regularized model along the projected gradient path,
with the constraint s ≤ 1.5)

Philippe Toint (Namur)

April 2009

308 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Remember the same for a quadratic model?


(figure: the analogous piecewise search for a quadratic model within a trust region,
with s ≤ 1.5 and radius 2.8)

Philippe Toint (Namur)

April 2009

309 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

A constrained regularized algorithm


Algorithm 6.2: ARC for Convex Constraints (COCARC)
Step 0: Initialization. x_0 ∈ F, σ_0 given. Compute f(x_0), set k = 0.
Step 1: Generalized Cauchy point. If x_k is not critical, find the
generalized Cauchy point x_k^GC by piecewise linear search on the
regularized cubic model.
Step 2: Step calculation. Compute s_k and x_k^+ = x_k + s_k ∈ F such
that m_k(x_k^+) ≤ m_k(x_k^GC).
Step 3: Acceptance of the trial point. Compute f(x_k^+) and ρ_k.
If ρ_k ≥ η_1, then x_{k+1} = x_k + s_k; otherwise x_{k+1} = x_k.
Step 4: Regularisation parameter update. Set
σ_{k+1} ∈ (0, σ_k]            if ρ_k ≥ η_2,
σ_{k+1} ∈ [σ_k, γ_1 σ_k]      if ρ_k ∈ [η_1, η_2),
σ_{k+1} ∈ [γ_1 σ_k, γ_2 σ_k]  if ρ_k < η_1.
Philippe Toint (Namur)

April 2009

310 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Local convergence theory for COCARC


The Cauchy condition:

m_k(x_k) − m_k(x_k + s_k) ≥ κ_CR χ_k min( χ_k / (1 + ||H_k||), (χ_k / σ_k)^{1/2}, 1 )

The bound on the stepsize:

||s_k|| ≤ 3 max( ||H_k|| / σ_k, (χ_k / σ_k)^{1/2}, (χ_k / σ_k)^{1/3} )

And therefore. . .

lim_{k→∞} χ_k = 0

(Cartis/Gould/T)
Philippe Toint (Namur)

April 2009

311 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Function-Evaluation Complexity for COCARC (1)


But what about function-evaluation complexity?
If, for very successful iterations, σ_{k+1} ≥ γ_3 σ_k for some γ_3 < 1,
the COCARC algorithm requires at most
⌈ κ_C / ε^2 ⌉ function evaluations
(for some κ_C independent of ε) to achieve χ_k ≤ ε
c.f. steepest descent
Do the nicer bounds for unconstrained optimization extend to the constrained case?
Philippe Toint (Namur)

April 2009

312 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Function-evaluation complexity for COCARC (2)

As for unconstrained, impose a termination rule on the subproblem solution:
Do not terminate solving min_{x_k + s ∈ F} m_k(x_k + s) before

χ_k^m(x_k^+) ≤ min( κ_stop, ||s_k|| ) χ_k

where

χ_k^m(x) := | min_{x+d ∈ F, ||d|| ≤ 1} ⟨∇_x m_k(x), d⟩ |

c.f. the s-rule for unconstrained

Note: OK at local constrained model minimizers

Philippe Toint (Namur)

April 2009

313 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Walking through the pass...


A "beyond the pass" constrained problem:
(contour plot of a nonconvex cubic-regularized model m(x, y) over the feasible region,
showing x_k, the projected gradient direction and the constrained minimizer x^+)

Philippe Toint (Namur)
April 2009

314 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Walking through the pass...with a sherpa


A piecewise descent path from x_k to x_k^+ on the same model
(figure: the path passes through the intermediate points x_{k,a} and x_{k,c})

Philippe Toint (Namur)
April 2009

315 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Function-Evaluation Complexity for COCARC (2)


Assume also
x_k → x_k^+ in a bounded number of feasible descent substeps
||H_k − ∇_xx f(x_k)|| ≤ κ ||s_k||^2
∇_xx f(·) is globally Lipschitz continuous
{x_k} bounded

The COCARC algorithm requires at most
⌈ κ_C / ε^{3/2} ⌉ function evaluations
(for some κ_C independent of ε) to achieve χ_k ≤ ε

Caveat: cost of solving the subproblem
c.f. unconstrained case!!!
Philippe Toint (Namur)


April 2009

316 / 323

Regularization methods and nonlinear step control

Conclusions

Conclusions for lesson 6

Much left to do. . . but very interesting


Unconstrained nonlinear stepsize control could lead to very untypical
methods (for instance, with all three criticality measures taken equal to ||g_k||
and a square-root radius function).
Meaningful numerical evaluation still needed for many of these
algorithms
Many issues regarding regularizations still unresolved

Philippe Toint (Namur)

April 2009

317 / 323

Regularization methods and nonlinear step control

Bibliography

Bibliography for lesson 6


1 R. H. Byrd, R. B. Schnabel and G. A. Shultz,
A trust region algorithm for nonlinearly constrained optimization,
SIAM Journal on Numerical Analysis, 24:1152-1170, 1987.
2 Ph. L. Toint,
Global convergence of a class of trust region methods for nonconvex minimization in Hilbert space,
IMA Journal of Numerical Analysis, 8(2):231-252, 1988.
3 A. Griewank,
The modification of Newton's method for unconstrained optimization by bounding cubic terms,
Department of Applied Mathematics and Theoretical Physics, University of Cambridge (UK), Report NA/12, 1981.
4 M. Weiser, P. Deuflhard and B. Erdmann,
Affine conjugate adaptive Newton methods for nonlinear elastomechanics,
Optimization Methods and Software, 22(3):413-431, 2007.
5 Yu. Nesterov and B. T. Polyak,
Cubic regularization of Newton method and its global performance,
Mathematical Programming, 108(1):177-205, 2006.
6 C. Cartis, N. I. M. Gould and Ph. L. Toint,
Adaptive cubic overestimation methods for unconstrained optimization,
FUNDP, Namur, Report 07/05, 2007.
7 C. Cartis, N. I. M. Gould and Ph. L. Toint,
Trust-region and other regularisation of linear least-squares problems,
BIT, to appear, 2009.
8 Yu. Nesterov,
Modified Gauss-Newton scheme with worst-case guarantees for global performance,
Optimization Methods and Software, 22(3):469-483, 2007.
9 S. Bellavia, C. Cartis, N. I. M. Gould, B. Morini and Ph. L. Toint,
Convergence of a Regularized Euclidean Residual Algorithm for Nonlinear Least-Squares,
FUNDP, Namur, Report 08/11, 2008.
10 C. Cartis, N. I. M. Gould and Ph. L. Toint,
An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its
function-evaluation complexity,
FUNDP, Namur, Report 08/05R, 2009.
Philippe Toint (Namur)

April 2009

318 / 323

Conclusions

Not covered in the course

non-smooth techniques
specifically convex problems
penalty functions
augmented Lagrangians
affine scaling methods
general sequential quadratic programming (SQP)
systems of nonlinear equations
...

Many thanks to you all for your patience!

Philippe Toint (Namur)

April 2009

319 / 323
