
Advanced Algorithms in Nonlinear Optimization

Philippe Toint
Department of Mathematics, University of Namur, Belgium
( philippe.toint@fundp.ac.be )

Belgian Francqui Chair, Leuven, April 2009

Outline

Nonlinear optimization: motivation, past and perspectives

Trust region methods for unconstrained problems

Derivative free optimization, filters and other topics

Convex constraints and interior-point methods

The use of problem structure for large-scale applications

Regularization methods and nonlinear step control

Conclusions

Philippe Toint (Namur)

April 2009

2 / 323

Acknowledgements
This course would not have been possible without
the Francqui Foundation and the Katholieke Universiteit Leuven,
Moritz Diehl, Dirk Roose and Stefan Vandewalle (the gentle
organizers),
Fabian Bastin, Stefania Bellavia, Cinzia Cirillo, Coralia Cartis, Andy
Conn, Nick Gould, Serge Gratton, Sven Leyffer, Vincent Malmedy,
Benedetta Morini, Melodie Mouffe, Annick Sartenaer, Katya
Scheinberg, Dimitri Tomanos, Melissa Weber-Mendonca (my patient
co-authors).
Ke Chen, Patrick Laloyaux (who supplied pictures)

My grateful thanks to them all.

Philippe Toint (Namur)

April 2009

3 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

What is optimization?

The best choice subject to constraints

best: criterion, objective function
choice: variables whose value may be chosen
constraints: restrictions on allowed values of the variables

April 2009

4 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

More formally
variables:           $x = (x_1, x_2, \ldots, x_n)$
objective function:  minimize/maximize $f(x)$
constraints:         $c(x) \geq 0$

Note: maximizing $f(x)$ is equivalent to minimizing $-f(x)$.

    $\min_x f(x)$   such that   $c(x) \geq 0$

(the general nonlinear optimization problem)
(+ conditions on x, f and c)
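To make the abstract formulation concrete, here is a minimal sketch (not from the slides) of posing and solving one small instance of this problem numerically; it assumes SciPy is available, and the particular objective and constraint are invented for illustration.

```python
# Minimal sketch: min f(x) subject to c(x) >= 0, using SciPy (assumed available).
# The objective and constraint below are illustrative only, not from the slides.
import numpy as np
from scipy.optimize import minimize

def f(x):
    # a simple nonlinear objective
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] - x[0] ** 2) ** 2

def c(x):
    # a single inequality constraint, interpreted as c(x) >= 0
    return np.array([1.5 - x[0] ** 2 - x[1] ** 2])

result = minimize(f, x0=np.array([0.0, 0.0]),
                  constraints=[{"type": "ineq", "fun": c}])
print(result.x, result.fun)
```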
Philippe Toint (Namur)

April 2009

5 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Nature optimizes

Philippe Toint (Namur)

April 2009

6 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

People optimize (daily)

Philippe Toint (Namur)

April 2009

7 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: PAL design (1)


Design of modern Progressive Adaptive Lenses:
vary optical power of lenses while minimizing astigmatism

[figure: a) power, b) astigmatism]

Loos, Greiner, Seidel (1997)


Philippe Toint (Namur)

April 2009

8 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: PAL design (2)


Achievements: Loos, Greiner, Seidel (1997)

uncorrected
long distance
Philippe Toint (Namur)

short distance
PAL
April 2009

9 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: PAL design (3)


Is this nonlinear ($\Rightarrow$ difficult)?
Assume the lens surface is $z = z(x, y)$. The optical power is
    $p(x,y) = \frac{N^3}{2}\left[\left(1 + \left(\frac{\partial z}{\partial x}\right)^2\right)\frac{\partial^2 z}{\partial y^2} + \left(1 + \left(\frac{\partial z}{\partial y}\right)^2\right)\frac{\partial^2 z}{\partial x^2} - 2\,\frac{\partial z}{\partial x}\frac{\partial z}{\partial y}\frac{\partial^2 z}{\partial x \partial y}\right]$
where
    $N = N(x,y) = \frac{1}{\sqrt{1 + \left[\frac{\partial z}{\partial x}\right]^2 + \left[\frac{\partial z}{\partial y}\right]^2}}.$

The surface astigmatism is then
    $a(x,y) = 2\sqrt{\,p(x,y)^2 - N^4\left(\frac{\partial^2 z}{\partial x^2}\frac{\partial^2 z}{\partial y^2} - \left(\frac{\partial^2 z}{\partial x \partial y}\right)^2\right)}$
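As a rough illustration of why this is nonlinear in the unknown surface, the sketch below evaluates the reconstructed formulas for $p(x,y)$ and $a(x,y)$ on a sampled surface $z(x,y)$ using finite differences; the surface chosen here is arbitrary, and the code is only a sanity check of the formulas above, not the PAL design procedure.

```python
# Sketch: evaluate optical power p(x,y) and astigmatism a(x,y) of a sampled
# surface z(x,y) with finite differences (illustrative surface, not a real lens).
import numpy as np

x = np.linspace(-1.0, 1.0, 101)
y = np.linspace(-1.0, 1.0, 101)
X, Y = np.meshgrid(x, y, indexing="ij")
Z = 0.1 * X**2 + 0.05 * Y**2 + 0.02 * X * Y     # arbitrary smooth surface

zx, zy = np.gradient(Z, x, y)                   # first derivatives
zxx, zxy = np.gradient(zx, x, y)                # second derivatives
_, zyy = np.gradient(zy, x, y)

N = 1.0 / np.sqrt(1.0 + zx**2 + zy**2)
p = 0.5 * N**3 * ((1.0 + zx**2) * zyy + (1.0 + zy**2) * zxx - 2.0 * zx * zy * zxy)
a = 2.0 * np.sqrt(np.maximum(p**2 - N**4 * (zxx * zyy - zxy**2), 0.0))

print(p.mean(), a.mean())
```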
April 2009

10 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: Food sterilization (1)


A common problem in the food processing industry:
keep a max of vitamins while killing a prescribed fraction of the bacteria
heating in steam/hot water autoclaves

Sachs (2003)

Philippe Toint (Namur)

April 2009

11 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: Food sterilization (2)


Model: coupled PDEs
Concentration of micro-organisms and other nutrients:
    $\frac{\partial C}{\partial t}(x,t) = -K[\theta(x,t)]\, C(x,t),$
where $\theta(x,t)$ is the temperature, and where
    $K[\theta] = K_1\, e^{-K_2\left(\frac{1}{\theta} - \frac{1}{\theta_{\rm ref}}\right)}$   (Arrhenius equation)

Evolution of temperature:
    $c(\theta)\,\frac{\partial \theta}{\partial t} = \nabla \cdot [\,k(\theta)\,\nabla\theta\,],$

(with suitable boundary conditions: coolant, initial temperature,. . . )


Philippe Toint (Namur)

April 2009

12 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: biological parameter estimation (1)


K-channel in the model of a neuron membrane:

Sansom (2001)

Doyle et al. (1998)


Philippe Toint (Namur)

April 2009

13 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: biological parameter estimation (2)


Where are these neurons?

in a Pacific spiny lobster!

Simmers, Meyrand and Moulin (1995)


Philippe Toint (Namur)

April 2009

14 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: biological parameter estimation (3)


After gathering experimental data (applying a current to the cell):
estimate the biological model parameters that best fit experiments
Model:
Activation: p independent gates
Deactivation: nh gates with different dynamics
nh + 2 coupled ODEs for the voltage, the activation level, the partial
inactivation levels
5-point BDF for 50,000 time steps
⇒ very nonlinear!

Philippe Toint (Namur)

April 2009

15 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: data assimilation for weather forecasting (1)

(Attempt to) predict. . .


tomorrow's weather
the ocean's average temperature
next month
future gravity field
future currents in the ionosphere
...

Philippe Toint (Namur)

April 2009

16 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: data assimilation for weather forecasting (2)


Data: temperature, wind, pressure, . . . everywhere and at all times!

May involve up to 250000000 variables!


Philippe Toint (Namur)

April 2009

17 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: data assimilation for weather forecasting (3)

The principle: minimize the deviation between the model and past observations
Known situation 2.5 days ago and background prediction
Record temperature for the past 2.5 days
Run the model to minimize the difference between model and observations
Predict temperature for the next day

[figure: temperature vs. days over the 2.5-day assimilation window and the forecast]

    $\min_{x_0}\ \tfrac12 \|x_0 - x_b\|^2_{B^{-1}} + \tfrac12 \sum_{i=0}^{N} \|H M(t_i, x_0) - b_i\|^2_{R_i^{-1}}.$
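A minimal sketch of the data-assimilation objective reconstructed above, assuming a toy linear model M and observation operator H; every matrix and vector below is invented for illustration.

```python
# Sketch: evaluate the 4D-Var-style cost
#   J(x0) = 1/2 ||x0-xb||^2_{B^-1} + 1/2 sum_i ||H M(t_i,x0) - b_i||^2_{R_i^-1}
# with a toy linear model; all data here are illustrative only.
import numpy as np

n, N = 4, 5
rng = np.random.default_rng(0)
B = np.eye(n)                                        # background error covariance
R = [np.eye(2) for _ in range(N + 1)]                # observation error covariances
H = rng.standard_normal((2, n))                      # observation operator
A = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # one model time step
xb = rng.standard_normal(n)                          # background state
b = [rng.standard_normal(2) for _ in range(N + 1)]   # observations

def M(i, x0):
    # propagate the state i steps forward with the toy linear model
    return np.linalg.matrix_power(A, i) @ x0

def J(x0):
    d0 = x0 - xb
    val = 0.5 * d0 @ np.linalg.solve(B, d0)
    for i in range(N + 1):
        r = H @ M(i, x0) - b[i]
        val += 0.5 * r @ np.linalg.solve(R[i], r)
    return val

print(J(xb))
```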

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: data assimilation for weather forecasting (4)


Analysis of the ocean's heat content:

CERFACS (2009)

Much better fit!


Philippe Toint (Namur)

April 2009

19 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: aeronautical structure design


minimize weight while maintaining structural integrity

mass reduction during optimization


SAMTECH (2009)
Philippe Toint (Namur)

April 2009

20 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: asteroid trajectory matching


find today's asteroid whose orbital parameters
best match one observed 50 years ago

Milani, Sansaturio et al. (2005)


Philippe Toint (Namur)

April 2009

21 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: discrete choice modelling (1)


Context: simulation of individual choices in Transportation (or other)
(mode, route, time of departure,. . . )
Random utility theory
An individual i assigns to alternative j the utility

$U_{ij}$ = [ parameters $\times$ explaining factors ] + [ random error ]

Illustration :

$U_{\rm bus}$ = distance $-\,1.2\,\cdot$ price of ticket $-\,2.1\,\cdot$ delay wrt car travel $+\ \varepsilon$
Philippe Toint (Namur)

April 2009

22 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: discrete choice modelling (2)

Probability that individual i chooses alternative j rather than
alternative k given by
    ${\rm prob}\,(U_{ij} \geq U_{ik} \text{ for all } k)$
Data: mobility surveys (MOBEL)

find the parameters in the utility function to


maximize likelihood of observed behaviours
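As a hedged illustration of this maximum-likelihood step, the sketch below writes the log-likelihood of a simple multinomial logit model, a common special case of random utility models (not necessarily the exact specification used on these slides); the data and utility specification are invented.

```python
# Sketch: log-likelihood of a multinomial logit model (illustrative data).
# U_ij = x_ij . beta + error;  P(i chooses j) = exp(V_ij) / sum_k exp(V_ik).
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_alt, n_feat = 50, 3, 2
X = rng.standard_normal((n_obs, n_alt, n_feat))   # explanatory factors
choices = rng.integers(0, n_alt, size=n_obs)      # observed choices

def log_likelihood(beta):
    V = X @ beta                                  # deterministic utilities
    V -= V.max(axis=1, keepdims=True)             # numerical stabilisation
    logp = V - np.log(np.exp(V).sum(axis=1, keepdims=True))
    return logp[np.arange(n_obs), choices].sum()

print(log_likelihood(np.zeros(n_feat)))
```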

Philippe Toint (Namur)

April 2009

23 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: discrete choice modelling (3)


[figure: estimated distributions (Normal, Lognormal, Spline) of the value of time — using standard statistics vs. using advanced optimization]

Estimation of the value of time lost in congested traffic

(with and without advanced optimization)

Philippe Toint (Namur)

April 2009

24 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: Poisson image denoising (1)


Consider a two-dimensional image with noise proportional to signal
    $z_{ij} = u_{ij} + n\, f(u_{ij})$
where $n$ is random Gaussian noise. How to recover the original $u_{ij}$?

use the pixel values as much as possible
while minimizing sharp transitions (gradients)

This leads to the optimization problem
    $\min_u\ \sum_{ij}\big(u_{ij} - z_{ij}\log(u_{ij})\big) + \beta\,\|\nabla u\|$   ($\beta > 0$ a regularization weight)
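A small sketch of the reconstructed denoising objective, evaluated on a toy image; the smoothed total-variation term and the weight beta below are standard choices but are assumptions here, not details given on the slide.

```python
# Sketch: Poisson-denoising objective  sum_ij (u_ij - z_ij log u_ij) + beta*TV(u),
# with a smoothed total-variation term (illustrative implementation).
import numpy as np

def objective(u, z, beta=0.1, eps=1e-8):
    fidelity = np.sum(u - z * np.log(u + eps))
    gx = np.diff(u, axis=0, append=u[-1:, :])     # forward differences
    gy = np.diff(u, axis=1, append=u[:, -1:])
    tv = np.sum(np.sqrt(gx**2 + gy**2 + eps))     # smoothed |grad u|
    return fidelity + beta * tv

rng = np.random.default_rng(2)
z = rng.poisson(10.0, size=(32, 32)).astype(float)   # noisy image
print(objective(z.copy(), z))
```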

April 2009

25 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: Poisson image denoising (2)


Some spectacular results: a 512 × 512 picture with 95% noise

Chan and Chen (2007)


Philippe Toint (Namur)

April 2009

26 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: shock simulation in video games


Optimize the realism of the motion of multiple rigid bodies in space
complementarity problem
    $\nabla_q \Phi[q(t)]\, v(t) \geq 0, \qquad \Phi(q(t)) \geq 0$
($q(t)$ = positions, $v(t) = \frac{dq}{dt}(t)$ = velocities)

system of inequalities and equalities


used in realtime for video animation
Anitescu and Potra (1996)
Philippe Toint (Namur)

April 2009

27 / 323

Nonlinear optimization: motivation, past and perspectives

Definition and examples

Applications: finance
1. risk management
2. portfolio analysis
3. FX markets

[figure: investment distribution for the BoJ 1991-2004 (Normal vs. Spline; standard vs. optimized)]

4. ...

Everybody loves an optimizer!

Philippe Toint (Namur)

April 2009

28 / 323

Nonlinear optimization: motivation, past and perspectives

History

Where does optimization come from?


Nous sommes comme des nains juchés sur des épaules de géants, de telle
sorte que nous puissions voir plus de choses et de plus éloignées que n'en
voyaient ces derniers. Et cela, non point parce que notre vue serait
puissante ou notre taille avantageuse, mais parce que nous sommes portés
et exhaussés par la haute stature des géants.
We are like dwarfs standing on the shoulders of giants, such that we can
see more things and further away than they could. And this, not because
our sight would be more powerful or our height more advantageous, but
because we are carried and raised up by the high stature of the giants.
Bernard de Chartres (1130-1160)

Philippe Toint (Namur)

April 2009

29 / 323

Nonlinear optimization: motivation, past and perspectives

Euclid (300 BC)

Philippe Toint (Namur)

History

Al-Khwarizmi (783-850)

April 2009

30 / 323

Nonlinear optimization: motivation, past and perspectives

Isaac Newton (1642-1727)

Philippe Toint (Namur)

History

Leonhard Euler (1707-1783)

April 2009

31 / 323

Nonlinear optimization: motivation, past and perspectives

J. de Lagrange (1735-1813)

Philippe Toint (Namur)

History

Carl Friedrich Gauss (1777-1855)

April 2009

32 / 323

Nonlinear optimization: motivation, past and perspectives

History

Augustin Cauchy (1789-1857) George Dantzig (1914-2005)

Philippe Toint (Namur)

April 2009

33 / 323

Nonlinear optimization: motivation, past and perspectives

Michael Powell

Philippe Toint (Namur)

History

Roger Fletcher

April 2009

34 / 323

Nonlinear optimization: motivation, past and perspectives

Basic concepts

Return to the mathematical problem

    $\min_x f(x)$   such that   $c(x) \geq 0$

Difficulties:
the objective function f(x) is typically complicated (nonlinear)
it is also often costly to compute
there may be many variables
the constraints c(x) may define a complicated geometry

Philippe Toint (Namur)

April 2009

35 / 323

Nonlinear optimization: motivation, past and perspectives

Basic concepts

An example unconstrained problem


minimize : f (, ) = 102 + 10 2 + 4 sin() 2 + 4
3

Two local minima: (2.20, 0.32) and (2.30, 0.34)


How to find them?
Philippe Toint (Namur)

April 2009

36 / 323

Nonlinear optimization: motivation, past and perspectives

Basic concepts

Trust-region methods
iterative algorithms
find local solutions only
Algorithm 1.1: The trust-region framework
Until an (approximate) solution is found:
Step 1: use a model of the nonlinear function(s)
within region where it can be trusted
Step 2: notion of sufficient decrease
Step 3: measure achieved and predicted reductions
Step 4: decrease the region radius if unsuccessful

Philippe Toint (Namur)

April 2009

37 / 323

Nonlinear optimization: motivation, past and perspectives

Illustration

minimize : f (, ) = 102 + 10 2 + 4 sin() 2 + 4

Two local minima: (2.20, 0.32) and (2.30, 0.34)

Philippe Toint (Namur)

April 2009

38 / 323

Nonlinear optimization: motivation, past and perspectives

x0 = (0.71, 3.27)

Illustration

and

f (x0 ) = 97.630

Contours of f

Contours of m0 around x0
(quadratic model)

Philippe Toint (Namur)

April 2009

39 / 323

Nonlinear optimization: motivation, past and perspectives

Illustration

 k | Δk |     sk       | f(xk + sk) | ∆f/∆mk | xk+1
---+----+--------------+------------+--------+---------
 0 |  1 | (0.05, 0.93) |   43.742   | 0.998  | x0 + s0
 1 |  2 | (0.62, 1.78) |    2.306   | 1.354  | x1 + s1
 2 |  4 | (3.21, 0.00) |    6.295   | 0.004  | x2
 3 |  2 | (1.90, 0.08) |   29.392   | 0.649  | x2 + s2
 4 |  2 | (0.32, 0.15) |   31.131   | 0.857  | x3 + s3
 5 |  4 | (0.03, 0.02) |   31.176   | 1.009  | x4 + s4
 6 |  8 | (0.02, 0.00) |   31.179   | 1.013  | x5 + s5

Philippe Toint (Namur)

April 2009

40 / 323

Nonlinear optimization: motivation, past and perspectives

Illustration

Path of iterates:

From another x0 :

Philippe Toint (Namur)

April 2009

41 / 323

Nonlinear optimization: motivation, past and perspectives

Illustration

And then. . .

Does it (always) work?

The answer tomorrow!


(and subsequent days for a (biased) survey of new optimization methods)

Thank you for your attention

Philippe Toint (Namur)

April 2009

42 / 323

Trust region methods for unconstrained problems

Philippe Toint (Namur)

April 2009

43 / 323

Trust region methods for unconstrained problems


Lesson 2:
Trust-region methods
for unconstrained problems

Philippe Toint (Namur)

April 2009

43 / 323

Trust region methods for unconstrained problems

The basic text for this course

A. R. Conn, N. I. M. Gould and Ph. L. Toint,


Trust-Region Methods,
Nr 01 in the MPS-SIAM Series on Optimization,
SIAM, Philadelphia, USA, 2000.
Philippe Toint (Namur)

April 2009

44 / 323

Trust region methods for unconstrained problems

Background material

2.1: Background material

Philippe Toint (Namur)

April 2009

45 / 323

Trust region methods for unconstrained problems

Background material

Scalar mean-value theorems


Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $f : S \to \mathbb{R}$ is
continuously differentiable throughout $S$. Then, if the segment
$x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $f(x + s) = f(x) + \langle \nabla_x f(x + \alpha s), s \rangle$
for some $\alpha \in [0, 1]$.

Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $f : S \to \mathbb{R}$ is twice
continuously differentiable throughout $S$. Then, if the segment
$x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $f(x + s) = f(x) + \langle \nabla_x f(x), s \rangle + \tfrac12 \langle s, \nabla_{xx} f(x + \alpha s)\, s \rangle$
for some $\alpha \in [0, 1]$.
Philippe Toint (Namur)

April 2009

46 / 323

Trust region methods for unconstrained problems

Background material

Vector mean-value theorem

Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $F : S \to \mathbb{R}^m$ is
continuously differentiable throughout $S$. Then, if the segment
$x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $F(x + s) = F(x) + \int_0^1 \nabla_x F(x + \alpha s)\, s \, d\alpha.$

Philippe Toint (Namur)

April 2009

47 / 323

Trust region methods for unconstrained problems

Background material

Taylors scalar approximation theorems (1)

Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $f : S \to \mathbb{R}$ is
continuously differentiable throughout $S$. Suppose further that
$\nabla_x f(x)$ is Lipschitz continuous at $x$, with Lipschitz constant
$\gamma(x)$ in some appropriate vector norm. Then, if the segment
$x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $|f(x + s) - m(x + s)| \leq \tfrac12 \gamma(x) \|s\|^2,$
where
    $m(x + s) = f(x) + \langle \nabla_x f(x), s \rangle.$

Philippe Toint (Namur)

April 2009

48 / 323

Trust region methods for unconstrained problems

Background material

Taylors scalar approximation theorems (2)

Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $f : S \to \mathbb{R}$ is twice
continuously differentiable throughout $S$. Suppose further that
$\nabla_{xx} f(x)$ is Lipschitz continuous at $x$, with Lipschitz constant
$\gamma(x)$ in some appropriate vector norm and its induced matrix
norm. Then, if the segment $x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $|f(x + s) - m(x + s)| \leq \tfrac16 \gamma(x) \|s\|^3,$
where
    $m(x + s) = f(x) + \langle \nabla_x f(x), s \rangle + \tfrac12 \langle s, \nabla_{xx} f(x)\, s \rangle.$

Philippe Toint (Namur)

April 2009

49 / 323

Trust region methods for unconstrained problems

Background material

Taylors vector approximation theorem

Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $F : S \to \mathbb{R}^m$ is
continuously differentiable throughout $S$. Suppose further that
$\nabla_x F(x)$ is Lipschitz continuous at $x$, with Lipschitz constant
$\gamma(x)$ in some appropriate vector norm and its induced matrix
norm. Then, if the segment $x + \alpha s \in S$ for all $\alpha \in [0, 1]$,
    $\|F(x + s) - M(x + s)\| \leq \tfrac12 \gamma(x) \|s\|^2,$
where
    $M(x + s) = F(x) + \nabla_x F(x)\, s.$

Philippe Toint (Namur)

April 2009

50 / 323

Trust region methods for unconstrained problems

Background material

Newtons method

Solve
F (x) = 0
Idea: solve linear approximation
F (x) + J(x)s = 0
quadratic local convergence
. . . but not globally convergent
Yet the basis of everything that follows
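A minimal sketch of the Newton iteration described above, for a small made-up system F(x) = 0 with a hand-coded Jacobian.

```python
# Sketch: Newton's method for F(x) = 0 (toy 2x2 system, hand-coded Jacobian).
import numpy as np

def F(x):
    return np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])

def J(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])

x = np.array([2.0, 0.5])                      # starting point
for _ in range(20):
    s = np.linalg.solve(J(x), -F(x))          # solve the linearized system J(x) s = -F(x)
    x = x + s
    if np.linalg.norm(F(x)) < 1e-12:
        break
print(x)
```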

Philippe Toint (Namur)

April 2009

51 / 323

Trust region methods for unconstrained problems

Background material

Unconstrained optimality conditions


Suppose that $f \in C^1$, and that $x_*$ is a local minimizer of $f(x)$.
Then
    $\nabla_x f(x_*) = 0.$
Suppose that $f \in C^2$, and that $x_*$ is a local minimizer of $f(x)$.
Then the above holds and the objective function's Hessian at
$x_*$ is positive semi-definite, that is
    $\langle s, \nabla_{xx} f(x_*)\, s \rangle \geq 0$ for all $s \in \mathbb{R}^n$.

$\langle s, \nabla_{xx} f(x_*)\, s \rangle > 0$ for all $s \neq 0 \in \mathbb{R}^n$
$\Rightarrow$ strict local solution
Philippe Toint (Namur)

April 2009

52 / 323

Trust region methods for unconstrained problems

Background material

Constrained optimality conditions (1)


    minimize    $f(x)$
    subject to  $c_i(x) = 0$, for $i \in \mathcal{E}$,
    and         $c_i(x) \geq 0$, for $i \in \mathcal{I}$,

Active set:

[figure: the feasible region $\mathcal{C}$ bounded by $c_1(x) = 0$, $c_2(x) = 0$, $c_3(x) = 0$, with faces $\mathcal{F}_{\{1\}}, \mathcal{F}_{\{2\}}, \mathcal{F}_{\{3\}}, \mathcal{F}_{\{1,2\}}, \mathcal{F}_{\{1,3\}}, \mathcal{F}_{\{2,3\}}$ and points $x_1, x_2, x_3$]

    $\mathcal{A}(x_1) = \{1, 2\}$
    $\mathcal{A}(x_2) = \emptyset$
    $\mathcal{A}(x_3) = \{3\}$

Philippe Toint (Namur)
April 2009

53 / 323

Trust region methods for unconstrained problems

Background material

Constrained optimality conditions (2): first order


Ignore constraint qualification!
Suppose that $f, c \in C^1$, and that $x_*$ is a local solution. Then
there exists a vector of Lagrange multipliers $y_*$ such that
    $\nabla_x f(x_*) = \sum_{i \in \mathcal{E} \cup \mathcal{I}} [y_*]_i \nabla_x c_i(x_*)$
    $c_i(x_*) = 0$ for all $i \in \mathcal{E}$
    $c_i(x_*) \geq 0$ and $[y_*]_i \geq 0$ for all $i \in \mathcal{I}$
    and $c_i(x_*)\,[y_*]_i = 0$ for all $i \in \mathcal{I}$.

Lagrangian: $\ell(x, y) = f(x) - \sum_{i \in \mathcal{E} \cup \mathcal{I}} y_i c_i(x)$
Philippe Toint (Namur)

April 2009

54 / 323

Trust region methods for unconstrained problems

Background material

Constrained optimality conditions (3): second order


Suppose that $f, c \in C^2$, and that $x_*$ is a local minimizer of $f(x)$.
Then there exists a vector of Lagrange multipliers $y_*$ such that the first-order conditions hold and
    $\langle s, \nabla_{xx} \ell(x_*, y_*)\, s \rangle \geq 0$ for all $s \in \mathcal{N}_+$
where $\mathcal{N}_+$ is the set of vectors $s$ such that
    $\langle s, \nabla_x c_i(x_*) \rangle = 0$ for all $i \in \mathcal{E} \cup \{ j \in \mathcal{A}(x_*) \cap \mathcal{I} \mid [y_*]_j > 0 \}$
and
    $\langle s, \nabla_x c_i(x_*) \rangle \geq 0$ for all $i \in \{ j \in \mathcal{A}(x_*) \cap \mathcal{I} \mid [y_*]_j = 0 \}$

strict complementarity: $\langle s, \nabla_{xx} \ell(x_*, y_*)\, s \rangle > 0$ for all $s \in \mathcal{N}_+$ ($s \neq 0$)
$\Rightarrow$ strict local solution
Philippe Toint (Namur)

April 2009

55 / 323

Trust region methods for unconstrained problems

Background material

Optimality conditions (convex 1)

Assume now that $\mathcal{C}$ is convex

normal cone of $\mathcal{C}$ at $x \in \mathcal{C}$,
    $\mathcal{N}(x) \stackrel{\rm def}{=} \{ y \in \mathbb{R}^n \mid \langle y, u - x \rangle \leq 0,\ \forall u \in \mathcal{C} \}$

tangent cone of $\mathcal{C}$ at $x \in \mathcal{C}$
    $\mathcal{T}(x) \stackrel{\rm def}{=} \mathcal{N}(x)^0 = {\rm cl}\{ \lambda (u - x) \mid \lambda \geq 0 \text{ and } u \in \mathcal{C} \}$

Philippe Toint (Namur)

April 2009

56 / 323

Trust region methods for unconstrained problems

Background material

Optimality conditions (convex 2)


[figure]
The Moreau decomposition
Philippe Toint (Namur)

April 2009

57 / 323

Trust region methods for unconstrained problems

Background material

Optimality conditions (convex 2)

Suppose that $\mathcal{C} \neq \emptyset$ is convex, closed, that $f$ is continuously
differentiable in $\mathcal{C}$, and that $x_*$ is a first-order critical point for
the minimization of $f$ over $\mathcal{C}$. Then, provided that constraint
qualification holds,
    $-\nabla_x f(x_*) \in \mathcal{N}(x_*).$

Philippe Toint (Namur)

April 2009

58 / 323

Trust region methods for unconstrained problems

Background material

Conjugate gradients
Idea: minimize a convex quadratic on successive nested Krylov subspaces
Algorithm 2.1: Conjugate-gradients (CG)
Given $x_0$, set $g_0 = H x_0 + c$ and let $p_0 = -g_0$.
For $k = 0, 1, \ldots$, until convergence, perform the iteration
    $\alpha_k = \|g_k\|_2^2 / \langle p_k, H p_k \rangle$
    $x_{k+1} = x_k + \alpha_k p_k$
    $g_{k+1} = g_k + \alpha_k H p_k$
    $\beta_k = \|g_{k+1}\|_2^2 / \|g_k\|_2^2$
    $p_{k+1} = -g_{k+1} + \beta_k p_k$
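A direct transcription of Algorithm 2.1 into Python, minimizing the convex quadratic $\tfrac12\langle x, Hx\rangle + \langle c, x\rangle$; the stopping test on $\|g_k\|$ is an assumption, since the slide leaves "until convergence" unspecified.

```python
# Sketch of Algorithm 2.1 (conjugate gradients) for q(x) = 1/2 <x,Hx> + <c,x>.
import numpy as np

def conjugate_gradients(H, c, x0, tol=1e-10, maxit=200):
    x = x0.copy()
    g = H @ x + c                 # g0 = H x0 + c
    p = -g                        # p0 = -g0
    for _ in range(maxit):
        if np.linalg.norm(g) <= tol:
            break
        Hp = H @ p
        alpha = (g @ g) / (p @ Hp)
        x = x + alpha * p
        g_new = g + alpha * Hp
        beta = (g_new @ g_new) / (g @ g)
        p = -g_new + beta * p
        g = g_new
    return x

H = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([-1.0, -2.0])
print(conjugate_gradients(H, c, np.zeros(2)))   # solves H x = -c
```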

Philippe Toint (Namur)

April 2009

59 / 323

Trust region methods for unconstrained problems

Background material

Preconditioning
Idea: change the variables $\bar x = R x$ and define $M = R^T R$.
Algorithm 2.2: Preconditioned CG
Given $x_0$, set $g_0 = H x_0 + c$, and let $v_0 = M^{-1} g_0$ and $p_0 = -v_0$.
For $k = 0, 1, \ldots$, until convergence, perform the iteration
    $\alpha_k = \langle g_k, v_k \rangle / \langle p_k, H p_k \rangle$
    $x_{k+1} = x_k + \alpha_k p_k$
    $g_{k+1} = g_k + \alpha_k H p_k$
    $v_{k+1} = M^{-1} g_{k+1}$
    $\beta_k = \langle g_{k+1}, v_{k+1} \rangle / \langle g_k, v_k \rangle$
    $p_{k+1} = -v_{k+1} + \beta_k p_k$

Philippe Toint (Namur)

April 2009

60 / 323

Trust region methods for unconstrained problems

Background material

Lanczos method
Idea: compute an orthonormal basis of the successive nested Krylov
subspaces
$\Rightarrow$ makes $Q_k^T H Q_k$ tridiagonal
Algorithm 2.3: Lanczos
Given $r_0$, set $y_0 = r_0$, $q_{-1} = 0$.
For $k = 0, 1, \ldots$, perform the iteration
    $\gamma_k = \|y_k\|_2$
    $q_k = y_k / \gamma_k$
    $\delta_k = \langle q_k, H q_k \rangle$
    $y_{k+1} = H q_k - \delta_k q_k - \gamma_k q_{k-1}$
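A sketch of Algorithm 2.3; it returns the orthonormal basis Q and the tridiagonal matrix T = QᵀHQ. The symbols gamma and delta follow the reconstruction above and are assumptions about the Greek letters lost in extraction.

```python
# Sketch of Algorithm 2.3 (Lanczos): build Q with orthonormal columns spanning
# the Krylov subspace of (H, r0), so that Q^T H Q is tridiagonal.
import numpy as np

def lanczos(H, r0, m):
    n = r0.size
    Q = np.zeros((n, m))
    T = np.zeros((m, m))
    y = r0.copy()
    q_prev = np.zeros(n)
    gamma = np.linalg.norm(y)
    for k in range(m):
        q = y / gamma                       # q_k = y_k / gamma_k
        delta = q @ (H @ q)                 # delta_k = <q_k, H q_k>
        y = H @ q - delta * q - gamma * q_prev
        Q[:, k] = q
        T[k, k] = delta
        if k + 1 < m:
            T[k, k + 1] = T[k + 1, k] = np.linalg.norm(y)
        q_prev = q
        gamma = np.linalg.norm(y)
        if gamma < 1e-14:                   # Krylov space exhausted
            return Q[:, :k + 1], T[:k + 1, :k + 1]
    return Q, T

H = np.diag([1.0, 2.0, 3.0, 4.0])
Q, T = lanczos(H, np.ones(4), 3)
print(np.round(Q.T @ H @ Q - T, 10))        # should be (numerically) zero
```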

Philippe Toint (Namur)

April 2009

61 / 323

Trust region methods for unconstrained problems

Background material

Another view on the Conjugate-Gradients method

Conjugate Gradients = Lanczos + $LDL^T$ (Cholesky)

Lanczos + Cholesky
$\Rightarrow$ conjugate gradients in one of the Krylov subspaces

Philippe Toint (Namur)

April 2009

62 / 323

Trust region methods for unconstrained problems

The Trust-region algorithm

2.2: The trust-region algorithm

Philippe Toint (Namur)

April 2009

63 / 323

Trust region methods for unconstrained problems

The Trust-region algorithm

The trust-region idea

use a model of the objective function
define a trust-region where it is thought adequate
    $\mathcal{B}_k = \{ x \in \mathbb{R}^n \mid \|x - x_k\|_k \leq \Delta_k \}$
find a trial point by sufficiently decreasing the model in $\mathcal{B}_k$
compute the objective function at the trial point
compare achieved vs. predicted reductions
reduce $\Delta_k$ if unsatisfactory

Philippe Toint (Namur)

April 2009

64 / 323

Trust region methods for unconstrained problems

The Trust-region algorithm

The basic trust-region algorithm


Algorithm 2.4: Basic trust-region algorithm (BTR)
Step 0: Initialization. $x_0$ and $\Delta_0$ given, compute $f(x_0)$ and set $k = 0$.
Step 1: Model definition. Choose $\|\cdot\|_k$ and define a model $m_k$ in $\mathcal{B}_k$.
Step 2: Step calculation. Compute $s_k$ that sufficiently reduces the
model $m_k$ with $x_k + s_k \in \mathcal{B}_k$.
Step 3: Acceptance of the trial point. Compute $f(x_k + s_k)$ and define
    $\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)}.$
If $\rho_k \geq \eta_1$, then define $x_{k+1} = x_k + s_k$; otherwise define $x_{k+1} = x_k$.

Step 4: Trust-region radius update.
    $\Delta_{k+1} \in \begin{cases} [\Delta_k, \infty) & \text{if } \rho_k \geq \eta_2, \\ [\gamma_2 \Delta_k, \Delta_k] & \text{if } \rho_k \in [\eta_1, \eta_2), \\ [\gamma_1 \Delta_k, \gamma_2 \Delta_k] & \text{if } \rho_k < \eta_1. \end{cases}$

Increment $k$ by one and go to Step 1.
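A compact sketch of Algorithm 2.4 with a quadratic model and a Cauchy-point step; the constants eta1 = 0.01, eta2 = 0.9 and the simple radius updates are illustrative choices, since the algorithm itself only prescribes the intervals.

```python
# Sketch of Algorithm 2.4 (basic trust region, BTR) with a quadratic model and
# a Cauchy-point step; the specific constants below are illustrative choices.
import numpy as np

def cauchy_step(g, H, Delta):
    # minimize the quadratic model along -g inside the ball of radius Delta
    gHg = g @ (H @ g)
    t = Delta / np.linalg.norm(g)
    if gHg > 0:
        t = min(t, (g @ g) / gHg)
    return -t * g

def btr(f, grad, hess, x0, Delta0=1.0, tol=1e-8, maxit=500, eta1=0.01, eta2=0.9):
    x, Delta = x0.copy(), Delta0
    for _ in range(maxit):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        s = cauchy_step(g, H, Delta)                    # Step 2
        pred = -(g @ s + 0.5 * s @ (H @ s))             # predicted model decrease
        rho = (f(x) - f(x + s)) / pred                  # Step 3
        if rho >= eta1:
            x = x + s                                   # accept the trial point
        if rho >= eta2:
            Delta *= 2.0                                # very successful
        elif rho < eta1:
            Delta *= 0.5                                # unsuccessful
    return x

# illustrative use on a convex quadratic
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ (A @ x)
print(btr(f, lambda x: A @ x, lambda x: A, np.array([2.0, 1.0])))
```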


Philippe Toint (Namur)

April 2009

65 / 323

Trust region methods for unconstrained problems

Basic convergence theory

2.3: Basic convergence theory

Philippe Toint (Namur)

April 2009

66 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Assumptions
$f \in C^2$
$f(x) \geq \kappa_{\rm lbf}$
$\|\nabla_{xx} f(x)\| \leq \kappa_{\rm ufh}$
$m_k \in C^2(\mathcal{B}_k)$
$m_k(x_k) = f(x_k)$
$g_k \stackrel{\rm def}{=} \nabla_x m_k(x_k) = \nabla_x f(x_k)$
$\|\nabla_{xx} m_k(x)\| \leq \kappa_{\rm umh} - 1$ for all $x \in \mathcal{B}_k$
$\frac{1}{\kappa_{\rm une}} \|x\|_k \leq \|x\| \leq \kappa_{\rm une} \|x\|_k$

. . . but use $\|\cdot\|_k = \|\cdot\|_2$ in what follows!


Philippe Toint (Namur)

April 2009

67 / 323

Trust region methods for unconstrained problems

Basic convergence theory

The Cauchy step


Idea: minimize mk on the Cauchy arc
    $x_k^C(t) \stackrel{\rm def}{=} \{ x \mid x = x_k - t g_k,\ t \geq 0 \text{ and } x \in \mathcal{B}_k \}.$

[figure: the model contours, the Cauchy arc and the Cauchy point]


Philippe Toint (Namur)

April 2009

68 / 323

Trust region methods for unconstrained problems

Basic convergence theory

The Cauchy point for quadratic models


Three cases when minimizing the quadratic $m_k$ along the Cauchy arc:

[figures: the three cases]

    $m_k(x_k) - m_k(x_k^C) \geq \tfrac12 \|g_k\| \min\left[ \frac{\|g_k\|}{\beta_k},\ \Delta_k \right]$

Philippe Toint (Namur)

April 2009

69 / 323

Trust region methods for unconstrained problems

Basic convergence theory

The Cauchy point for general models


Three cases when minimizing the general $m_k$ along the Cauchy arc:

[figures: the three cases]

    $m_k(x_k) - m_k(x_k^{AC}) \geq \kappa_{\rm dcp} \|g_k\| \min\left[ \frac{\|g_k\|}{\beta_k},\ \Delta_k \right]$

Philippe Toint (Namur)

April 2009

70 / 323

Trust region methods for unconstrained problems

Basic convergence theory

The meaning of sufficient decrease


In both cases, we get:
Sufficient decrease condition:

    $m_k(x_k) - m_k(x_k + s_k) \geq \kappa_{\rm mdc} \|g_k\| \min\left[ \frac{\|g_k\|}{\beta_k},\ \Delta_k \right]$

Immediate consequence:
Suppose that $\nabla_x f(x_k) \neq 0$. Then $m_k(x_k + s_k) < m_k(x_k)$ and
$s_k \neq 0$.
$\Rightarrow$ $\rho_k$ is well defined!
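A quick numerical illustration (with an arbitrary quadratic model) that the Cauchy point achieves a decrease of this type; the constant 1/2 and the choice beta = 1 + ||H|| used in the check follow the usual quadratic-model bound and are assumptions about the exact constants on the slide.

```python
# Sketch: check a sufficient-decrease bound at the Cauchy point for an
# arbitrary quadratic model m(s) = <g,s> + 1/2 <s,Hs> inside ||s|| <= Delta.
import numpy as np

rng = np.random.default_rng(3)
n, Delta = 5, 0.7
g = rng.standard_normal(n)
H = rng.standard_normal((n, n)); H = 0.5 * (H + H.T)   # symmetric, maybe indefinite

m = lambda s: g @ s + 0.5 * s @ (H @ s)

# Cauchy point: minimize m along -g, restricted to the trust region (grid search)
ts = np.linspace(1e-6, Delta / np.linalg.norm(g), 1000)
achieved = max(-m(-t * g) for t in ts)

beta = 1.0 + np.linalg.norm(H, 2)
bound = 0.5 * np.linalg.norm(g) * min(np.linalg.norm(g) / beta, Delta)
print(achieved >= bound, achieved, bound)
```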

Philippe Toint (Namur)

April 2009

71 / 323

Trust region methods for unconstrained problems

Basic convergence theory

The exact minimizer is OK


Suppose that, for all k, sk ensures that
    $m_k(x_k) - m_k(x_k + s_k) \geq \kappa_{\rm amm}\,[\, m_k(x_k) - m_k(x_k^M)\,],$
Then sufficient decrease is obtained.

[figure: contours of the model and the model minimizer $x_k^M$]

Philippe Toint (Namur)

April 2009

72 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Taylor and minimum radius


For all $k$,
    $|f(x_k + s_k) - m_k(x_k + s_k)| \leq \kappa_{\rm ubh} \Delta_k^2,$

Suppose that $g_k \neq 0$ and that
    $\Delta_k \leq \frac{\kappa_{\rm mdc} \|g_k\| (1 - \eta_2)}{\kappa_{\rm ubh}}.$
Then iteration $k$ is very successful and $\Delta_{k+1} \geq \Delta_k$.

Suppose that $\|g_k\| \geq \kappa_{\rm lbg} > 0$ for all $k$. Then there is a constant
$\kappa_{\rm lbd} > 0$ such that, for all $k$,
    $\Delta_k \geq \kappa_{\rm lbd}.$
Philippe Toint (Namur)

April 2009

73 / 323

Trust region methods for unconstrained problems

Basic convergence theory

First-order convergence (1)

Suppose that there are only finitely many successful iterations.
Then $x_k = x_*$ for all sufficiently large $k$ and $x_*$ is first-order
critical.

Suppose that there are infinitely many successful iterations.
Then
    $\liminf_{k \to \infty} \|\nabla_x f(x_k)\| = 0.$

idea: infinite descent if not critical

Philippe Toint (Namur)

April 2009

74 / 323

Trust region methods for unconstrained problems

Basic convergence theory

First-order convergence (2)


Suppose that there are infinitely many successful iterations.
Then $\lim_{k \to \infty} \|\nabla_x f(x_k)\| = 0.$

[figure: $\|g_k\|$ versus $k$, with the subsequences $\{t_i\}$ and $\{\ell_i\}$ used in the proof]

Philippe Toint (Namur)
April 2009

75 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Convex models (1)


Suppose that $\lambda_{\min}[\nabla_{xx} m_k(x)] \geq \epsilon$ for all $x \in [x_k, x_k + s_k]$ and
for some $\epsilon > 0$. Then
    $\|s_k\| \leq \frac{2}{\epsilon} \|g_k\|.$
idea: $m_k$ curves upwards!

Suppose that $\{x_{k_i}\} \to x_*$ and $x_*$ is first-order critical, and that
there is a constant $\kappa_{\rm smh} > 0$ such that
    $\min_{x \in \mathcal{B}_k} \lambda_{\min}[\nabla_{xx} m_k(x)] \geq \kappa_{\rm smh}$
whenever $x_k$ is sufficiently close to $x_*$. Suppose finally that
$\nabla_{xx} f(x_*)$ is nonsingular. Then the complete sequence of iterates $\{x_k\}$ converges to $x_*$.
idea: steps too short to escape local basin
Philippe Toint (Namur)

April 2009

76 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Convex models (2)

But. . .
[figure]

Philippe Toint (Namur)

April 2009

77 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Asymptotically exact Hessians


Assume also that
    $\lim_{k\to\infty} \|\nabla_{xx} f(x_k) - \nabla_{xx} m_k(x_k)\| = 0$ whenever $\lim_{k\to\infty} \|g_k\| = 0$

Suppose that $\{x_{k_i}\} \to x_*$ and $x_*$ is first-order critical, that
$s_k \neq 0$ for all $k$ sufficiently large, and that $\nabla_{xx} f(x_*)$ is positive
definite. Then the complete sequence of iterates $\{x_k\}$ converges to $x_*$, all iterations are eventually very successful and
the trust-region radius $\Delta_k$ is bounded away from zero.
idea: sufficient decrease implies that
    $m_k(x_k) - m_k(x_k + s_k) \geq \kappa_{\rm mqd} \|s_k\|^2 > 0.$
Then $\rho_k \to 1$.
Philippe Toint (Namur)

April 2009

78 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Second order: the eigen point


Assume $0 > \lambda_k := \lambda_{\min}(H_k)$.
Then define the eigen direction $u_k$ such that
    $\langle u_k, g_k \rangle \leq 0$, $\|u_k\|_k = \Delta_k$, $\langle u_k, H_k u_k \rangle \leq \kappa_{\rm snc}\, \lambda_k \Delta_k^2$,
Minimize the model along $u_k$ to compute the eigen point:
    $m_k(x_k^E) = m_k(x_k + t_k^E u_k) = \min_{t \in (0,1]} m_k(x_k + t u_k)$

[figure: the eigen point]

Philippe Toint (Namur)

April 2009

79 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Model decrease at the eigen point


Suppose: $0 > \lambda_k(H_k)$, $u_k$ is an eigen direction and
    $\|\nabla_{xx} m_k(x) - \nabla_{xx} m_k(y)\| \leq \kappa_{\rm lch} \|x - y\|$
for all $x, y \in \mathcal{B}_k$. Then
    $m_k(x_k) - m_k(x_k^E) \geq \kappa_{\rm sod}\, |\lambda_k| \min[\lambda_k^2, \Delta_k^2].$
(quadratic or general model)

[figures: the model decrease along the eigen direction]

Philippe Toint (Namur)

April 2009

80 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Second order: convergence theorems


    $\limsup_{k\to\infty} \lambda_{\min}[\nabla_{xx} f(x_k)] \geq 0.$

Suppose that $x_*$ is an isolated limit point of the sequence of
iterates $\{x_k\}$. Then $x_*$ is a second-order critical point.

Assume also that, for $\gamma_3 > 1$,
    $\rho_k \geq \eta_2$ and $\Delta_k \leq \Delta_{\max}$ imply $\Delta_{k+1} \in [\gamma_3 \Delta_k, \gamma_4 \Delta_k]$
Let $x_*$ be any limit point of the sequence of iterates. Then $x_*$
is a second-order critical point.

Philippe Toint (Namur)

April 2009

81 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Different trust-region norms

[figures: the trust region $\|s\|_k \leq \Delta_k$ in different norms]

Philippe Toint (Namur)

April 2009

82 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Using norms for scaling


Idea: change the variables
    $S_k w = s$
Then
    $m_k^S(x_k + w) \approx f(x_k + S_k w) \stackrel{\rm def}{=} f^S(w),$
    $\mathcal{B}_k^S = \{ x_k + w \mid \|w\| \leq \Delta_k \},$
    $m_k^S(x_k) = f(x_k),$
    $g_k^S = \nabla_w f^S(0) = S_k^T \nabla_x f(x_k),$
    $H_k^S \approx \nabla_{ww} f^S(0) = S_k^T \nabla_{xx} f(x_k) S_k.$

Thus
    $m_k^S(x_k + w) = f(x_k) + \langle g_k^S, w \rangle + \tfrac12 \langle w, H_k^S w \rangle$
                   $= f(x_k) + \langle S_k^T \nabla_x f(x_k), w \rangle + \tfrac12 \langle w, S_k^T H_k S_k w \rangle$
                   $= f(x_k) + \langle \nabla_x f(x_k), S_k w \rangle + \tfrac12 \langle S_k w, H_k S_k w \rangle$
                   $= f(x_k) + \langle \nabla_x f(x_k), s \rangle + \tfrac12 \langle s, H_k s \rangle$
                   $= m_k(x_k + s)$

Philippe Toint (Namur)
April 2009

83 / 323

Trust region methods for unconstrained problems

Basic convergence theory

Scaling: the geometry

[figures: the trust region before and after scaling]

Philippe Toint (Namur)

April 2009

84 / 323

Trust region methods for unconstrained problems

Solving the subproblem

2.4: Solving the subproblem

Philippe Toint (Namur)

April 2009

85 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The subproblem

Assume
Euclidean norm
quadratic model (possibly non-convex)
(drop the index k)

    $\min_{s \in \mathbb{R}^n}\ q(s) \equiv \langle g, s \rangle + \tfrac12 \langle s, H s \rangle \quad\text{subject to}\quad \|s\|_2 \leq \Delta$

Philippe Toint (Namur)

April 2009

86 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Possible approaches

exact minimization
truncated conjugate-gradients
CG + Lanczos (GLTR)
doglegs
eigenvalue based methods
(projection methods)

Philippe Toint (Namur)

April 2009

87 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The exact minimizer

Any global minimizer of $q(s)$ subject to $\|s\|_2 \leq \Delta$ satisfies the
equation
    $H(\lambda^M)\, s^M = -g,$
where $H(\lambda^M) \stackrel{\rm def}{=} H + \lambda^M I$ is positive semi-definite,
$\lambda^M \geq 0$ and
    $\lambda^M (\|s^M\|_2 - \Delta) = 0.$
If $H(\lambda^M)$ is positive definite, $s^M$ is unique.

Note: $\lambda^M$ is the Lagrange multiplier

Philippe Toint (Namur)

April 2009

88 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The exact minimizer: a geometrical view


[figures: contours of $q(s)$ and the exact minimizer $s^M$ for several radii $\Delta$]

Philippe Toint (Namur)

April 2009

89 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Finding the exact minimizer


Eigenvalue decomposition of $H$:
    $H = U^T \Lambda U$
where $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$. Characterization implies that
    $\lambda^M \geq -\lambda_1$
Suppose that $\lambda > -\lambda_1$ and define
    $s(\lambda) = -H(\lambda)^{-1} g = -U^T (\Lambda + \lambda I)^{-1} U g$
New formulation (one dimensional): $\|s(\lambda)\|_2 = \Delta$
    $\|s(\lambda)\|_2^2 = \|U^T (\Lambda + \lambda I)^{-1} U g\|_2^2 = \|(\Lambda + \lambda I)^{-1} U g\|_2^2 = \sum_{i=1}^{n} \frac{\gamma_i^2}{(\lambda_i + \lambda)^2}$
where $\gamma_i = [U g]_i$.

Philippe Toint (Namur)

April 2009

90 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The convex case


[figure: $\|s(\lambda)\|_2$ as a function of $\lambda$ — the solution curve, which meets the level $\Delta$ at $\lambda^M$]

Philippe Toint (Namur)

April 2009

91 / 323

Trust region methods for unconstrained problems

Solving the subproblem

A nonconvex case
[figure: $\|s(\lambda)\|_2$ as a function of $\lambda$ in a nonconvex case, with poles at the $-\lambda_i$]

Philippe Toint (Namur)

April 2009

92 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The hard case: $\gamma_1 = 0$

[figure: $\|s(\lambda)\|_2$ in the hard case — no root larger than 2; root near 2.2]

Philippe Toint (Namur)

April 2009

93 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Near the hard case: $\gamma_1 \approx 0$

[figure: $\|s(\lambda)\|_2$ for $\gamma_1 = 1,\ 0.1,\ 0.01,\ 0.00001$]

Philippe Toint (Namur)

April 2009

94 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The secular equation


Idea: consider the secular equation
    $\phi(\lambda) \stackrel{\rm def}{=} \frac{1}{\|s(\lambda)\|_2} - \frac{1}{\Delta} = 0$
Then

[figure: $1/\|s(\lambda)\|_2$ for $\gamma_1 = 1,\ 0.1,\ 0.01,\ 0.00001$ — nearly linear in $\lambda$]

$\Rightarrow$ apply Newton's method to $\phi(\lambda) = 0$ :  $\lambda^+ = \lambda - \phi(\lambda)/\phi'(\lambda)$

Philippe Toint (Namur)

April 2009

95 / 323

Trust region methods for unconstrained problems

Solving the subproblem

The derivatives of ()

Suppose $g \neq 0$. Then
$\phi(\lambda)$ is strictly increasing ($\lambda > -\lambda_1$), and concave.
    $\phi'(\lambda) = -\frac{\langle s(\lambda), \nabla_\lambda s(\lambda) \rangle}{\|s(\lambda)\|_2^3}$
where
    $\nabla_\lambda s(\lambda) = -H(\lambda)^{-1} s(\lambda).$
Note: if $H(\lambda) = L L^T$ and $L w = s(\lambda)$, then
    $\langle s(\lambda), \nabla_\lambda s(\lambda) \rangle = -\langle s(\lambda), L^{-T} L^{-1} s(\lambda) \rangle = -\|w\|^2$

Philippe Toint (Namur)

April 2009

96 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Newtons method on the secular equation


Algorithm 2.5: Exact trust-region solver
Let $\lambda > -\lambda_1$ and $\Delta > 0$ be given.
1. Factorize $H(\lambda) = L L^T$.
2. Solve $L L^T s = -g$.
3. Solve $L w = s$.
4. Replace $\lambda$ by
    $\lambda + \left( \frac{\|s\|_2 - \Delta}{\Delta} \right) \frac{\|s\|_2^2}{\|w\|_2^2}.$

But . . . more complications due to
bracketing the root (initial + update)
termination rule
may be preconditioned
Moré (1978), Moré-Sorensen (1983), Dollar-Gould-Robinson (2009)
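A sketch of Algorithm 2.5 for the easy (boundary-root) case: Newton's method on the secular equation using Cholesky factorizations. Safeguarding, bracketing and the hard case are deliberately omitted, and the starting lambda below is an arbitrary choice.

```python
# Sketch of Algorithm 2.5: Newton's method on the secular equation
#   phi(lambda) = 1/||s(lambda)|| - 1/Delta,  with  (H + lambda I) s = -g.
# Easy case only: no bracketing, no hard-case handling.
import numpy as np
from scipy.linalg import cho_factor, cho_solve, solve_triangular

def exact_tr_step(H, g, Delta, lam=0.0, maxit=50, tol=1e-10):
    n = H.shape[0]
    s = np.zeros(n)
    for _ in range(maxit):
        cf = cho_factor(H + lam * np.eye(n), lower=True)   # H(lambda) = L L^T
        s = cho_solve(cf, -g)                              # solve L L^T s = -g
        ns = np.linalg.norm(s)
        if ns <= Delta and lam == 0.0:
            break                                          # interior solution
        if abs(ns - Delta) <= tol * Delta:
            break                                          # boundary solution found
        w = solve_triangular(cf[0], s, lower=True)         # solve L w = s
        lam += ((ns - Delta) / Delta) * (ns**2 / (w @ w))  # secular Newton step
        lam = max(lam, 0.0)                                # keep lambda >= 0
    return s, lam

H = np.array([[2.0, 0.0], [0.0, 5.0]])
g = np.array([1.0, 1.0])
print(exact_tr_step(H, g, Delta=0.3))
```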
Philippe Toint (Namur)

April 2009

97 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Approximate solution by truncated CG

Fact: CG never re-enters the $\ell_2$ trust region (so it can be truncated at the boundary)

[figures: truncated-CG paths and the trust-region boundary]

May be preconditioned
Steihaug (1983), T. (1981)
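A sketch of the Steihaug-Toint truncated conjugate-gradient solver described here: run CG on the model and stop at the trust-region boundary or at negative curvature. The boundary-intersection formula is standard, and the tolerances are arbitrary choices.

```python
# Sketch: Steihaug-Toint truncated CG for  min <g,s> + 1/2 <s,Hs>,  ||s|| <= Delta.
import numpy as np

def to_boundary(s, p, Delta):
    # largest tau >= 0 with ||s + tau p|| = Delta
    a, b, c = p @ p, 2.0 * s @ p, s @ s - Delta**2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def truncated_cg(H, g, Delta, tol=1e-10, maxit=100):
    s = np.zeros_like(g)
    r, p = g.copy(), -g.copy()
    for _ in range(maxit):
        if np.linalg.norm(r) <= tol:
            return s
        Hp = H @ p
        kappa = p @ Hp
        if kappa <= 0.0:                             # negative curvature: go to the boundary
            return s + to_boundary(s, p, Delta) * p
        alpha = (r @ r) / kappa
        if np.linalg.norm(s + alpha * p) >= Delta:   # step would leave the region
            return s + to_boundary(s, p, Delta) * p
        s = s + alpha * p
        r_new = r + alpha * Hp
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return s

H = np.array([[1.0, 0.0], [0.0, -2.0]])              # indefinite Hessian
g = np.array([1.0, 0.5])
print(truncated_cg(H, g, Delta=1.0))
```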

Philippe Toint (Namur)

April 2009

98 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Approximate solution by the GLTR


The Steihaug-Toint method might hit the boundary already at the steepest-descent step ⇒ sometimes slow

Idea: solve the subproblem on the nested Krylov subspaces


Algorithm 2.6: Two-phase GLTR algorithm
as long as interior: conjugate-gradients
on the boundary: Lanczos method + subproblem solution in
Krylov space

(smooth transition)
Gould-Lucidi-Roma-T. (1999)
Philippe Toint (Namur)

April 2009

99 / 323

Trust region methods for unconstrained problems

Solving the subproblem

Doglegs
Idea: use steepest descent and the full Newton step (requires convexity?)

[figure: the dogleg curve through $s^C$ and $s^D$ towards $s^N$, the double-dogleg curve through $s^{DD}$, and the trust-region boundary]

Powell (1970), Dennis-Mei (1979)
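A sketch of the (single) dogleg step for a convex quadratic model: follow the steepest-descent segment to the Cauchy point, then the segment towards the Newton step, truncated at the trust-region boundary. The boundary-intersection helper mirrors the one used for truncated CG.

```python
# Sketch: dogleg step for min <g,s> + 1/2 <s,Hs>, ||s|| <= Delta, H positive definite.
import numpy as np

def dogleg(H, g, Delta):
    sN = np.linalg.solve(H, -g)                  # full Newton step
    if np.linalg.norm(sN) <= Delta:
        return sN
    sC = -((g @ g) / (g @ (H @ g))) * g          # unconstrained Cauchy point
    if np.linalg.norm(sC) >= Delta:
        return -(Delta / np.linalg.norm(g)) * g  # truncated steepest descent
    # intersection of the segment sC -> sN with the boundary ||s|| = Delta
    d = sN - sC
    a, b, c = d @ d, 2.0 * sC @ d, sC @ sC - Delta**2
    tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return sC + tau * d

H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, -2.0])
for Delta in (0.1, 0.5, 2.0):
    s = dogleg(H, g, Delta)
    print(Delta, np.round(s, 4), round(np.linalg.norm(s), 4))
```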


Philippe Toint (Namur)

April 2009

100 / 323

Trust region methods for unconstrained problems

Solving the subproblem

An eigenvalue approach
Rewrite
    $(H + \lambda M)\, s = -g$
as
    $\begin{pmatrix} H & g \end{pmatrix} \begin{pmatrix} s \\ 1 \end{pmatrix} = -\lambda M s$
or (introducing the parameter $\alpha$)
    $\begin{pmatrix} H & g \\ g^T & \alpha \end{pmatrix} \begin{pmatrix} s \\ 1 \end{pmatrix} = -\lambda(\alpha) \begin{pmatrix} M & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} s \\ 1 \end{pmatrix}$

choose $\alpha$ such that
    $\lambda \geq 0$, $H + \lambda M$ positive semi-definite, $\lambda (\|s\|_M - \Delta) = 0$

Rendl-Wolkowicz (1997), Rojas-Santos-Sorensen (1999)

Philippe Toint (Namur)


April 2009

101 / 323

Trust region methods for unconstrained problems

Bibliography

Bibliography for lesson 2 (1)


J. E. Dennis and H. H. W. Mei,
Two New Unconstrained Optimization Algorithms Which Use Function and Gradient Values,
Journal of Optimization Theory and Applications, 28(4):453-482, 1979.
H. S. Dollar, N. I. M. Gould and D. P. Robinson,
On solving trust-region and other regularised subproblems in optimization,
Rutherford-Appleton Laboratory, Chilton, UK, Report RAL-TR-2009-003, 2009.
S. M. Goldfeld, R. E. Quandt and H. F. Trotter,
Maximization by quadratic hill-climbing,
Econometrica, 34:541-551, 1966.
N. I. M. Gould, S. Lucidi, M. Roma and Ph. L. Toint,
Solving the trust-region subproblem using the Lanczos method,
SIAM Journal on Optimization, 9(2):504-525, 1999.
K. Levenberg,
A Method For The Solution Of Certain Problems In Least Squares,
Quarterly of Applied Mathematics, 2:164-168, 1944.
D. Marquardt,
An Algorithm For Least-Squares Estimation Of Nonlinear Parameters,
SIAM Journal on Applied Mathematics, 11:431-441, 1963.
J. J. Moré,
The Levenberg-Marquardt algorithm: implementation and theory,
Numerical Analysis, Dundee 1977 (A. Watson, ed.), Springer Verlag, Heidelberg, 1978.
J. J. Moré,
Recent developments in algorithms and software for trust region methods,
in Mathematical Programming: The State of the Art (A. Bachem, M. Grötschel and B. Korte, eds.), Springer Verlag,
Heidelberg, pp. 258-287, 1983.
J. J. Moré and D. C. Sorensen,
Computing A Trust Region Step,
SIAM Journal on Scientific and Statistical Computing, 4(3):553-572, 1983.
Philippe Toint (Namur)

April 2009

102 / 323

Trust region methods for unconstrained problems

Bibliography

Bibliography for lesson 2 (2)

D. D. Morrison,
Methods for nonlinear least squares problems and convergence proofs,
Proceedings of the Seminar on Tracking Programs and Orbit Determination, (J. Lorell and F. Yagi, eds.), Jet Propulsion
Laboratory, Pasadena, USA, pp. 1-9, 1960.
M. J. D. Powell,
A New Algorithm for Unconstrained Optimization,
in Nonlinear Programming (J. B. Rosen, O. L. Mangasarian and K. Ritter, eds.), Academic Press, London, pp. 31-65,
1970.
F. Rendl and H. Wolkowicz,
A Semidefinite Framework for Trust Region Subproblems with Applications to Large Scale Minimization,
Mathematical Programming, 77(2):273-299, 1997.
M. Rojas, S. A. Santos and D. C. Sorensen,
A new matrix-free algorithm for the large-scale trust-region subproblem,
CAAM, Rice University, TR99-19, 1999.
D. Winfield,
Function and functional optimization by interpolation in data tables,
Ph.D. Thesis, Harvard University, Cambridge, USA, 1969.

Philippe Toint (Namur)

April 2009

103 / 323

Derivative free optimization, filters and other topics

Lesson 3:
Derivative-free optimization,
infinite dimensions and filters

Philippe Toint (Namur)

April 2009

104 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

3.1: Derivative-free optimization

Philippe Toint (Namur)

April 2009

105 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

An application of trust-regions: unconstrained DFO


Consider the unconstrained problem
min f (x)
x

Gradient (and Hessian) of f (x) unavailable


physical measurement
object code
typically small-scale (but not always. . . )
Derivative free optimization (DFO)
f (x) typically very costly
Exploit each evaluation of f (x) to the utmost possible
considerable interest of practitioners
Philippe Toint (Namur)

April 2009

106 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Interpolation methods for DFO


Idea: Winfield (1973), Powell (1994)
Until convergence:
Use the available function values to build a polynomial
interpolation model mk :
    $m_k(y_i) = f(y_i)$   for all $y_i \in Y$;

Minimize the model in a trust region, yielding a new


potentially good point;
Compute a new function value.
Y = interpolation set { points yi at which f (yi ) is known }
Philippe Toint (Namur)

April 2009

107 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

A naive trust-region method for DFO: illustration

[figures: four snapshots of the naive interpolation-based trust-region method on a two-dimensional example]

Philippe Toint (Namur)

April 2009

108 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Interpolation methods for DFO (2)

To be considered:
poisedness of the interpolation set Y
choice of models (linear, quadratic, in between, beyond)
convergence theory
numerical performance

Philippe Toint (Namur)

April 2009

109 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Poisedness

Assume a quadratic model


    $m_k(x_k + s) = f_k + \langle g_k, s \rangle + \tfrac12 \langle s, H_k s \rangle$
Thus
    $p = 1 + n + \tfrac12 n(n+1) = \tfrac12 (n+1)(n+2)$
parameters to determine $\Rightarrow$ need $p$ function values ($|Y| = p$)
Not sufficient!
need geometric conditions for the points in Y . . .

Philippe Toint (Namur)

April 2009

110 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Poisedness: geometry with n = 2, p = 6

[figures: six data points in $\mathbb{R}^3$ and several quadratics interpolating them]
With these 6 data points . . . is this the correct interpolation? . . . or this? . . . or this?
The difference between two such interpolants is zero on a quadratic curve containing $Y$!

Philippe Toint (Namur)

April 2009

111 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Poisedness: geometry (2)


If $\{\phi_i(\cdot)\}_{i=1}^{p}$ = basis for quadratic polynomials
    $\sum_{i=1}^{p} \alpha_i \phi_i(y_j) = f(y_j), \qquad j = 1, \ldots, p$

Possible poisedness measure:
    $\delta(Y) = \det \begin{pmatrix} \phi_1(y_1) & \cdots & \phi_p(y_1) \\ \vdots & & \vdots \\ \phi_1(y_p) & \cdots & \phi_p(y_p) \end{pmatrix}$

$Y$ (well) poised $\Leftrightarrow$ $|\delta(Y)| \geq \kappa$
$\Rightarrow$ scale for the spread of the $y_i$'s
$\Rightarrow$ notion of geometry improvement
Philippe Toint (Namur)

April 2009

112 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Lagrange polynomials
Remarkable: replace $\bar y$ by $y_+$ in $Y$:
    $\frac{\delta(Y_+)}{\delta(Y)} = L(y_+, \bar y)$ is independent of the basis $\{\phi_i(\cdot)\}_{i=1}^{p}$
where, for $y \in Y$,
    $L(y, \bar y) = \begin{cases} 1 & \text{if } y = \bar y \\ 0 & \text{if } y \neq \bar y \end{cases}$
is the Lagrange fundamental polynomial

Note: for quadratic interpolation, $L(\cdot, \bar y)$ is a quadratic polynomial!
Powell (1994)
Philippe Toint (Namur)

April 2009

113 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Interpolation using Lagrange polynomials


Idea: use the Lagrange polynomials to define the (quadratic) interpolant
by
    $m_k(x_k + s) = \sum_{y \in Y_k} f(y)\, L_k(x_k + s, y)$

And then. . .
    $|f(x_k + s) - m_k(x_k + s)| \leq \kappa \sum_{y \in Y_k} \|x_k + s - y\|^2\, |L_k(x_k + s, y)|$
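A minimal sketch, for n = 2, of determining a quadratic interpolation model from p = 6 function values by solving the interpolation conditions in the monomial basis; the slides use Lagrange polynomials and careful geometry management instead, and the function and point set below are made up, so this only makes the interpolation conditions concrete.

```python
# Sketch: quadratic interpolation model in 2 variables from p = 6 points,
# by solving the interpolation conditions in the monomial basis.
import numpy as np

def quad_basis(y):
    x1, x2 = y
    return np.array([1.0, x1, x2, 0.5 * x1 * x1, x1 * x2, 0.5 * x2 * x2])

f = lambda y: (y[0] - 1.0) ** 2 + 4.0 * (y[1] + 0.5) ** 2     # function to model

Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]])            # interpolation set
Phi = np.array([quad_basis(y) for y in Y])                    # p x p matrix
alpha = np.linalg.solve(Phi, np.array([f(y) for y in Y]))     # model coefficients

# |det Phi| plays the role of the poisedness measure delta(Y) for this basis
print("det =", np.linalg.det(Phi))
model = lambda y: quad_basis(y) @ alpha
print(model(np.array([0.5, 0.5])), f(np.array([0.5, 0.5])))   # model reproduces f here
```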

Philippe Toint (Namur)

April 2009

114 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Interpolation using Lagrange polynomials: construction

[figures: the original function and the interpolation set, the six Lagrange polynomials of the set, and the final interpolating quadratic]

Philippe Toint (Namur)

April 2009

115 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Other algorithmic ingredients


include a new point in the interpolation set
need to drop an existing interpolation point?
select which one to drop: make Y as poised as possible

Note: model/function minimizer may produce bad geometry!!


geometry improvement procedure . . .
trust-region radius management
trust region = $\mathcal{B}_k = \{ x_k + s \mid \|s\| \leq \Delta_k \}$
standard: reduce $\Delta_k$ when no progress
DFO: more complicated! (Could reduce $\Delta_k$ too fast and prevent
convergence. . . )

verify that $Y$ is poised before reducing $\Delta_k$

Philippe Toint (Namur)

April 2009

116 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Improving the geometry in a ball

attempt to reuse past points that are close to xk


attempt to replace a distant point of Y
attempt to replace a close point of Y
good geometry for the current k improvement impossible
Philippe Toint (Namur)

April 2009

117 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Self-correction at unsuccessful iterations (1)

At iteration k, define the set of exchangeable far points:

    $\mathcal{F}_k = \{ y \in Y_k \mid \|y - x_k\| > \Delta_k \text{ and } L_k(x_k + s_k, y) \neq 0 \}$

and the set of exchangeable close points (for some $\Lambda > 1$):

    $\mathcal{C}_k = \{ y \in Y_k \setminus \{x_k\} \mid \|y - x_k\| \leq \Delta_k \text{ and } |L_k(x_k + s_k, y)| \geq \Lambda \}$

Philippe Toint (Namur)

April 2009

118 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Self-correction at unsuccessful iterations (2)


Remarkably,
Whenever
  iteration k is unsuccessful,
  $\mathcal{F}_k = \emptyset$,
  $\Delta_k$ is small w.r.t. $\|g_k\|$,
then $\mathcal{C}_k \neq \emptyset$.
(an improvement of the geometry by a factor $\Lambda$ is always possible at
unsuccessful iterations when $\Delta_k$ is small and all exchangeable far points
have been considered)
$\Rightarrow$ no need to reduce $\Delta_k$ forever!

Philippe Toint (Namur)

April 2009

119 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Trust-region algorithm for DFO (1)


Algorithm 3.1: TR for DFO
Step 0: Initialization. Given: $x_0$, $\Delta_0$, $Y_0$ ($\Rightarrow L_0(\cdot, y)$). Set $k = 0$.
Step 1: Criticality test [complicated and not discussed here]
Step 2: Solve the subproblem. Compute $s_k$ that sufficiently reduces $m_k(x_k + s)$
within the trust region,
Step 3: Evaluation. Compute $f(x_k + s_k)$ and
    $\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)}.$

Step 4: Define the next iterate and interpolation set.


the big question
Step 5: Update the Lagrange polynomials.

Philippe Toint (Namur)

April 2009

120 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Trust-region algorithm for DFO (2)


Algorithm 3.2: Step 4: Define xk+1 and Yk+1
Step 4a: Successful iteration. If $\rho_k \geq \eta_1$, accept
$x_k + s_k$, increase $\Delta_k$ and exchange $x_k + s_k$ with
    $y_* = \arg\max_{y \in Y_k} \|y - (x_k + s_k)\|^2\, |L_k(x_k + s_k, y)|$

Step 4b: Replace far point. If $\rho_k < \eta_1$ (+ other technical condition) and $\mathcal{F}_k \neq \emptyset$, reject
$x_k + s_k$, keep $\Delta_k$ and exchange $x_k + s_k$ with
    $y_* = \arg\max_{y \in \mathcal{F}_k} \|y - (x_k + s_k)\|^2\, |L_k(x_k + s_k, y)|$

Step 4c: Replace close point. If $\rho_k < \eta_1$ (+ other technical condition) and $\mathcal{C}_k \neq \emptyset$, reject
$x_k + s_k$, keep $\Delta_k$ and exchange $x_k + s_k$ with
    $y_* = \arg\max_{y \in \mathcal{C}_k} \|y - (x_k + s_k)\|^2\, |L_k(x_k + s_k, y)|$

Step 4d: Decrease the radius. Otherwise, reject $x_k + s_k$, keep $Y_k$, and reduce $\Delta_k$.
Philippe Toint (Namur)

April 2009

121 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

Global convergence results


If the model is at least fully linear, then
    $\liminf_{k\to\infty} \|\nabla_x f(x_k)\| = \liminf_{k\to\infty} \|g_k\| = 0$

Scheinberg and T. (2009)

With a more costly algorithm:

If the model is at least fully linear, then
    $\lim_{k\to\infty} \|\nabla_x f(x_k)\| = \lim_{k\to\infty} \|g_k\| = 0$

If the model is at least fully quadratic, then the iterates converge to
2nd-order critical points
Philippe Toint (Namur)

April 2009

122 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

For an efficient numerical method. . .


Many more issues:
which Hessian approximation?
(full vs. diagonal or structured)
details of criticality tests difficult
details for numerically handling interpolation polynomials
(Lagrange, Newton),
reference shifts,
...
good codes around: NEWUOA, DFO ⇒ efficient solvers
Powell (2008 and previously), Conn, Scheinberg and T. (1998)
Conn, Scheinberg and Vicente (2008)

Philippe Toint (Namur)

April 2009

123 / 323

Derivative free optimization, filters and other topics

Derivative free optimization

On the ever famous banana function. . .

[3D surface plot of the Rosenbrock "banana" function over the region indicated by the axis values; the original slides repeat this picture as an animation of the progress of the derivative-free algorithm]

Philippe Toint (Namur)

April 2009

124 / 323

Derivative free optimization, filters and other topics

Infinite dimensional problems

3.2: Infinite dimensional problems

Philippe Toint (Namur)

April 2009

125 / 323

Derivative free optimization, filters and other topics

Infinite dimensional problems

Why consider infinite dimensions?

Main motivation:

large-scale finite dimensional problems often result from discretized


continuous ones (surfaces, time-trajectories, optimal control, . . . )
behaviour on these problems dominated by infinite dimensional
properties
Need to investigate infinite dimensions to ensure consistency!
Two main cases: Hilbert and Banach spaces.

Philippe Toint (Namur)

April 2009

126 / 323

Derivative free optimization, filters and other topics

Infinite dimensional problems

Convergence in Hilbert spaces

The trust-region algorithm is well-defined and globally convergent in Hilbert spaces.

Riesz representation theorem ⇒ V′ ≃ V
Cauchy point results from one-dimensional minimization
(but x_k^M may not exist!)

$\beta_k \stackrel{\rm def}{=} 1 + \sup_{x \in B_k} \|\nabla_{xx} m_k(x)\|_{V,V'}, \qquad \lambda_{\min}[H] \stackrel{\rm def}{=} \inf_{d \in V,\, d \neq 0} \frac{\langle d, H d \rangle}{\langle d, d \rangle}$

Philippe Toint (Namur)

April 2009

127 / 323

Derivative free optimization, filters and other topics

Infinite dimensional problems

What happens in Banach spaces?

Problem: the dual space is different from the primal!

Need further assumptions:

∇_x f(x) ∈ V for all x ∈ V.
∇_x f is uniformly continuous from V to V.
For every x ∈ {x ∈ V | f(x) ≤ f(x_0)},
$\langle \nabla_x f(x), \nabla_x f(x) \rangle \ge \varphi(\|\nabla_x f(x)\|_{V'}) \, \|\nabla_x f(x)\|_{V},$
for some continuous monotonically increasing real function φ from [0, ∞] to itself, independent of x and such that φ(0) = 0
and φ(t) > 0 for t > 0.

Philippe Toint (Namur)

April 2009

128 / 323

Derivative free optimization, filters and other topics

Infinite dimensional problems

Convergence in Banach spaces, nevertheless


The last assumption implies
$\langle g_k, g_k \rangle \ge \varphi(\|g_k\|_{V'}) \, \|g_k\|_{V}$
. . . and sufficient decrease follows!
Is this realistic?
The additional assumptions always hold for V = L^p(Ω) and 2 ≤ p < ∞, whenever the gradients remain bounded in L^p(Ω).
Under these additional assumptions, the trust-region algorithm
is well-defined and globally convergent in Banach spaces.

Philippe Toint (Namur)

April 2009

129 / 323

Derivative free optimization, filters and other topics

Filter algorithms

3.3: Filter algorithms

Philippe Toint (Namur)

April 2009

130 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Monotonicity (1)

Global convergence theoretically ensured by


some global measure. . .
unconstrained : f (xk )
(constrained : some merit function at xk )

. . . with strong monotonic behaviour (Lyapunov function)


Also practically enforced by
algorithmic safeguards around Newton's method
(linesearches, trust regions)

Philippe Toint (Namur)

April 2009

131 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Monotonicity (2)
But, unfortunately,
classical safeguards limit efficiency!
Of interest: design less obstructive safeguards while
ensuring better numerical performance
(the Newton Liberation Front!)
continuing to guarantee global convergence properties
Is this possible?
Typically:
abandon strict monotonicity of usual measures
but insist on average behaviour instead
Philippe Toint (Namur)

April 2009

132 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Non-monotone trust-regions
Idea: f(x_{k+1}) < f(x_k) replaced by f(x_{k+1}) < f_{r(k)}
with
f_{r(k)} < f_{r(k−1)}
Further issues:
suitably define the reference iteration r (k)
adapt the trust-region algorithm: also compare achieved and
predicted reductions since reference iteration
T. (1997)

Philippe Toint (Namur)

April 2009

133 / 323
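As a rough illustration (a sketch, not the reference implementation), the non-monotone acceptance test amounts to a single ratio computation; sigma_hist below is assumed to hold the accumulated predicted decreases since the reference iteration, as defined in Algorithm 3.3 on the next slide.

def nonmonotone_ratio(f_ref, f_k, f_trial, pred_k, sigma_hist):
    """Non-monotone trust-region acceptance ratio (sketch).

    f_ref      : f(x_{r(k)}), objective value at the reference iteration
    f_k        : f(x_k)
    f_trial    : f(x_k + s_k)
    pred_k     : m_k(x_k) - m_k(x_k + s_k) > 0, current predicted decrease
    sigma_hist : sum of the predicted decreases over the successful
                 iterations since the reference iteration
    """
    rho_hist = (f_ref - f_trial) / (sigma_hist + pred_k)   # average behaviour
    rho_std = (f_k - f_trial) / pred_k                     # classical TR ratio
    return max(rho_hist, rho_std)

# The trial step is accepted when nonmonotone_ratio(...) >= eta1, exactly as in
# the monotone case, but f(x_k + s_k) is now allowed to exceed f(x_k).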

Derivative free optimization, filters and other topics

Filter algorithms

Non-monotone TR algorithm
Algorithm 3.3: Non monotone TR algorithm (NMTR)
Step 0: Initialization. Given: x_0, Δ_0, η_1, η_2, γ_1, γ_2. Compute f(x_0), set k = 0.
Step 1: Model definition. Choose the norm ‖·‖_k and define m_k in B_k.
Step 2: Step calculation. Compute s_k that sufficiently reduces m_k and x_k + s_k ∈ B_k.
Step 3: Acceptance of the trial point. Compute f(x_k + s_k). Define the reference iteration r(k) ≤ k and

$\sigma_k^h = \sum_{i = r(k),\, i \in S}^{k-1} \big[ m_i(x_i) - m_i(x_i + s_i) \big]$

(S denotes the set of successful iterations). Define

$\rho_k = \max\left[ \frac{f(x_{r(k)}) - f(x_k + s_k)}{\sigma_k^h + m_k(x_k) - m_k(x_k + s_k)},\; \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)} \right].$

If ρ_k ≥ η_1, then define x_{k+1} = x_k + s_k; otherwise define x_{k+1} = x_k.

Step 4: Trust-region radius update. Set

$\Delta_{k+1} \in \begin{cases} [\Delta_k, \infty) & \text{if } \rho_k \ge \eta_2, \\ [\gamma_2 \Delta_k, \Delta_k) & \text{if } \rho_k \in [\eta_1, \eta_2), \\ [\gamma_1 \Delta_k, \gamma_2 \Delta_k] & \text{if } \rho_k < \eta_1. \end{cases}$

Increment k by one and go to Step 1.
Philippe Toint (Namur)
April 2009

134 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Sufficient decrease for NMTR


k
X

$f(x_{p(k)}) - f(x_{k+1}) \ge \eta_1 \kappa_{\rm mdc} \sum_{j = p(k),\, j \in S}^{k} \|g_j\| \min\left[ \frac{\|g_j\|}{\beta_j},\, \Delta_j \right]$

with p(k) = r(k) when σ_k^h ≥ σ_k^c, or p(k) = k otherwise

[Figure: the sequence of objective values f(x_k), decreasing non-monotonically with k]

$f(x_0) - f(x_{k+1}) \ge \eta_1 \kappa_{\rm mdc} \sum_{t = 0,\, t \in S}^{k} \|g_t\| \min\left[ \frac{\|g_t\|}{\beta_t},\, \Delta_t \right].$

Philippe Toint (Namur)
April 2009

135 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Choosing the reference iteration (1)

Algorithm 3.4: Choosing r (k)


Step 3: Acceptance of the trial point.
Step 3a: update the iterate. Compute f(x_k + s_k) and set

$\rho_k = \max\left[ \frac{f_r - f(x_k + s_k)}{\sigma_r + m_k(x_k) - m_k(x_k + s_k)},\; \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)} \right].$

If ρ_k < η_1, then x_{k+1} = x_k and go to Step 4; otherwise x_{k+1} = x_k + s_k and
σ_c = σ_c + m_k(x_k) − m_k(x_{k+1}) and σ_r = σ_r + m_k(x_k) − m_k(x_{k+1}).
Step 3b: update the best value. If f(x_{k+1}) < f_min then set f_c = f_min = f(x_{k+1}),
σ_c = 0 and ℓ = 0 and go to Step 4; otherwise, ℓ ← ℓ + 1.
Step 3c: update the reference candidate. If f(x_{k+1}) > f_c, set f_c = f(x_{k+1}) and σ_c = 0.
Step 3d: possibly reset the reference value. If ℓ = m, set f_r = f_c and σ_r = σ_c.

Philippe Toint (Namur)

April 2009

136 / 323
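The counters used in Steps 3b to 3d fit in a tiny state record; the sketch below is one possible reading of these update rules (the attribute names are mine, and the memory length m is assumed to be given).

class NMTRReference:
    """Bookkeeping for the reference value f_r of Algorithm 3.4 (sketch)."""

    def __init__(self, f0, m):
        self.f_min = f0      # best objective value found so far
        self.f_c = f0        # candidate reference value
        self.f_r = f0        # current reference value
        self.sigma_c = 0.0   # predicted decrease accumulated since the candidate
        self.sigma_r = 0.0   # predicted decrease accumulated since the reference
        self.ell = 0         # iterations since the last best value
        self.m = m           # memory length

    def on_success(self, f_new, pred):
        """Update after an accepted step with value f_new and predicted decrease pred."""
        self.sigma_c += pred
        self.sigma_r += pred
        if f_new < self.f_min:               # Step 3b: new best value
            self.f_min = self.f_c = f_new
            self.sigma_c = 0.0
            self.ell = 0
            return
        self.ell += 1
        if f_new > self.f_c:                 # Step 3c: update the candidate
            self.f_c = f_new
            self.sigma_c = 0.0
        if self.ell == self.m:               # Step 3d: reset the reference
            self.f_r = self.f_c
            self.sigma_r = self.sigma_c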

Derivative free optimization, filters and other topics

Filter algorithms

Choosing the reference iteration (2): example with m = 2

[Figure: the sequence of objective values f(x_k) versus k, with markers indicating the reference iteration, each new best value, and the iterations where the reference is redefined (ℓ = m); the row of numbers below the plot gives the counter ℓ at each iteration:]

0 0 0 0 1 2 0 1 0 1 1 1 2 1 2 0 1 1 1 1 2 0 1 1 2 0 0 1 2 0
Philippe Toint (Namur)

April 2009

137 / 323

Derivative free optimization, filters and other topics

Filter algorithms

An unconstrained example
[Plot: convergence history (value versus iteration count) for the two variants]

Monotone and non-monotone TR (using LANCELOT B) on EXTROSNB


Philippe Toint (Namur)

April 2009

138 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Introducing the filter

A fruitful alternative: filter methods


Constrained optimization:
using the SQP step, at the same time:
reduce the objective function f(x)
reduce the constraint violation θ(x)
⇒ CONFLICT

Philippe Toint (Namur)

April 2009

139 / 323

Derivative free optimization, filters and other topics

Filter algorithms

The filter point of view


Fletcher and Leyffer replace the question:
What is a better point?
by:
What is a worse point?
Of course, y is worse than x when
f(x) ≤ f(y) and θ(x) ≤ θ(y)
(y is dominated by x)
When is x_k + s_k acceptable?
Fletcher and Leyffer (2002), Fletcher, Gould, Leyffer, T. and Wächter (2002)
Philippe Toint (Namur)

April 2009

140 / 323

Derivative free optimization, filters and other topics

Filter algorithms

The standard filter


Idea: accept non-dominated points
no monotonicity of merit function implied
[Figure: filter entries in the (θ(x), f(x)) plane; each entry blocks the region it dominates]

Philippe Toint (Namur)
April 2009

141 / 323
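In code, acceptability with respect to the filter is a simple domination check over the stored (θ, f) pairs; the sketch below also includes a small margin gamma, an assumption on my part that matches the envelope illustrated on the next slide.

def acceptable_to_filter(theta_new, f_new, filter_entries, gamma=1e-5):
    """Return True if (theta_new, f_new) is not dominated by any filter entry (sketch).

    filter_entries : iterable of (theta_i, f_i) pairs already in the filter
    gamma          : small margin enforcing a minimal improvement
    """
    for theta_i, f_i in filter_entries:
        dominated = (theta_new >= (1.0 - gamma) * theta_i and
                     f_new >= f_i - gamma * theta_i)
        if dominated:
            return False
    return True

A trial point is then accepted when it is acceptable to the filter, without requiring any monotone decrease of a merit function.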

Derivative free optimization, filters and other topics

Filter algorithms

Filling up the standard filter


Note: the filter area is bounded in the (θ, f) space!

[Figure: the envelope added around a new filter entry, with margins f(x_k) − γθ_k and (1 − γ)θ_k]

filter area (non)-monotonically decreasing
Philippe Toint (Namur)

April 2009

142 / 323

Derivative free optimization, filters and other topics

Filter algorithms

The (unconstrained) feasibility problem


Feasibility
Find x such that
c(x) ≥ 0 and e(x) = 0
for general smooth c and e.

Least-squares
Find x by solving
$\min_x \sum_i \theta_i(x)^2$
where the θ_i measure the violations of the individual constraints.

Philippe Toint (Namur)

April 2009

143 / 323

Derivative free optimization, filters and other topics

Filter algorithms

A multidimensional filter (1)


(Simple) idea: more dimensions in filter space
[Figure: filter entries in the (θ_1(x), θ_2(x)) plane]

(full dimension vs. grouping)


Philippe Toint (Namur)

April 2009

144 / 323

Derivative free optimization, filters and other topics

Filter algorithms

A multidimensional filter (2)

Additionally
possibly consider unsigned filter entries
use a trust-region algorithm when the trial point is unacceptable
convergence to a non-zero solution (⇒ internal restoration)

Sound convergence theory

Gould, Leyffer and T. (2005)

Philippe Toint (Namur)

April 2009

145 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience: FILTRANE

Fortran 95 package
large scale problems (CUTEr interface)
includes several variants of the method
signed/unsigned filters
Gauss-Newton, Newton or adaptive models
pure trust-region option
uses preconditioned conjugate-gradients
+ Lanczos for subproblem solution

part of the GALAHAD library


Gould, Orban and T. (2003), Gould and T. (2007)

Philippe Toint (Namur)

April 2009

146 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience (1)


[Performance profile: fraction of problems for which the solver is within a factor of the best; curves: Default and Pure trust region]

Filter vs. trust-region (CPU time)


Philippe Toint (Namur)

April 2009

147 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience (2)


[Performance profile p(·); curves: Filter and LANCELOT]

Filter vs. LANCELOT B (CPU time)


Philippe Toint (Namur)

April 2009

148 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience (3)


[Performance profile p(·); curves: Filter and Unfettered]

Filter vs. free Newton (CPU time)


Philippe Toint (Namur)

April 2009

149 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Filter for unconstrained optimization


Again simple idea: use the gradient components g_i instead of θ_i

[Figure: filter entries in the (g_1(x), g_2(x)) plane]

(full dimension vs. grouping)


Philippe Toint (Namur)

April 2009

150 / 323

Derivative free optimization, filters and other topics

Filter algorithms

A few complications. . .
But . . .

g (x) = 0 not sufficient for nonconvex problems!

When negative curvature found:


reset filter
set upper bound on acceptable f (x)
(or. . . add a dimension for f in the filter)
reasonable convergence theory

Philippe Toint (Namur)

April 2009

151 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience (1)


[Performance profile p(·); curves: Default, Pure trust region and LANB]

Filter vs. trust-region and LANCELOT B (iterations)


Philippe Toint (Namur)

April 2009

152 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience: HEART6


[Plot: log of residual versus iterations]

Filter vs. trust-region and LANCELOT B


Philippe Toint (Namur)

April 2009

153 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience: EXTROSNB


[Plot: log of residual versus iterations]

Filter vs. trust-region and LANCELOT B


Philippe Toint (Namur)

April 2009

154 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Numerical experience: LOBSTERZ


[Plot: log of residual versus iterations]

Filter vs. trust-region


Philippe Toint (Namur)

April 2009

155 / 323

Derivative free optimization, filters and other topics

Filter algorithms

Conclusions

derivative-free optimization possible and efficient


non-monotonicity definitely helpful
filter methods very efficient
Newton's behaviour unexplained
. . . more research needed?

Philippe Toint (Namur)

April 2009

156 / 323

Derivative free optimization, filters and other topics

Bibliography

Bibliography for lesson 3 (1)


1

A. R. Conn, K. Scheinberg and Ph. L. Toint,


A Derivative Free Optimization Algorithm in Practice,
Proceedings of the 7th AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, St.
Louis, Missouri, September 1998.
A. R. Conn, K. Scheinberg and L. N. Vicente,
Introduction to Derivative-free Optimization,
SIAM-MPS Series on Optimization, 2008.
R. Fletcher and S. Leyffer,
Nonlinear Programming without a penalty function,
Mathematical Programming A, 91(2):239-269, 2002.
R. Fletcher, N. I. M. Gould, S. Leyffer, Ph. L. Toint and A. Wächter,
Global Convergence of Trust-Region SQP-Filter Algorithms for Nonlinear Programming,
SIAM Journal on Optimization, 13(3):635-659, 2002.
N. I. M. Gould and S. Leyffer and Ph. L. Toint,
A Multidimensional Filter Algorithm for Nonlinear Equations and Nonlinear Least-Squares,
SIAM Journal on Optimization, 15(1):17-38, 2005.
N. I. M. Gould, D. Orban and Ph. L. Toint,
GALAHAD, a library of thread-safe Fortran 90 packages for large-scale nonlinear optimization, ACM Transactions on
Mathematical Software, 29(4):353-372, 2003.
N. I. M. Gould,C. Sainvitu and Ph. L. Toint,
A Filter-Trust-Region Method for Unconstrained Optimization,
SIAM Journal on Optimization, 16(2):341-357, 2005.
N. I. M. Gould and Ph. L. Toint,
FILTRANE, a Fortran 95 filter-trust-region package for solving systems of nonlinear equalities, nonlinear inequalities
and nonlinear least-squares problems,
ACM Transactions on Mathematical Software, 33(1):3-25, 2007.
M. J. D. Powell,
A direct search optimization method that models the objective by quadratic interpolation,
Presentation at the 5th Stockholm Optimization Days, Stockholm, 1994.
Philippe Toint (Namur)

April 2009

157 / 323

Derivative free optimization, filters and other topics

Bibliography

Bibliography for lesson 3 (2)

10 M. J. D. Powell,
Developments of NEWUOA for minimization without derivatives,
IMA Journal of Numerical Analysis, 28(4):649-664, 2008.
11 K. Scheinberg and Ph. L. Toint,
Self-correcting geometry in model-based algorithms for derivative-free unconstrained optimization,
Report TR09/06, FUNDP, Namur, 2009.
12 Ph. L. Toint,
A non-monotone trust-region algorithm for nonlinear optimization subject to convex constraints,
Mathematical Programming, 77(1):69-94, 1997.
13 D. Winfield,
Function Minimization by Interpolation in a Data Table,
Journal of the IMA, 12:339-347, 1973.

Philippe Toint (Namur)

April 2009

158 / 323

Convex constraints and interior-point methods

Lesson 4:
Optimization with
convex constraints

Philippe Toint (Namur)

April 2009

159 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

4.1: Projection algorithms

Philippe Toint (Namur)

April 2009

160 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Projections on simple convex domains (1)


[Figure: projections P_C(y) of several points y onto a box; points already in C satisfy y = P_C(y)]

$[P_C(y)]_i \stackrel{\rm def}{=} \begin{cases} [x_\ell]_i & \text{if } [y]_i \le [x_\ell]_i, \\ [y]_i & \text{if } [x_\ell]_i < [y]_i < [x_u]_i, \\ [x_u]_i & \text{if } [x_u]_i \le [y]_i. \end{cases}$

Philippe Toint (Namur)

April 2009

161 / 323
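For simple bounds, the componentwise formula above is just a clip, which makes the projection trivial to compute; a minimal sketch:

import numpy as np

def project_onto_box(y, x_lower, x_upper):
    """Componentwise projection of y onto the box [x_lower, x_upper]."""
    return np.minimum(np.maximum(y, x_lower), x_upper)

# Example: projecting (2, -3, 0.5) onto the unit box [0, 1]^3 gives (1, 0, 0.5).
# project_onto_box(np.array([2.0, -3.0, 0.5]), np.zeros(3), np.ones(3))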

Convex constraints and interior-point methods

Projections and the projected gradient path

Projections on simple convex domains (2)

[Figure: projection onto a general convex set; points of C are their own projection, y = P_C(y)]

Philippe Toint (Namur)

April 2009

162 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Projections on simple convex domains (2)


. . . but also the ordered simplex . . .

[3D figure: the ordered simplex in the variables x1, x2, x3]

Idea: use those simple projections!


Philippe Toint (Namur)

April 2009

163 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

The projected gradient path


Define the projected gradient path = the Cauchy arc

$p(t, x) = P_C[x - t \nabla_x f(x)]$

[Figure: the piecewise-linear projected gradient path; beyond some value t_m of the parameter the path no longer moves, p(t, x) = p(t_m, x)]

Philippe Toint (Namur)

April 2009

164 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Two projections

[Figures: the projection $P_{T(x)}[-\nabla_x f(x)]$ onto the tangent cone and the projection $P_C[x - \nabla_x f(x)]$ onto C]

Philippe Toint (Namur)

April 2009

165 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Measuring criticality

Measure the gain in the linearized objective function per step of length Δ:

$\chi(x, \Delta) \stackrel{\rm def}{=} \left| \min_{x + d \in \mathcal{F},\, \|d\| \le \Delta} \langle \nabla_x f(x), d \rangle \right|, \qquad \pi(x, \Delta) = \frac{\chi(x, \Delta)}{\Delta}$

(the length of the projected gradient step, $\|P_{\mathcal{F}}(x - t\,g(x)) - x\|$, is also used)

Philippe Toint (Namur)

April 2009

166 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

The criticality measure






$\chi(x) \stackrel{\rm def}{=} \chi(x, 1) = \left| \min_{x + d \in \mathcal{F},\, \|d\| \le 1} \langle \nabla_x f(x), d \rangle \right|$

= the feasible reduction in the linearized objective for unit steps

reduces to ‖∇_x f(x)‖_2 in the unconstrained case

Philippe Toint (Namur)

April 2009

167 / 323
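For bound constraints and the infinity norm on the step d, the measure χ(x) separates over the components and can be evaluated in closed form; this is a sketch under those two assumptions (the slides use a generic norm), not the general definition.

import numpy as np

def chi_box(x, g, lower, upper):
    """chi(x) = | min { <g, d> : lower <= x + d <= upper, ||d||_inf <= 1 } | (sketch)."""
    lo = np.maximum(-1.0, lower - x)   # smallest feasible d_i
    hi = np.minimum(1.0, upper - x)    # largest feasible d_i
    # each component independently picks the end of its interval minimizing g_i * d_i
    d = np.where(g > 0, lo, hi)
    return abs(np.dot(g, d))

# Far from the bounds this reduces to the l1 norm of g, the infinity-norm
# analogue of the unconstrained remark on the slide above.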

Convex constraints and interior-point methods

Projections and the projected gradient path

The projected gradient path and χ

[Figure: the projected gradient path from x_k and the point x_k + d_k that achieves the criticality measure χ_k]

Philippe Toint (Namur)

April 2009

168 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

The generalized Cauchy point


Approximately minimize m_k(·) on the PG path:
Find

$x_k^{GC} \stackrel{\rm def}{=} P_{\mathcal{F}}[x_k - t_k^{GC} g_k] = x_k + s_k^{GC} \qquad (t_k^{GC} > 0)$

such that

$m_k(x_k^{GC}) \le f(x_k) + \kappa_{\rm ubs} \langle g_k, s_k^{GC} \rangle$   (below linear approximation)

and either

$m_k(x_k^{GC}) \ge f(x_k) + \kappa_{\rm lbs} \langle g_k, s_k^{GC} \rangle$   (above linear approximation)

or

$\|P_{T(x_k^{GC})}[-g_k]\| \le \kappa_{\rm epp}\, |\langle g_k, s_k^{GC} \rangle|$   (close to the path's end)

or

$\|s_k^{GC}\| \ge \kappa_{\rm frd} \Delta_k$   (close to the TR boundary)

Philippe Toint (Namur)
April 2009

169 / 323
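A point satisfying conditions of this type is usually located by a piecewise (backtracking or expanding) search on the parameter t of the projected gradient path; the sketch below implements only the backtracking half and only the "below the linear approximation" test, with hypothetical constants, so it is an illustration rather than the full search of the slides.

import numpy as np

def generalized_cauchy_point(x, g, model, project, Delta,
                             kappa_ubs=0.1, t0=1.0, shrink=0.5, max_iter=50):
    """Backtracking search along the projected gradient path (sketch).

    model(p)   : value of the quadratic model m_k at the point p
    project(p) : projection P_F onto the feasible set
    """
    m0 = model(x)
    t = t0
    for _ in range(max_iter):
        p = project(x - t * g)
        s = p - x
        within_tr = np.linalg.norm(s) <= Delta
        below_linear = model(p) <= m0 + kappa_ubs * np.dot(g, s)
        if within_tr and below_linear:
            return p, s, t
        t *= shrink                        # step too ambitious: backtrack
    return x, np.zeros_like(x), 0.0        # fallback (not reached in theory)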

Convex constraints and interior-point methods

Projections and the projected gradient path

Searching for the GCP (1)


[Plot: the model m_k along the projected gradient path, as a function of the path parameter]
mk (0 + s) = 3.57s1 1.5s2 s3 + s1 s2 + 3s2 + s2 s3 2s3 such that s 1.5 and 2.8

Philippe Toint (Namur)

April 2009

170 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Searching for the GCP (2)


[Plot: the model m_k along the projected gradient path, as a function of the path parameter]
mk (0 + s) = 3.57s1 1.5s2 s3 + s1 s2 + 3s2 + s2 s3 2s3 such that s 1.5 and 1.8

Philippe Toint (Namur)

April 2009

171 / 323

Convex constraints and interior-point methods

Projections and the projected gradient path

Useful properties
Piecewise search for xkGC well-defined and finite
1

(, ), (, ) and (, ) are continuous

(x, ) is non-decreasing

(x, ) is non-decreasing

(x, ) is non-increasing

(xk ) (xk , kskGC k) + 2kPT (x GC ) [gk ]k

hgk , skGC i = (xk , kskGC k) 0

(xk , t) t kPT (x(t)) [x f (xk )]k

|(x) (y )| Lkx y k
if x f (x) is continuous on a bounded level set

Philippe Toint (Namur)

April 2009

172 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

Cauchy decrease along the projected gradient path


The Cauchy condition: minimize mk long the projected gradient path



k
mk (xk ) mk (xk + sk ) CR k min
, k , 1
1 + kHk k
Idea: Linesearch conditions imply
mk (xk ) mk (xkGC ) ubs |hgk , skGC i| = ubs (xk , kskGC k)
but need
kPT (P[xk tj gk ]) [gk ]k epp

|hgk , sk (tj )i|


k

def

Now define k = min[1, k ] k . Then




k
mk (xk ) mk (xk ) dcp k min
, k
k
GC

Philippe Toint (Namur)

April 2009

173 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

How far can we turn the handle?

As above. . .
All limit points are first-order critical, i.e.
$\lim_{k \to \infty} \chi_k = 0$

But . . .
does the active set settle ?
(needed for 2nd-order convergence or rate)

Philippe Toint (Namur)

April 2009

174 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

Active constraints identification (1)


Require further assumptions: let L = { limit points of {xk } }
x L , {x ci (x )}iA(x ) are linearly independent
x L , x f (x ) ri{N (x )}
k, A(xkGC ) A(xk + sk )
For each connected component of limit points L(x ) L ,
there exists a set A {1, . . . , m} for which
A(x ) = A for all x L(x ).
Idea: connectivity + uniqueness of Lagrange multipliers
each L(x ) belongs to a single facet of C
Philippe Toint (Namur)

April 2009

175 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

Active constraints identification (2)


There exists a (0, 1) such that
dist(x , L0 )
for every x L and each compact connected component of
limit points L0 such that A(L0 ) 6= A(x ).
Idea: continuity + compactness well separated
There exist (0, 14 ), (0, 1), and k1 0 such that, for
k k1 , there is a Lk such that
xk V(Lk , ) = {x IRn | dist(x, Lk ) }
and
A(x) A(Lk ) for all x V(Lk , ).
Idea: partition the complete sequence into convergent subsequences
each xk near a unique Lk
Philippe Toint (Namur)

April 2009

176 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

Active constraints identification (3)


There exists k2 k1 such that, if for some k k2 ,
j A(Lk ) and j 6 A(xkGC ),
then, for some  (0, 1) independent of k and j,
k  .
Idea: complicated (uses criticality measures for incomplete constraint sets)
incomplete local A(xk ) implies not critical
(more technical arguments here)
There exists an active set A , such that
x L

A(x ) = A

and, for all k sufficiently large,


A(xk ) = A(xkGC ) = A
Philippe Toint (Namur)

April 2009

177 / 323

Convex constraints and interior-point methods

Trust-region method for convex constraints

Further convergence results

. . . and now it works in T (xk ) ( now continuous for large k ) with


∇_xx m_k replaced by ∇_xx m_k^ℓ ≈ ∇_xx ℓ(x_k, y_k)
convergence to isolated critical points
(generalized) eigen-points for the Lagrangian
(needs consistent multiplier estimates!)
convergence to second-order points
fast asymptotic rate of convergence

Philippe Toint (Namur)

April 2009

178 / 323

Convex constraints and interior-point methods

Barriers and interior points

4.2: Barrier methods

Philippe Toint (Namur)

April 2009

179 / 323

Convex constraints and interior-point methods

Barriers and interior points

A simple case
Consider C = {x ∈ IR^n | x ≥ 0} and build

$f(x) - \mu \langle e, \log(x) \rangle = f(x) - \mu \sum_{i=1}^{n} \log(x_i)$

Under acceptable assumptions, the minimizers

$x(\mu) = \arg\min_x \big[ f(x) - \mu \langle e, \log(x) \rangle \big]$

converge to the solution of the problem

$\min_{x \in C} f(x)$

when μ ↘ 0.
Philippe Toint (Namur)

April 2009

180 / 323

Convex constraints and interior-point methods

Barriers and interior points

How it works. . .
Example: $\min_{x_1, x_2 \ge 0} \; 120\,[x_1^2 (x_1 - 1) - x_2 + 1]^2 + 10(4 + x_1)^2 - 150$

[Contour plots over the box shown by the axis values: the original objective function, then the objective plus the log barrier for μ = 50, 25, 10, 5 and 2; as μ decreases, the barrier minimizer moves towards the constrained minimizer]

Philippe Toint (Namur)

April 2009

181 / 323

Convex constraints and interior-point methods

Barriers and interior points

Other barriers: reciprocals


$b^{R(\gamma)}(x, \mu) = \mu \sum_{i=1}^{n} \frac{1}{[x]_i^{\gamma}}$

[Contour plots: μ = 2, log + R(·), R(1) and R(2)]


Philippe Toint (Namur)

April 2009

182 / 323

Convex constraints and interior-point methods

Barriers and interior points

The barrier function


$f(x) + b(x, \mu) \;=\; f(x) - \mu \langle e, \log(x) \rangle \quad \text{(log-barrier case)}$

Assume:
b(x, μ) is defined for all x ∈ ri{C} and all μ > 0, and is C²(ri{C}) w.r.t. x.
For all μ > 0 and all ε > 0, there exists κ_bbh(ε, μ) ≥ 1 such that ‖∇_xx b(x, μ)‖ ≤ κ_bbh(ε, μ) for all x ∈ C with dist(x, ∂C) ≥ ε.
lim_{p→∞} b(y_p, μ) = +∞ for all μ > 0 and all sequences {y_p}_{p=0}^∞ such that y_p ∈ ri{C} and lim_{p→∞} dist(y_p, ∂C) = 0.

Philippe Toint (Namur)

April 2009

183 / 323

Convex constraints and interior-point methods

Barriers and interior points

An elementary barrier algorithm


Algorithm 4.1: A simple barrier algorithm
Step 0: Initialization. Given: x_0 ∈ ri{C}, μ_0 > 0. Set k = 0.
Step 1: Inner minimization. (Approximately) solve the problem
min_x f(x) + b(x, μ_k)
by applying an unconstrained (inner) algorithm, starting from
a suitable starting point x_{k,0} ∈ ri{C}.
Let x_{k+1} be the corresponding (approximate) solution.
Step 2: Update the barrier parameter. Choose μ_{k+1} > 0 such that
lim_{k→∞} μ_k = 0.
Increment k by one and return to Step 1.


Philippe Toint (Namur)

April 2009

184 / 323
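The outer loop of Algorithm 4.1 is easy to prototype once an inner solver is available; in the sketch below, inner_minimize is a hypothetical unconstrained solver that keeps its iterates in the interior, and the geometric decrease of mu is just one common choice.

import numpy as np

def barrier_method(f, x0, inner_minimize, mu0=1.0, mu_factor=0.1, n_outer=8):
    """Elementary barrier loop for min f(x) subject to x >= 0 (sketch).

    inner_minimize(phi, x_start) is assumed to return an (approximate)
    minimizer of phi while keeping its iterates strictly positive.
    """
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(n_outer):
        def phi(y, mu=mu):                 # barrier function for the current mu
            return f(y) - mu * np.sum(np.log(y))
        x = inner_minimize(phi, x)         # approximate inner minimization
        mu *= mu_factor                    # drive the barrier parameter to 0
    return x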

Convex constraints and interior-point methods

Barriers and interior points

A first inner primal algorithm


Algorithm 4.2: Inner primal 1
Step 0: Initialization. Given: xk,0 ri{C}, k,0 , 1 , 2 , 1 , 2 , k
Compute (x0 , k ), set j = 0.

(0, 1).

Step 1: Model definition. Define mk,j of (xk,j + s, k ) in Bk,j of the form


f
b
mk,j (xk,j + s) = mk,j
(xk,j + s)+mk,j
(xk,j + s),
Step 2: Step calculation. Compute sk,j that sufficiently reduces mk,j and
such that xk,j + sk,j Bk,j .
Step 3: Acceptance of the trial point. If xk,j + sk,j 6 C or if dist(xk,j + sk,j , C) <
k dist(xk,j , C), set k,j = , xk,j+1 = xk,j and go to Step 4.
Otherwise compute (xk,j + sk,j , k ) and
(xk,j , k ) (xk,j + sk,j , k )
.
k,j =
mk,j (xk,j ) mk,j (xk,j + sk,j )
Then if k,j 1 , define xk,j+1 = xk,j + sk,j ; otherwise define xk,j+1 = xk,j .
Step 4: Trust-region radius update. Set
8
< [k,j , )
[2 k,j , k,j ]
k,j+1
:
[1 k,j , 2 k,j ]
Increment
Philippe Toint (Namur)

if k,j 2 ,
if k,j [1 , 2 ),
if k,j < 1 .

j by one and go to Step 1.


April 2009

185 / 323

Convex constraints and interior-point methods

Barriers and interior points

Models and assumptions


Use separate models for f and b!
f
b
mk,j (xk,j + s) = mk,j
(xk,j + s) + mk,j
(xk,j + s),

Assume:
k,  > 0, bbmh (, k ) 1 k, j 0,
b
kxx mk,j
(x, k )k bbmh (, k )

x Bk,j C such that dist(x, C) .


k, j 0 x Bk,j ri{C},
f
kxx mk,j
(x)k umh

Philippe Toint (Namur)

April 2009

186 / 323

Convex constraints and interior-point methods

Barriers and interior points

(Inner) convergence properties


There exists mdb (k) (0, 1) such that
dist(xk,j , C) mdb (k)
for all j. Moreover, for all j and all x such that kx xk,j k
(1 k )dist(xj , C), we have that
kxx b(x, )k bbh (k mdb (k), k )
and

b
kxx mk,j
(xk,j , )k bbmh (k mdb (k), k )

If k,j (1 k )mdb (k), then


|(xk,j + sk,j , k ) mk,j (xk,j + sk,j )| ubh (k)2k,j

. . . and all the nice convergence properties follow!


Philippe Toint (Namur)

April 2009

187 / 323

Convex constraints and interior-point methods

Barriers and interior points

Constrained Cauchy and eigen-points (1)


Idea: restrict the step, not the trust region!

xk,j

k dist(xk,j , C)
@
@
@
R
@

k,j

xk,j + sk,j
ri{C}

But . . . what of sufficient decrease ???


Philippe Toint (Namur)

April 2009

188 / 323

Convex constraints and interior-point methods

Barriers and interior points

Constrained Cauchy and eigen-points (2)

Redefine the Cauchy arc:


def

CC
xk,j
(t) = {x | x = xk,j tgk,j , t 0, tkgk,j k (1 k )dk,j and x Bk },

kgk,j k
mk,j (xk,j )mk,j (xk,j ) kgk,j k min
, k,j , (1 k )dk,j
k,j
CC

1
2

. . . etc, etc, etc . . .

Philippe Toint (Namur)

April 2009

189 / 323

Convex constraints and interior-point methods

Barriers and interior points

A second inner primal algorithm


Algorithm 4.3: Inner primal 2
Step 0: Initialization. Given: xk,0 ri{C}, k,0 , 1 , 2 , 1 , 2 , k (0, 1).
Compute (xk,0 , k ), set j = 0.
Step 1: Model definition.

f
b
Define mk,j (xk,j + s) = mk,j
(xk,j + s) + mk,j
(xk,j + s)

Step 2: Step calculation. Define dk,j = dist(xk,j , C). Compute sk,j such that
xk,j + sk,j Bk,j C and dist(xk,j + sk,j , C) k dk,j
and such that it sufficiently reduces mk,j
Compute (xk,j + sk,j , k ) and
(xk,j , k ) (xk,j + sk,j , k )
k,j =
.
mk,j (xk,j ) mk,j (xk,j + sk,j )
1 , define xk,j+1 = xk,j + sk,j ; otherwise define xk,j+1 = xk,j .

Step 3: Acceptance of the trial point.

Then if k,j

Step 4: Trust-region radius update. Set


8
< [k,j , )
[2 k,j , k,j ]
k,j+1
:
[1 k,j , 2 k,j ]
Increment j by one and go to Step 1.
Philippe Toint (Namur)

if k,j 2 ,
if k,j [1 , 2 ),
if k,j < 1 .

April 2009

190 / 323

Convex constraints and interior-point methods

Barriers and interior points

The log barrier and its derivatives


Return to:
$\min_{x \ge 0} f(x)$

The log barrier

$b(x, \mu) = -\mu \langle e, \log(x) \rangle$

giving the log-barrier function f(x) − μ⟨e, log(x)⟩.
Using the notation X = diag(x_1, . . . , x_n), we obtain that

$\nabla_x b(x, \mu) = -\mu X^{-1} e \qquad \text{and} \qquad \nabla_{xx} b(x, \mu) = \mu X^{-2}$

Philippe Toint (Namur)

April 2009

191 / 323
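These derivatives translate directly into code; a short sketch for the nonnegativity constraint x ≥ 0 (the Hessian being diagonal, only its diagonal is returned):

import numpy as np

def log_barrier(x, mu):
    """Value, gradient and Hessian diagonal of b(x, mu) = -mu * sum(log x_i)."""
    value = -mu * np.sum(np.log(x))
    grad = -mu / x                  # -mu * X^{-1} e
    hess_diag = mu / x ** 2         # diagonal of mu * X^{-2}
    return value, grad, hess_diag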

Convex constraints and interior-point methods

Barriers and interior points

The primal log-barrier algorithm


Algorithm 4.4: Primal log-barrier algorithm
Step 0: Initialization. Given: x_0 > 0, μ_0 > 0, and the forcing functions ε^D(·) and ε^E(·). Set k = 0.
Step 1: Inner minimization. Choose a value in (0, 1). Approximately minimize the log-barrier function f(x) − μ_k⟨e, log(x)⟩ starting from x_k and using an inner algorithm in which

$m_{k,j}^{b}(x_{k,j} + s) = -\mu_k \langle e, \log(x_{k,j}) \rangle - \mu_k \langle X_{k,j}^{-1} e, s \rangle + \tfrac{1}{2} \mu_k \langle s, X_{k,j}^{-2} s \rangle$

Stop this algorithm as soon as an iterate x_{k,j} = x_{k+1} is found for which

$\|\nabla_x f(x_{k+1}) - \mu_k X_{k+1}^{-1} e\| \le \epsilon^D(\mu_k), \qquad \lambda_{\min}\big[\nabla_{xx} f(x_{k+1}) + \mu_k X_{k+1}^{-2}\big] \ge -\epsilon^E(\mu_k)$

and x_{k+1} > 0.

Step 2: Update the barrier parameter. Choose μ_{k+1} > 0 such that lim_{k→∞} μ_k = 0. Increment k by one and return to Step 1.
Philippe Toint (Namur)

April 2009

192 / 323

Convex constraints and interior-point methods

Barriers and interior points

Convergence of the primal log-barrier algorithm (1)


OK for first order! . . . but existence of limit points not guaranteed
Define
A subsequence {xkj } is consistently active w.r.t. the bounds if,
for each i = 1, . . . , n, either
lim [xkj ]i = 0 or lim inf [xkj ]i > 0.

(Each bound constraint is asymptotically active or inactive for the


complete subsequence.)
def

A{xkj } = {i {1, . . . , n} | lim [xkj ]i = 0}.


j

Note: finite number of such subsequences a partition of {xk }


Philippe Toint (Namur)

April 2009

193 / 323

Convex constraints and interior-point methods

Barriers and interior points

Convergence of the primal log-barrier algorithm (2)


Finally,
Under appropriate assumptions,
lim inf [x f (xk )]i 0,
k

(i = 1, . . . , n).

Furthermore, for every consistently active subsequence {xk` },


lim [x f (xk` )]i = 0,

(i 6 A{xk` })

and
lim inf hu, [xx f (xk` )]ui 0
`

for each u | [u]i = 0 whenever i A{xk` }.

Philippe Toint (Namur)

April 2009

194 / 323

Convex constraints and interior-point methods

Barriers and interior points

The primal-dual framework (1)


In practice, as x_k ↘ 0, ∇_xx m_{k,j}(x_{k,j}) + μ_k X_{k,j}^{-2} causes slow progress.

Idea: replace this by

$\nabla_{xx} m_{k,j}(x_{k,j}) + X_{k,j}^{-1} Z_{k,j}$

where Z_{k,j} is a bounded positive diagonal.

Alternatively: KKT conditions for the original problem:

$\nabla_x m(x) - z = 0, \quad XZe = 0, \quad x \ge 0, \quad z \ge 0,$

Perturb:

$\nabla_x m(x) - z = 0, \quad XZe = \mu e, \quad x \ge 0, \quad z \ge 0.$

Philippe Toint (Namur)

April 2009

195 / 323

Convex constraints and interior-point methods

Barriers and interior points

The primal-dual framework (2)


Now write Newton's method for the perturbed problem:

$\nabla_{xx} m_{k,j}(x_{k,j}) \Delta x_{k,j} - \Delta z_{k,j} = -g_{k,j} + z_{k,j},$
$X_{k,j} \Delta z_{k,j} + Z_{k,j} \Delta x_{k,j} = \mu_k e - X_{k,j} Z_{k,j} e,$
$x_{k,j} + \Delta x_{k,j} \ge 0, \quad z_{k,j} + \Delta z_{k,j} \ge 0.$

Substituting the 2nd equation into the 1st:

$\big[ \nabla_{xx} m_{k,j}(x_{k,j}) + X_{k,j}^{-1} Z_{k,j} \big] \Delta x_{k,j} = -\big( g_{k,j} - \mu_k X_{k,j}^{-1} e \big)$

But g_{k,j} − μ_k X_{k,j}^{-1} e is exactly the gradient of the log-barrier function at x_{k,j}. Hence

$\big[ \nabla_{xx} m_{k,j}(x_{k,j}) + X_{k,j}^{-1} Z_{k,j} \big] \Delta x_{k,j} = -\nabla_x \big( f(x) - \mu_k \langle e, \log(x) \rangle \big) \big|_{x = x_{k,j}}$

Philippe Toint (Namur)

April 2009

196 / 323
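For the bound-constrained case, the condensed primal-dual system above can be formed and solved directly; the dense sketch below uses H for ∇_xx m and adds a standard fraction-to-the-boundary safeguard, which is my addition and not spelled out on this slide.

import numpy as np

def primal_dual_step(H, g, x, z, mu, tau=0.995):
    """One primal-dual Newton step for min f(x) subject to x >= 0 (dense sketch)."""
    # condensed system: (H + X^{-1} Z) dx = -(g - mu X^{-1} e)
    M = H + np.diag(z / x)
    dx = np.linalg.solve(M, -(g - mu / x))
    # recover dz from  X dz + Z dx = mu e - X Z e
    dz = (mu - x * z - z * dx) / x
    # fraction-to-the-boundary rule keeping x > 0 and z > 0
    alpha = 1.0
    for v, dv in ((x, dx), (z, dz)):
        neg = dv < 0
        if np.any(neg):
            alpha = min(alpha, tau * np.min(-v[neg] / dv[neg]))
    return x + alpha * dx, z + alpha * dz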

Convex constraints and interior-point methods

Barriers and interior points

The primal-dual inner algorithm (1)


Algorithm 4.5: Inner primal-dual algorithm
Step 0: Initialization. Given: xk,0 ri{C}, zk,0 > 0 , k,0 , 1 , 2 , 1 ,2 , k .
Compute f (xk,0 ), set j = 0.
Step 1: Model definition. In Bk,j , define
h
i
1
f
1
mk,j (xk,j + s) = mk,j
(xk,j + s) k he, log(xk,j )i + hXk,j
e, si 21 hs, Xk,j
Zk,j si
Step 2: Step calculation. Define dk,j = dist(xk,j , C). Compute a step sk,j such
that xk,j + sk,j Bk,j , dist(xk,j + sk,j , C) k dk,j , and
(

"

mk,j (xk,j )mk,j (xk,j +sk,j ) max kgk,j k min

kgk,j k
k,j

Step 3: Acceptance of the trial point.


k,j =

#
, k,j , (1 k )dk,j

)
h
i
2
2
2 2
, k,j min k,j , k,j , (1k ) dk,j

Compute log (xk,j + sk,j , k ) and

log (xk,j , k ) log (xk,j + sk,j , k )


.
mk,j (xk,j ) mk,j (xk,j + sk,j )

If k,j 1 , then xk,j+1 = xk,j + sk,j , else xk,j+1 = xk,j .


Philippe Toint (Namur)

April 2009

197 / 323

Convex constraints and interior-point methods

Barriers and interior points

The primal-dual inner algorithm (2)

Algorithm 4.6: Inner primal-dual algorithm (2)

Step 4: Trust-region radius update. Set


8
< [k,j , )
[2 k,j , k,j ]
k,j+1
:
[1 k,j , 2 k,j ]

if k,j 2 ,
if k,j [1 , 2 ),
if k,j < 1 .

Step 5: Update the dual variables. Set zk,j+1 > 0. Increment j by one,go to
Step 1.

Philippe Toint (Namur)

April 2009

198 / 323

Convex constraints and interior-point methods

Barriers and interior points

The primal-dual outer algorithm

Algorithm 4.7: Outer primal-dual algorithm


Step 0: Initialization. Given: x0 > 0, z0 > 0, 0 > 0 and the forcing functions
D (), E (), C (). Set k = 0.
Step 1: Inner minimization. Choose k (0, 1). Approximately minimize log (x, k )
from xk using the primal-dual inner algorithm. Stop as soon as an iterate
(xk,j , zk,j ) = (xk+1 , zk+1 ) is found for which
kx f (xk+1 ) zk+1 k D (k ),
kXk+1 Zk+1 k I k C (k ),
and

1
min [xx f (xk+1 ) + Xk+1
Zk+1 ] E (k )

xk+1 > 0 and zk+1 > 0.


Step 3: Update the barrier parameter. Choose k+1 > 0 such that limk k = 0.
Increment k by one and return to Step 1.
1
Note: choosing zk,j = k Xk,j
e primal algorithm!
Philippe Toint (Namur)

April 2009

199 / 323

Convex constraints and interior-point methods

Barriers and interior points

Updating the dual variables


How to compute zk,j+1 in practice? Newton equations give
1
1
z k,j+1 = k Xk,j
e Xk,j
Zk,j sk,j .

. . . but what about zk,j+1 0?


Define

"
I = zul min

1
e, zk,j , k Xk,j+1
e

, zuu max

1
e, zk,j , 1
k e, k Xk,j+1 e

and choose

zk,j+1 =
Philippe Toint (Namur)

PI [z k,j+1 ]
zk,j

if xk,j+1 = xk,j + sk,j


if xk,j+1 = xk,j ,
April 2009

200 / 323

Convex constraints and interior-point methods

Barriers and interior points

Properties of the dual variables


Then zk,j+1 > 0 and

[zk,j ]i uzi max

1
[xk,j ]i


,1 .

If, furthermore,
lim ksk,j k = 0 when

lim kgk,j k = 0

then



1
lim zk,j k Xk,j
e = 0 if

lim kgk,j k = 0.

asymptotically exact barrier Hessian for fixed


Philippe Toint (Namur)

April 2009

201 / 323

Convex constraints and interior-point methods

Barriers and interior points

Scaling of the inner iterations


In practice, scaling is crucial!
Ideally,
k kk,j = k kxx mk,j (xk,j ) =

q
1
h, [Hk,j + Xk,j
Zk,j ]i

Under the usual assumptions, k kk,j is uniformly equivalent to


the Euclidean norm for fixed k.

xk,j
xk,j + sk,j
ri{C}
all usual convergence properties for fixed k
Philippe Toint (Namur)

April 2009

202 / 323

Convex constraints and interior-point methods

Barriers and interior points

Scaling of the outer iterations (1)

Scaled tests:
kx f (xk+1 ) zk+1 k[k+1] D (k )
kXk+1 Zk+1 k I k2 C (k ),

min

12
Mk+1
(xx f (xk+1 )

with

21
1
Xk+1
Zk+1 )Mk+1

E (k ),

def

1
Mk+1 = Hk+1 + Xk+1
Zk+1

But this matrix is unbounded when k % !

Philippe Toint (Namur)

April 2009

203 / 323

Convex constraints and interior-point methods

Barriers and interior points

Scaling of the outer iterations (2)


Fortunately,
Under the usual assumptions, the convergence properties are
preserved if
D (k )
lim

k
k
and

C (k ) k
lim
= 0.
k mini [xk+1 ]i

Moreover
If exact derivatives are used, the  (k ) can be chosen to ensure
componentwise near quadratic rate of convergence.
This is quite remarkable!
Philippe Toint (Namur)

April 2009

204 / 323

Convex constraints and interior-point methods

Barriers and interior points

Barriers for general convex constraints


Now,
log (x, ) = f (x) he, log(c(x))i
The primal-dual model becomes
f
b
mk,j (xk,j + sk,j ) = mk,j
(xk,j + sk,j ) + mk,j
(xk,j + sk,j ),

with
b
mk,j
(xk,j + sk,j )

k he, log(c(xk,j ))i k hC 1 (xk,j )e, A(xk,j )sk,j i


+ 12 hA(xk,j )sk,j , [C 1 (xk,j )Yk,j ]A(xk,j )sk,j i
12

m
X
[yk,j ]i hsk,j , xx ci (xk,j )sk,j i
i=1

Quite a mouthful. . . but otherwise everything is OK!

Philippe Toint (Namur)

April 2009

205 / 323

Convex constraints and interior-point methods

Barriers and interior points

The outer primal-dual algorithm for convex constraints


xx `(xk,j , yk,j ) = xx f (xk,j )

m
X
[yk,j ]i xx ci (xk,j )

def

Gk,j = A (xk,j )C

(xk,j )Yk,j A(xk,j )

i=1

Algorithm 4.8: Primal-dual algorithm for convex constraints


Step 0: Initialization Given: x0 | c(x0 ) > 0, y0 > 0, 0 > 0, C (), D () and E ().
Set k = 0.
Step 1: Inner minimization Choose k (0, 1). Approximately minimize
log (x, k ) = f (x) k he, log(c(x))i
from xk . Stop as soon as (xk,j , yk,j ) = (xk+1 , yk+1 ) is found such that
kx f (xk+1 ) AT (xk+1 )yk+1 k D (k ),
kC (xk+1 )Yk+1 e k I k C (k ),
min [xx `(xk+1 , yk+1 ) + Gk+1 ] E (k )
and
(c(xk+1 ), yk+1 ) 0.
Step 3: Update the barrier parameter. Choose k+1 > 0 such that
limk k = 0. Increment k by one and return to Step 1.
Philippe Toint (Namur)

April 2009

206 / 323

Convex constraints and interior-point methods

Bibliography

Bibliography for lesson 4 (1)


1

J. V. Burke,
On the identification of active constraints,
SIAM Journal on Numerical Analysis, 25(5):1197-1211, 1988.
J. V. Burke,
On the identification of active constraints II: the nonconvex case,
SIAM Journal on Numerical Analysis, 27(4):1081-1102, 1990.
J. V. Burke and J. J. Moré,
Exposing Constraints,
SIAM Journal on Optimization, 4(3):573-595, 1994.
J. V. Burke, J. J. Moré and G. Toraldo,
Convergence properties of trust region methods for linear and convex constraints,
Mathematical Programming A, 47(3):305-336, 1990.
A. R. Conn, N. I. M. Gould, D. Orban and Ph. L. Toint,
A primal-dual trust-region algorithm for minimizing a non-convex function subject to bound and linear equality
constraints,
Mathematical Programming, 87(2):215-249, 2000.
N. I. M. Gould, D. Orban, A. Sartenaer and Ph. L. Toint,
Superlinear Convergence of Primal-Dual Interior Point Algorithms for Nonlinear Programming,
SIAM Journal on Optimization, 11(4):974-1002, 2001.
A. R. Conn, N. I. M. Gould and Ph. L. Toint,
Global convergence of a class of trust region algorithms for optimization with simple bounds,
SIAM Journal on Numerical Analysis, 25(182):433-460, 1988.
A. R. Conn, N. I. M. Gould and Ph. L. Toint,
LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A),
Springer Verlag, Heidelberg, 1992.
A. R. Conn, N. I. M. Gould, A. Sartenaer and Ph. L. Toint,
Global convergence of a class of trust region algorithms for optimization using inexact projections on convex
constraints,
SIAM Journal on Optimization, 3(1):164-221, 1993.
Philippe Toint (Namur)

April 2009

207 / 323

Convex constraints and interior-point methods

Bibliography

Bibliography for lesson 4 (2)

10 A. V. Fiacco and G. P. McCormick,


The Sequential Unconstrained Minimization Technique for Nonlinear Programming: a Primal-Dual Method,
Management Science, 0(2):360-366, 1964.
11 A. V. Fiacco and G. P. McCormick,
Nonlinear Programming: Sequential Unconstrained Minimization Techniques,
Wiley and Sons, Chichester (UK), 1968.
12 Ph. L. Toint,
Global convergence of a class of trust region methods for nonconvex minimization in Hilbert space,
IMA Journal of Numerical Analysis, 8(2):231-252, 1988.

Philippe Toint (Namur)

April 2009

208 / 323

The use of problem structure for large-scale applications

Lesson 5:
Sparsity, partial separability
and multilevel methods:
exploiting problem structure

Philippe Toint (Namur)

April 2009

209 / 323

The use of problem structure for large-scale applications

Outline

Sparsity and partial separability

Multilevel problems

Bibliography

Philippe Toint (Namur)

April 2009

210 / 323

The use of problem structure for large-scale applications

Sparsity

5.1: Sparsity and


partial separability

Philippe Toint (Namur)

April 2009

211 / 323

The use of problem structure for large-scale applications

Sparsity

Sparsity
A matrix is sparse when the proportion and/or distribution of
its zero entries allows its efficient numerical usage
An (oriented) graph is associated with every sparse (non)symmetric matrix

Philippe Toint (Namur)

April 2009

212 / 323

The use of problem structure for large-scale applications

Sparsity

Main benefits of sparsity


Sparsity and optimization: Hessian (and) Jacobian matrices

very important time/space savings in solving Newton's equations (unconstrained or KKT)
1. factorizations (reduced fill-in)
2. iterative methods (fast matrix-vector products)

sometimes important in approximation schemes
1. derivative-free methods (makes the number of function evaluations linear in the number of variables)
2. finite-difference approximations
3. quasi-Newton methods

a path for parallel computations

exploiting sparsity = an active algorithmic industry!
Philippe Toint (Namur)

April 2009

213 / 323

The use of problem structure for large-scale applications

Sparsity

The Curtis-Powell-Reid algorithm for estimating sparse


Jacobians

Finite differences for a Jacobian column:


$J e_i \approx \frac{c(x + h e_i) - c(x)}{h}$

Question: How many finite differences for estimating a 5 × 5 Jacobian with the structure:

[Figure: 5 × 5 sparsity pattern of the Jacobian]

Philippe Toint (Namur)

April 2009

214 / 323


The use of problem structure for large-scale applications

Sparsity

The Curtis-Powell-Reid algorithm for estimating sparse


Jacobians
Finite differences for a Jacobian column:

$J e_i \approx \frac{c(x + h e_i) - c(x)}{h}$

Question: How many finite differences for estimating a 5 × 5 Jacobian with the structure shown above?

$J(e_1 + e_4) \approx \frac{c(x + h e_1 + h e_4) - c(x)}{h}, \quad J(e_2 + e_3) \approx \frac{c(x + h e_2 + h e_3) - c(x)}{h}, \quad J e_5 \approx \frac{c(x + h e_5) - c(x)}{h}$

Answer: 3 finite-differences!

Curtis, Powell and Reid (1974), Steihaug et al.

Philippe Toint (Namur)


April 2009

214 / 323

The use of problem structure for large-scale applications

Sparsity

The CPR algorithm for estimating sparse Jacobians


Algorithm 5.1: CPR algorithm

Build the column groups.
Place the columns in as few groups as possible such that two columns in the same group have their nonzero entries in different rows.

Estimate the finite differences. For each group:
1. Build a difference vector $h = \sum_{j \in \text{group}} h_j e_j$
2. Compute $v = c(x + h) - c(x)$

Reconstruct the Jacobian.
$J_{ij} \approx \frac{v_i}{h_j}$ for all $j$ such that $j \in$ group (row i has its group nonzero in column j only)

Philippe Toint (Namur)

April 2009

215 / 323
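A compact prototype of the whole CPR scheme, with a greedy grouping of the columns (a simple version of the colouring heuristics of the next slides) followed by one finite difference per group; the sparsity pattern is assumed to be given as a boolean matrix and c to be any smooth mapping.

import numpy as np

def cpr_jacobian(c, x, pattern, h=1e-6):
    """Estimate a sparse Jacobian by the Curtis-Powell-Reid scheme (sketch).

    pattern : boolean (m, n) array, True wherever J may be nonzero
    """
    m, n = pattern.shape

    # greedy grouping: two columns share a group only if their nonzero rows are disjoint
    groups, group_rows = [], []
    for j in range(n):
        rows = pattern[:, j]
        for grp, used in zip(groups, group_rows):
            if not np.any(rows & used):
                grp.append(j)
                used |= rows
                break
        else:
            groups.append([j])
            group_rows.append(rows.copy())

    # one difference per group, then scatter the entries back
    J = np.zeros((m, n))
    c0 = np.asarray(c(x))
    for grp in groups:
        d = np.zeros(n)
        d[grp] = h                              # perturb all columns of the group at once
        v = (np.asarray(c(x + d)) - c0) / h
        for j in grp:
            rows = pattern[:, j]                # these rows can only come from column j
            J[rows, j] = v[rows]
    return J

The number of evaluations of c is 1 + (number of groups); for the 5 × 5 example above this gives 3 differences, as on the slide.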

The use of problem structure for large-scale applications

Sparsity

A graph colouring interpretation

Consider the intersection graph for the columns
(two columns are adjacent iff they have a nonzero entry in a common row):

minimize the number of colours,
such that adjacent nodes have different colours
⇒ can build column groups using heuristic algorithms for graph colouring
Coleman and Moré (1983)
Philippe Toint (Namur)

April 2009

216 / 323

The use of problem structure for large-scale applications

Sparsity

Estimating sparse Hessians (1)


Question: How many finite differences for estimating an 8 × 8 symmetric
Hessian with the structure shown on the slide (8 × 8 sparsity pattern)?

Philippe Toint (Namur)

April 2009

217 / 323


The use of problem structure for large-scale applications

Sparsity

Estimating sparse Hessians (1)

Exploiting symmetry in CPR (⇒ a direct method)

Powell and T (1979), Coleman and Moré (1984)

Philippe Toint (Namur)

April 2009

217 / 323


The use of problem structure for large-scale applications

Sparsity

Estimating sparse Hessians (2)

Question: Can we do better?
(same 8 × 8 symmetric sparsity pattern as before)

Apply CPR on the lower triangular part of the Hessian.

But what about the conflicts with the upper triangular part?

Philippe Toint (Namur)

April 2009

218 / 323

The use of problem structure for large-scale applications

Sparsity

Estimating sparse Hessians (2)

⇒ A more efficient substitution method. . .

Powell and T (1979), Coleman and Moré (1984) for a graph interpretation
Philippe Toint (Namur)

April 2009

218 / 323


The use of problem structure for large-scale applications

Sparsity

Optimized version for PDE stencils

Example: the 5-points Laplacian operator in 2D


(non-symmetric and symmetric)

Philippe Toint (Namur)

April 2009

219 / 323


The use of problem structure for large-scale applications

Sparsity

Partial separability
A more geometric concept:

Griewank and T. (1982)

f (x) is partially separable iff


f(x) = Σ_{i=1}^p f_i(U_i x), where the matrices U_i are of low rank

if U_i = disjoint columns of the identity matrix ⇒ (totally) separable

common case: U_i = overlapping columns of the identity matrix ⇒

f(x) = Σ_{i=1}^p f_i(x_{S_i})

Vocabulary:
element functions, element variables, internal variables ui = Ui x
Philippe Toint (Namur)

April 2009

220 / 323

The use of problem structure for large-scale applications

Sparsity

Sources and examples of partially separable functions


Example 1:
f(x_1, . . . , x_5) = f_1(x_1, x_2) + f_2(x_2, x_3, x_4) + f_3(x_4, x_5)
Example 2:
f(x_1, . . . , x_5) = f_1(3x_1 + x_2) + f_2(2x_2 + x_3 − 2x_4, x_4 + 3x_5)
with internal variables u_1 = 3x_1 + x_2, u_2 = 2x_2 + x_3 − 2x_4, u_3 = x_4 + 3x_5
Sources:
(nearly) all discretized problems
most problems in econometric modelling,
. . . and a lot more because. . .

Philippe Toint (Namur)

April 2009

221 / 323

The use of problem structure for large-scale applications

Sparsity

Properties of partially separable functions


If f(x) has a sparse Hessian matrix and is sufficiently smooth,
then it is partially separable
(but not conversely, e.g. f(x_1, . . . , x_n) = Σ_{i=1}^n f_i(x_i) + f_{n+1}(x_1 + · · · + x_n))

If f(x) = Σ_{i=1}^p f_i(U_i x) = Σ_{i=1}^p f_i(u_i), then

∇_x f(x) = Σ_{i=1}^p U_i^T ∇ f_i(u_i)

∇_xx f(x) = Σ_{i=1}^p U_i^T ∇_xx f_i(u_i) U_i

(easy to compute, sparsity determined by the U_i)

Philippe Toint (Namur)

April 2009

222 / 323
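These assembly formulas translate directly into code. The sketch below is my own illustration (not from the course material): it evaluates a partially separable f, its gradient and its Hessian from element functions f_i and low-rank matrices U_i.

import numpy as np

def assemble(elements, x):
    """elements: list of (U_i, f_i) pairs, where U_i is (n_i, n) and
    f_i(u) returns (value, gradient, Hessian) in the internal variables u = U_i x."""
    n = x.size
    f, g, H = 0.0, np.zeros(n), np.zeros((n, n))
    for U, fi in elements:
        u = U @ x                       # internal variables of element i
        fi_val, fi_grad, fi_hess = fi(u)
        f += fi_val
        g += U.T @ fi_grad              # grad f = sum_i U_i^T grad f_i(u_i)
        H += U.T @ fi_hess @ U          # Hess f = sum_i U_i^T Hess f_i(u_i) U_i
    return f, g, H

# Example: f(x) = (x1 - x2)^2 + (x2 - x3)^2 on IR^3 (two elements, u_i = x_i - x_{i+1})
def sq(u):                              # element u -> u^2 with its derivatives
    return float(u[0] ** 2), np.array([2 * u[0]]), np.array([[2.0]])

elems = [(np.array([[1.0, -1.0, 0.0]]), sq), (np.array([[0.0, 1.0, -1.0]]), sq)]
print(assemble(elems, np.array([1.0, 2.0, 4.0])))   # f = 1 + 4 = 5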

The use of problem structure for large-scale applications

Sparsity

The three points Laplacian operator

On a regular geometric grid, the matrix tridiag(−1, 2, −1) decomposes as a sum of
overlapping 2 × 2 element blocks of the form

[  1  −1 ]
[ −1   1 ]

placed on consecutive index pairs (i, i+1), plus boundary contributions.

Note: sum of rank-one submatrices (u_i = x_{i+1} − x_i)!

Philippe Toint (Namur)

April 2009

223 / 323

The use of problem structure for large-scale applications

Sparsity

Using the partially separable structure

Very useful for:


quasi-Newton Hessian matrix = sum of elementwise quasi-Newton
low rank submatrices (partitioned updating),
elementwise models in DFO (number of function evaluations only
dependent on the maximum number of internal variables!),
optimally efficient finite-difference approximations,
(structured trust-regions),
expressing large-scale models.
LANCELOT based on an extension of this concept

Philippe Toint (Namur)

April 2009

224 / 323

The use of problem structure for large-scale applications

Sparsity

Exploitation of the computational tree

Idea: use the computational tree for f(x) for solving Newton's equations
use chain-rule at the top of the computational tree
multiplicative decompositions (and partially separable)
often available from the problem modelling itself
Substantial computational gains
unpublished (?) by T. Coleman (2008)

Philippe Toint (Namur)

April 2009

225 / 323

The use of problem structure for large-scale applications

Multilevel problems

5.3: Multilevel problems

Philippe Toint (Namur)

April 2009

226 / 323

The use of problem structure for large-scale applications

Multilevel problems

Multilevel Optimization: The Problem

min_{x ∈ IR^n} f(x)

f : IR^n → IR nonlinear, C^2 and bounded below

No convexity assumption
Results from the discretization of some infinite-dimensional problem,
on a relatively fine grid for instance (n large)

Iterative search of a first-order critical point x* (s.t. ∇f(x*) = 0)

Philippe Toint (Namur)

April 2009

227 / 323

The use of problem structure for large-scale applications

Multilevel problems

Hierarchy of problem descriptions


Assume now that a hierarchy of problem descriptions is available, linked by
known operators
Finest problem description
Restriction R

P Prolongation

Fine problem description


Restriction R

P Prolongation
...

Restriction R

P Prolongation

Coarse problem description


Restriction R

P Prolongation

Coarsest problem description


Philippe Toint (Namur)

April 2009

228 / 323

The use of problem structure for large-scale applications

Multilevel problems

Grid transfer operators


Restriction    R_i : IR^{n_i} → IR^{n_{i−1}}

Prolongation   P_i : IR^{n_{i−1}} → IR^{n_i}

with R_i = P_i^T

Philippe Toint (Namur)

April 2009

229 / 323
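As a concrete illustration (my own, for a 1D grid; the course does not fix a particular choice of operators), linear interpolation gives a prolongation and its transpose a restriction:

import numpy as np

def prolongation_1d(n_coarse):
    """Linear interpolation from a coarse 1D grid with n_coarse interior points
    to the fine grid with n_fine = 2 * n_coarse + 1 interior points."""
    n_fine = 2 * n_coarse + 1
    P = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        i = 2 * j + 1          # fine index of the j-th coarse point
        P[i, j] = 1.0          # coarse value copied
        P[i - 1, j] += 0.5     # neighbours obtained by interpolation
        P[i + 1, j] += 0.5
    return P

P = prolongation_1d(3)         # 3 coarse points -> 7 fine points
R = P.T                        # restriction taken as the transpose, as on the slide
                               # (some multigrid codes add a scaling factor here)
x_coarse = np.array([1.0, 2.0, 3.0])
print(P @ x_coarse)            # prolongated vector on the fine grid
print(R @ (P @ x_coarse))      # restricted back to the coarse grid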

The use of problem structure for large-scale applications

Multilevel problems

Sources for such problems

Parameter estimation in
discretized ODEs
discretized PDEs
Optimal control problems
Optimal surface design (shape optimization)
Data assimilation in weather forecast (different levels of physics in the
models)

Philippe Toint (Namur)

April 2009

230 / 323

The use of problem structure for large-scale applications

Multilevel problems

The minimum surface problem


min_v ∫_0^1 ∫_0^1 [ 1 + (∂_x v)^2 + (∂_y v)^2 ]^{1/2} dx dy

with the boundary conditions:

v = f(x) on y = 0, 0 ≤ x ≤ 1
v = 0    on x = 0, 0 ≤ y ≤ 1
v = f(x) on y = 1, 0 ≤ x ≤ 1
v = 0    on x = 1, 0 ≤ y ≤ 1

where f(x) = x(1 − x)

Discretization using a finite element basis

Philippe Toint (Namur)

April 2009

231 / 323

The use of problem structure for large-scale applications

Multilevel problems

The solution at different levels

(plots of the converged solutions on the successive grids, from the coarsest,
n = 3^2 = 9, through n = 7^2 = 49, 15^2 = 225, 31^2 = 961 and 63^2 = 3969,
to the finest, n = 127^2 = 16129)

Philippe Toint (Namur)
April 2009

232 / 323

The use of problem structure for large-scale applications

Multilevel problems

The main issue


Hierarchy of problem descriptions

globalization technique

&

Efficiency Robustness

Illustration within a trust-region framework

(Unconstrained case)

Philippe Toint (Namur)

April 2009

233 / 323

The use of problem structure for large-scale applications

Multilevel problems

Past and recent developments


Line-search
Fisher (1998), Frese-Bouman-Sauer (1999), Nash (2000)
Lewis-Nash (2000, 2002)
Oh-Milstein-Bouman-Webb (2003)
Wen-Goldfarb (2007, 2008)
Gratton-T (2007)

Trust-region
Gratton-Sartenaer-T (2006, 2008)
Gratton-Mouffe-T-Weber Mendonca (2009)
Gratton-Mouffe-Sartenaer-T-Tomanos (2009)
T-Tomanos-Weber Mendonca (2009)
Gross-Krause (2008)

Philippe Toint (Namur)

April 2009

234 / 323

The use of problem structure for large-scale applications

Multilevel problems

On the side of multigrid methods


Consider the linear system (discrete Poisson equation, for instance):

A x = b    (exact solution x)

and the residual equation

A e = r

where e = x − x̄ (error of the approximation x̄) and r = b − A x̄ (residual).

Expansion of e in Fourier modes shows high (oscillatory) and low (smooth)
frequency components:

(plot of two Fourier modes, k = 3 and k = 24)

Philippe Toint (Namur)

April 2009

235 / 323

The use of problem structure for large-scale applications

Multilevel problems

Relaxation methods
Basic iterative methods:
correct the i th component of xk in the order 1, 2, . . . , n
to annihilate the i th component of rk
Jacobi

[x_{k+1}]_i = (1 / a_ii) ( − Σ_{j=1, j≠i}^n a_ij [x_k]_j + [b]_i )

Gauss-Seidel

[x_{k+1}]_i = (1 / a_ii) ( − Σ_{j=1}^{i−1} a_ij [x_{k+1}]_j − Σ_{j=i+1}^n a_ij [x_k]_j + [b]_i )

Solve the equations of the linear system one by one

Philippe Toint (Namur)

April 2009

236 / 323
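A minimal sketch of the Gauss-Seidel sweep described by the formula above (illustrative only):

import numpy as np

def gauss_seidel(A, b, x, sweeps=10):
    """Perform `sweeps` Gauss-Seidel relaxation sweeps on A x = b, updating x in place."""
    n = len(b)
    for _ in range(sweeps):
        for i in range(n):
            # use already-updated components for j < i, old ones for j > i
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
    return x

# Example: 1D Laplacian tridiag(-1, 2, -1); the sweeps damp the oscillatory error quickly
n = 32
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.zeros(n)
x = np.random.rand(n)          # random initial error
gauss_seidel(A, b, x, sweeps=10)
print(np.linalg.norm(x))       # high-frequency components are strongly reduced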

The use of problem structure for large-scale applications

Multilevel problems

Smoothing effect
Very effective methods at smoothing, i.e., eliminating the
high-frequency (oscillatory) components of the error:

0.2

0.02

1.5

0.15

0.015

0.1

0.01

0.5

0.05

0.005

0
40

30
30

20
20
10

10
0

error of
initial guess

0
40
30

40
30

20
20

10

10
0

error after 10
GS iterations

30

40
30

20
20

10

10
0

error after 100


GS iterations

But they leave the low-frequency (smooth) components relatively


unchanged
Philippe Toint (Namur)

April 2009

237 / 323

The use of problem structure for large-scale applications

Multilevel problems

Multigrid in linear algebra

Assume now (two levels):

A fine grid (f) description      A^f e^f = r^f

A coarse grid (c) description    A^c e^c = r^c

Linked by transfer operators     A^c = R A^f P,  e^c = R e^f,  r^c = R r^f

Philippe Toint (Namur)

April 2009

238 / 323

The use of problem structure for large-scale applications

Multilevel problems

Coarse grid principle

Smooth error modes on a fine grid


look less smooth on a coarse grid
When relaxation begins to stall at the finer level:
Move to the coarser grid where the smooth error modes appear more
oscillatory
Apply a relaxation at the coarser level:
more efficient
substantially less expensive

Philippe Toint (Namur)

April 2009

239 / 323

The use of problem structure for large-scale applications

Multilevel problems

Two-grid correction scheme

Annihilate oscillatory error level by level:

smooth on the fine grid, restrict (R) the remaining smooth error to the coarse grid,
where it appears oscillatory, smooth there (or recur to a coarser level),
then prolongate (P) the correction back to the fine grid and smooth again.

Note: P and R are not orthogonal projectors!

A very efficient method for some linear systems
(when A(smooth modes) ≈ smooth modes)

Philippe Toint (Namur)

April 2009

240 / 323
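A compact sketch of one two-grid correction cycle; the smoother (e.g. the Gauss-Seidel sketch above) is passed in, and the coarse operator is built as on the slides. This is an illustration of the idea, not the course's code.

import numpy as np

def two_grid_cycle(A, b, x, P, smooth, pre=3, post=3):
    """One two-grid correction cycle for A x = b.

    P      : prolongation matrix; the restriction is taken as R = P.T (as on the slides)
    smooth : relaxation routine called as smooth(A, b, x, sweeps)
    """
    R = P.T
    x = smooth(A, b, x, pre)              # pre-smoothing kills the oscillatory error
    r = b - A @ x                         # fine-grid residual
    Ac = R @ A @ P                        # coarse operator A^c = R A^f P
    ec = np.linalg.solve(Ac, R @ r)       # solve the coarse residual equation (or recur)
    x = x + P @ ec                        # prolongate the coarse correction
    x = smooth(A, b, x, post)             # post-smoothing
    return x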

The use of problem structure for large-scale applications

Multilevel problems

Does it work?
Smoothing on fine grid only:
(error plots after 0, 10 and 100 iterations)

Two-grid correction scheme:
(error plots after 0, 10 and 100 iterations)

Philippe Toint (Namur)
April 2009

241 / 323

The use of problem structure for large-scale applications

Multilevel problems

V-cycle

(schematic of a V-cycle: smoothing at each level on the way down to the coarsest grid and on the way back up)
Philippe Toint (Namur)

April 2009

242 / 323

The use of problem structure for large-scale applications

Multilevel problems

W-cycle

(schematic of a W-cycle: as the V-cycle, but visiting the coarser levels more than once per cycle)

Philippe Toint (Namur)

April 2009

243 / 323

The use of problem structure for large-scale applications

Multilevel problems

Mesh Refinement

Solve the problem on the coarsest level


Good starting point for the next fine level

Do the same on each level


Good starting point for the finest level

Finally solve the problem on the finest level

Philippe Toint (Namur)

April 2009

244 / 323

The use of problem structure for large-scale applications

Multilevel problems

Full Multigrid Scheme

Combination of Mesh Refinement and V-cycles

Philippe Toint (Namur)

April 2009

245 / 323

The use of problem structure for large-scale applications

Multilevel problems

Return to optimization
Hierarchy of problem descriptions

Trust-region technique

&

Efficiency Robustness

Multilevel optimization method

Note: Multilevel Moré-Sorensen algorithm: (H_k + λI) s = −g_k


T-Tomanos-Weber Mendonca, 2009
Philippe Toint (Namur)

April 2009

246 / 323

The use of problem structure for large-scale applications

Multilevel problems

The framework
Assume that we have:
A hierarchy of problem descriptions of f :

{fi }ri=0

with

fr (x) = f (x)

Transfer operators, for i = 1, . . . , r :


Ri : IRni IRni1

(the restriction)

Pi : IRni1 IRni

(the prolongation)

Terminology: a particular i is referred to as a level


Philippe Toint (Namur)

April 2009

247 / 323

The use of problem structure for large-scale applications

Multilevel problems

The idea

min fr (x) = f (x)

xIRn

at xk :

minimize Taylor's model of f_r around x_k


in the trust region of radius k

or (whenever suitable and desirable)


at xk :

compute fr (xk ) (possibly Hk )

trial step sk

Restriction R

P Prolongation

use f_{r−1} to construct a coarse local model of f_r


and minimize it within the trust region of radius k

If more than two levels are available (r > 1), do this recursively
Philippe Toint (Namur)

April 2009

248 / 323

The use of problem structure for large-scale applications

Multilevel problems

Example of recursion with 5 levels (r = 4)


k

Level 4
R4
Level 3

P4
0

R3
Level 2

P3
0

R2
Level 1

P2
0

R1
Level 0

P1
0

Notation:

R2

P2
0

R1

P 1 R1
0

P1
0

i: level index (0 i r )

k: index of the current iteration within level i

Philippe Toint (Namur)

April 2009

249 / 323

The use of problem structure for large-scale applications

Multilevel problems

Construction of the coarse local models


If f_i ≠ 0 for i = 0, . . . , r − 1
Impose first-order coherence via a correction term:
g_low = R g_up
Impose second-order coherence (*) via two correction terms:
g_low = R g_up    and    H_low = R H_up P
(*) Not needed to derive first-order global convergence

If f_i = 0 for i = 0, . . . , r − 1
Galerkin model: restricted version of the quadratic model at the upper level
Philippe Toint (Namur)

April 2009

250 / 323
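One way to read the first-order coherence requirement: the coarse model is the coarse function plus a linear correction chosen so that its gradient at the starting point equals the restricted upper-level gradient. A hedged sketch of that construction (my own formulation of the idea, with assumed names):

import numpy as np

def coarse_model(f_low, grad_f_low, R, g_up, x_low0):
    """Build a first-order coherent coarse model h_low around x_low0.

    f_low, grad_f_low : the coarse-level description and its gradient
    R                 : restriction operator (matrix)
    g_up              : gradient of the upper-level model at the current point
    """
    g_restricted = R @ g_up
    v = g_restricted - grad_f_low(x_low0)        # linear correction term
    def h_low(x):
        return f_low(x) + v @ (x - x_low0)
    def grad_h_low(x):
        return grad_f_low(x) + v                 # equals R g_up at x = x_low0
    return h_low, grad_h_low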

The use of problem structure for large-scale applications

Multilevel problems

Preserving the trust-region constraint (1)

(figure: the lower-level iterates must remain inside the upper-level trust region)

∆_low^+ ≤ min( ∆_low , ∆_up − ||x_{low,k} − x_{low,0}|| )

Note: motivation to switch to the ∞-norm

Gratton, Sartenaer, T (2008)

Philippe Toint (Namur)

April 2009

251 / 323

The use of problem structure for large-scale applications

Multilevel problems

Preserving the trust-region constraint (2)

In infinity norm:

∆ ≤ min( ∆_low^+ , ∆_up − ||x_{low,k} − x_{low,0}||_∞ )

Gratton, Mouffe, T, Weber Mendonca (2008)

Philippe Toint (Namur)

April 2009

252 / 323

The use of problem structure for large-scale applications

Multilevel problems

Use the coarse model whenever suitable

When ||g_low|| := ||R g_up|| ≥ κ ||g_up||    (coarsening condition)

and

when ||g_low|| := ||R g_up|| > ε_low

and

when i > 0

Philippe Toint (Namur)

April 2009

253 / 323

The use of problem structure for large-scale applications

Multilevel problems

Use the coarse model whenever desirable

Taylor model (Taylor step)

Coarse model (recursive step)

smoothing

coarsening

&

Alternate for efficiency (multigrid)

Be as flexible as possible

Leave the choice even when the coarse model is suitable

Philippe Toint (Namur)

April 2009

254 / 323

The use of problem structure for large-scale applications

Multilevel problems

Recursive multilevel trust-region algorithm (RMTR)


At iteration k

(until convergence):

Choose either a Taylor or (if suitable) a coarse local model
(first-order coherent):
Taylor model: compute a Taylor step
Coarse local model: apply the algorithm recursively
Evaluate the change in the objective function
If achieved decrease ≥ fraction of predicted decrease, then
accept the trial point
possibly enlarge the trust region
else
keep the current point
shrink the trust region
Impose current trust region ⊆ upper-level trust region
Philippe Toint (Namur)

April 2009

255 / 323

The use of problem structure for large-scale applications

Multilevel problems

Global convergence
Based on the trust-region technology
Uses the sufficient decrease argument (imposed in Taylors iterations)
Plus the coarsening condition (||R g_up|| ≥ κ ||g_up||)
Main result

lim_{k→∞} ||g_{r,k}|| = 0

Gratton, Sartenaer, T (2008)


Philippe Toint (Namur)

April 2009

256 / 323

The use of problem structure for large-scale applications

In more details

Intermediate results
At iteration (i, k) we associate the set:

def

R(i, k) = {(j, `) | iteration (j, `) occurs within iteration (i, k)}

Level 4
R4
Level 3

P4
0

R3
Level 2

P3
0

R2
Level 1

P2
0

R1
Level 0
Philippe Toint (Namur)

P1
0

R2

P2
0

R1

P 1 R1
0

P1
0

April 2009

257 / 323

The use of problem structure for large-scale applications

In more details

Let

V(i, k) := { (j, ℓ) ∈ R(i, k) | the model decrease at iteration (j, ℓ) is at least a fixed multiple of ||g_{i,k}|| ∆_{j,ℓ} }    (sufficient decrease)

Then, at a non-critical point and if the trust region is small enough:

V(i, k) = R(i, k)

Back to classical trust-region arguments

Philippe Toint (Namur)

April 2009

258 / 323

The use of problem structure for large-scale applications

In more details

Premature termination
For a recursive iteration (i, k):
A minimization sequence at level i 1 initiated at iteration (i, k)
denotes all successive iterations at level i 1
until a return is made to level i
k

Level 4
R4
Level 3

R3
Level 2

P3
0

R2
Level 1

P2
0

R1
Level 0
Philippe Toint (Namur)

P1
0

R2

P2
0

R1

P 1 R1
0

P1
0

April 2009

259 / 323

The use of problem structure for large-scale applications

In more details

Properties of RMTR
Each minimization sequence contains at least one successful iteration
Premature termination in that case does not affect the convergence
results at the upper level
Which allows
Stop a minimization sequence after a preset number of successful
iterations
Use fixed lower-iterations patterns like the V or W cycles in multigrid
methods
Philippe Toint (Namur)

April 2009

260 / 323

The use of problem structure for large-scale applications

In more details

A practical RMTR algorithm: Taylor iterations

At the coarsest level


Solve using the exact Moré-Sorensen method
(small dimension)
At finer levels
Smooth using a smoothing technique from multigrid
(to reduce the high frequency residual/gradient components)

Philippe Toint (Namur)

April 2009

261 / 323

The use of problem structure for large-scale applications

In more details

SCM Smoothing

Adaptation of the Gauss-Seidel smoothing technique to optimization:


Sequential Coordinate Minimization (SCM smoothing)

Successive one-dimensional minimizations of the model


along the coordinate axes when positive curvature

Cost: 1 SCM smoothing cycle 1 matrix-vector product

Philippe Toint (Namur)

April 2009

262 / 323
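A sketch of one SCM smoothing cycle on a quadratic model m(s) = g^T s + ½ s^T H s (my own illustration; the trust-region and negative-curvature safeguards of the actual algorithm, discussed on the next slides, are omitted here):

import numpy as np

def scm_cycle(H, g, s):
    """One Sequential Coordinate Minimization cycle on m(s) = g.s + 0.5 s.H.s."""
    n = len(g)
    for i in range(n):
        grad_i = g[i] + H[i, :] @ s          # i-th component of the model gradient
        curv_i = H[i, i]                     # curvature along coordinate axis i
        if curv_i > 0:                       # minimize only along positive-curvature axes
            s[i] -= grad_i / curv_i          # exact one-dimensional minimizer along e_i
    return s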

The use of problem structure for large-scale applications

In more details

Three issues

How to impose sufficient decrease in the model ?

How to impose the trust-region constraint ?

What to do if a negative curvature is encountered ?

Philippe Toint (Namur)

April 2009

263 / 323

The use of problem structure for large-scale applications

In more details

Start the first SCM smoothing cycle


by minimizing along the largest gradient component
(enough to ensure sufficient decrease)
Perform (at most) p SCM smoothing cycles
while inside the trust region (reasonable cost)
Terminate
when an approximate minimizer is found (Stop)
when the trust-region boundary is passed (Stop at the boundary)
when a direction of negative curvature is encountered
(move to the boundary and Stop)
Philippe Toint (Namur)

April 2009

264 / 323

The use of problem structure for large-scale applications

In more details

Convergence to weak minimizers


SCM smoothing limits its exploration of the model's curvature to the
coordinate axes ⇒ only guarantees asymptotic positive curvature:
along the coordinate axes at the finest level (i = r)
along the prolongation of the coordinate axes at levels
i = 1, . . . , r − 1
along the prolongation of the coarsest subspace (i = 0)

Weak minimizers
Gratton, Sartenaer, T (2006)
Philippe Toint (Namur)

April 2009

265 / 323

The use of problem structure for large-scale applications

Numerical results

Some numerical flavor


Gratton, Mouffe, Sartenaer, T, Tomanos (2009)

All on Finest (AF)


Standard Newton trust-region algorithm (TCG)
Applied at the finest level
Multilevel on Finest (MF)
Algorithm RMTR
Applied at the finest level
Mesh Refinement (MR)
Standard Newton trust-region algorithm (TCG)
Applied successively from coarsest to finest level()
Full Multilevel (FM)
Algorithm RMTR
Applied successively from coarsest to finest level()
()

Starting point at level i + 1 obtained by prolongating the solution at level i


Philippe Toint (Namur)

April 2009

266 / 323

The use of problem structure for large-scale applications

Numerical results

Test problem characteristics


Problem name   nr          r   Type             Description
DNT            511         8   1-D, quadratic   Dirichlet-to-Neumann transfer problem
P2D            1.046.529   9   2-D, quadratic   Poisson model problem
P3D            250.047     5   3-D, quadratic   Poisson model problem
DEPT           1.046.529   9   2-D, quadratic   Elastic-plastic torsion problem
DPJB*          1.046.529   9   2-D, quadratic   Journal bearing problem
DODC           65.025      7   2-D, convex      Optimal design problem
MINS-SB        1.046.529   9   2-D, convex      Minimum surface problem
MINS-OB        65.025      7   2-D, convex      Minimum surface problem
MINS-DMSA      65.025      7   2-D, convex      Minimum surface problem
IGNISC         65.025      7   2-D, convex      Combustion problem
DSSC           1.046.529   9   2-D, convex      Combustion problem
BRATU          1.046.529   9   2-D, convex      Combustion problem
MINS-BC*       65.025      7   2-D, convex      Minimum surface problem
MEMBR*         393.984     9   2-D, convex      Membrane problem
NCCS           103.050     7   2-D, nonconvex   Optimal control problem
NCCO           103.050     7   2-D, nonconvex   Optimal control problem
MOREBV         1.046.529   9   2-D, nonconvex   Boundary value problem

*: with bound constraints


Philippe Toint (Namur)

April 2009

267 / 323

The use of problem structure for large-scale applications

Numerical results

Performance profiles (CPU time)


(performance profile plot comparing FM, MR, MF and AF in terms of CPU time)

Philippe Toint (Namur)

April 2009

268 / 323

The use of problem structure for large-scale applications

Numerical results

Zoom on efficiency (CPU time)

(zoomed performance profile comparing FM, MR, MF and AF)

Philippe Toint (Namur)
April 2009

269 / 323

The use of problem structure for large-scale applications

Numerical results

CPU times
Problem name
DNT
P2D
P3D
DEPT
DPJB
DODC
MINS-SB
MINS-OB
MINS-DMSA
IGNISC
DSSC
BRATU
MINS-BC
MEMBR
NCCS
NCCO
MOREBV

Best
Philippe Toint (Namur)

AF

5.2
1122.8
626.1
1364.4
3600.0
894.8
3600.0
1445.6
1196.8
2330.4
3183.8
2314.1
2706.4
1082.0
3600.0
3600.0
3600.0

MF
24.4

MR

72.8

569.7

47.5

18.3

4.6

69.5

95.4

1390.0

247.7

58.6
70.4
73.4
398.3

184.2
3600.0
116.7
289.6
488.2

1051.6
236.8

122.3
91.7

3600.0

161.8

524.6

335.2
3600.0
3600.0

292.4
279.5
3589.6

704.9

3600.0

FM
6.7

26.0
28.8
8.6
83.6
36
153.9
27.5
18.2
398.2
12.1
10.1
140.0
154.0
331.9
224.2
41.7

Second best
April 2009

270 / 323

The use of problem structure for large-scale applications

Numerical results

A glimpse at the solution process

Philippe Toint (Namur)

April 2009

271 / 323

The use of problem structure for large-scale applications

Bibliography

Bibliography for lesson 5 (1)


1 T. F. Coleman and J. J. Moré,
Estimation of sparse Jacobian matrices and graph coloring problems,
SIAM Journal on Numerical Analysis, 20:187-209, 1983.
2 T. F. Coleman and J. J. Moré,
Estimation of sparse Hessian matrices and graph coloring problems,
Mathematical Programming, 28:243-270, 1984.
3 A. R. Conn, N. I. M. Gould and Ph. L. Toint,
LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A),
Springer Verlag, Springer Series in Computational Mathematics 17, Heidelberg, 1992.
4 A. Curtis, M. J. D. Powell and J. Reid,
On The Estimation of Sparse Jacobian Matrices,
IMA Journal, 13:117-119, 1974.
5 M. Fisher,
Minimization Algorithms for Variational Data Assimilation,
in Recent Developments in Numerical Methods for Atmospheric Modelling, European Center for Medium-Range
Weather Forecasts, Reading, UK, pp. 364-385, 1998.
6 T. Frese, Ch. Bouman and K. Sauer,
Multiscale Bayesian Methods for Discrete Tomography,
in Discrete Tomography: Foundations, Algorithms and Applications (G. Herman and A. Kuba, eds.), Birkhauser,
Boston, pp. 237-261, 1999.
7 D. Goldfarb and Ph. L. Toint,
Optimal Estimation of Jacobian and Hessian Matrices That Arise in Finite Difference Calculations,
Mathematics of Computation, 43(167):69-88, 1984.
8 S. Gratton, M. Mouffe, Ph. L. Toint and M. Weber-Mendonca,
A recursive trust-region method in infinity norm for bound-constrained nonlinear optimization,
IMA Journal of Numerical Analysis, 28(4):827-861, 2008.
9 S. Gratton, M. Mouffe, A. Sartenaer, Ph. L. Toint and D. Tomanos,
Numerical Experience with a Recursive Trust-Region Method for Multilevel Nonlinear Optimization,
Optimization Methods and Software, to appear, 2009.
Philippe Toint (Namur)

April 2009

272 / 323

The use of problem structure for large-scale applications

Bibliography

Bibliography for lesson 5 (2)


10 S. Gratton, A. Sartenaer and Ph. L. Toint,
Recursive Trust-Region Methods for Multiscale Nonlinear Optimization, SIAM Journal on Optimization,
19(1):414-444, 2008.
11 S. Gratton, A. Sartenaer and Ph. L. Toint,
Second-order convergence properties of trust-region methods using incomplete curvature information, with an
application to multigrid optimization,
Journal of Computational and Applied Mathematics, 24(6):676-692, 2006.
12 S. Gratton and Ph. L. Toint,
Multi-Secant Equations, Approximate Invariant Subspaces and Multigrid Optimization,
FUNDP, Namur, Report 07/11, 2007.
13 A. Griewank and Ph. L. Toint,
On the unconstrained optimization of partially separable functions,
in Nonlinear Optimization 1981 (M. J. D. Powell, ed.), Academic Press, pp. 302-312, 1982.
14 S. G. Nash,
A Multigrid Approach to Discretized Optimization Problems,
Optimization Methods and Software, 14:99-116, 2000.
15 M. Lewis and S. G. Nash,
Model problems for the multigrid optimization of systems governed by differential equations,
SIAM Journal on Scientific Computing, 26(6):1811-1837, 2005.
16 S. Oh, A. Milstein, Ch. Bouman and K. Webb,
Multigrid algorithms for optimization and inverse problems,
in Computational Imaging(Ch. Bouman and R. Stevenson, eds.), DDM, Proceedings of the SPIE, 5016:59-70, 2003.
17 M. J. D. Powell and Ph. L. Toint,
On The Estimation of Sparse Hessian Matrices,
SIAM Journal on Numerical Analysis, 16(6):1060-1074, 1979.
18 Ph. L. Toint, D. Tomanos and M. Weber-Mendonca,
A multilevel algorithm for solving the trust-region subproblem,
Optimization Methods and Software, 24(2):299-311, 2009.
19 Ch. Gross and R. Krause,
On The Convergence of Recursive Trust-Region Methods for Multiscale Nonlinear Optimization and Applications to
Nonlinear Mechanics,
University of Bonn, Germany, 2008.
Philippe Toint (Namur)

April 2009

273 / 323

The use of problem structure for large-scale applications

Bibliography

Bibliography for lesson 5 (3)

20 Z. Wen and D. Goldfarb,


A Linesearch Multigrid Method for Large-Scale Convex Optimization,
Department of Industrial Engineering and Operations Research, Columbia University, New York, July 2007.

Philippe Toint (Namur)

April 2009

274 / 323

Regularization methods and nonlinear step control

Lesson 6:
Cubic and quadratic
regularization methods:
a path towards
nonlinear step control
Philippe Toint (Namur)

April 2009

275 / 323

Regularization methods and nonlinear step control

Outline

Regularization for unconstrained problems
1 cubic
2 quadratic

Nonlinear step control

Cubic regularization for constrained problems

Conclusions

Bibliography

Philippe Toint (Namur)

April 2009

276 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Regularization techniques
for unconstrained Problems

Philippe Toint (Namur)

April 2009

277 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The problem

We return to the unconstrained nonlinear programming problem:

minimize f(x)
for x ∈ IR^n and f : IR^n → IR smooth.

Important special case: the nonlinear least-squares problem
minimize f(x) = ½ ||F(x)||^2
for x ∈ IR^n and F : IR^n → IR^m smooth.

Philippe Toint (Namur)

April 2009

278 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Unconstrained optimization: a mature area?

minimize_{x ∈ IR^n} f(x)   where f ∈ C^1 (maybe C^2)

Currently two main competing (but similar) methodologies

Linesearch methods
compute a descent direction s_k from x_k
set x_{k+1} = x_k + α_k s_k to improve f

Trust-region methods
compute a step s_k from x_k to improve a model m_k of f
within the trust region ||s_k|| ≤ ∆_k
set x_{k+1} = x_k + s_k if m_k and f agree at x_k + s_k
otherwise set x_{k+1} = x_k and reduce the radius ∆_k

Philippe Toint (Namur)

April 2009

279 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

A useful theoretical observation


Consider trust-region method where
model = true objective function
Then
model and objective always agree
trust-region radius goes to infinity
a linesearch method
Nice consequence:

A unique convergence theory!


(Shultz/Schnabel/Byrd, 1985; T., 1988)

Philippe Toint (Namur)

April 2009

280 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The keys to convergence theory for trust regions


The Cauchy condition:

m_k(x_k) − m_k(x_k + s_k) ≥ κ_TR ||g_k|| min( ||g_k|| / (1 + ||H_k||), ∆_k )

The bound on the stepsize:

||s_k|| ≤ ∆_k

And we derive:
Global convergence to first/second-order critical points
Is there anything more to say?
Philippe Toint (Namur)

April 2009

281 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Is there anything more to say?


Observe the following: if
f has gradient g and globally Lipschitz continuous Hessian H with constant 2L,
then Taylor, Cauchy-Schwarz and Lipschitz imply

f(x + s) = f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩ + ∫_0^1 (1 − τ) ⟨s, [H(x + τs) − H(x)] s⟩ dτ
         ≤ f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩ + ⅓ L ||s||_2^3   =:  m(s)

⇒ reducing m from s = 0 improves f since m(0) = f(x).

Philippe Toint (Namur)

April 2009

282 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The cubic regularization


Change from

min_s  f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩   s.t. ||s|| ≤ ∆

to

min_s  f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩ + ⅓ σ ||s||^3

σ is the (adaptive) regularization parameter

(ideas from Griewank, Weiser/Deuflhard/Erdmann, Nesterov/Polyak, Cartis/Gould/T)

Philippe Toint (Namur)

April 2009

283 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Cubic regularization highlights


f(x + s) ≤ m(s) ≡ f(x) + s^T g(x) + ½ s^T H(x)s + ⅓ L ||s||_2^3

Nesterov and Polyak minimize m globally


N.B. m may be non-convex!
efficient scheme to do so if H has sparse factors

global (ultimately rapid) convergence to a 2nd-order critical point of f


better worst-case function-evaluation complexity than previously
known
Obvious questions:
can we avoid the global Lipschitz requirement?
can we approximately minimize m and retain good worst-case
function-evaluation complexity?
does this work well in practice?
Philippe Toint (Namur)

April 2009

284 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Cubic overestimation
Assume
f C2
f , g and H at xk are fk , gk and Hk
symmetric approximation Bk to Hk
Bk and Hk bounded at points of interest
Use
cubic overestimating model at xk
m_k(s) ≡ f_k + s^T g_k + ½ s^T B_k s + ⅓ σ_k ||s||_2^3
σ_k is the iteration-dependent regularisation weight
easily generalized for regularisation in the M_k-norm ||s||_{M_k} = (s^T M_k s)^{1/2}
where Mk is uniformly positive definite

Philippe Toint (Namur)

April 2009

285 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Adaptive Regularization with Cubic (ARC)


Algorithm 6.1: The ARC Algorithm
Step 0: Initialization: x_0 and σ_0 > 0 given. Set k = 0.
Step 1: Step computation: compute s_k for which m_k(s_k) ≤ m_k(s_k^C),
        where the Cauchy point is s_k^C = −α_k^C g_k with α_k^C = arg min_{α ∈ IR_+} m_k(−α g_k).
Step 2: Step acceptance: compute ρ_k = [ f(x_k) − f(x_k + s_k) ] / [ f(x_k) − m_k(s_k) ]
        and set x_{k+1} = x_k + s_k if ρ_k > 0.1, x_{k+1} = x_k otherwise.
Step 3: Update the regularization parameter:
        σ_{k+1} ∈ (0, σ_k]            (e.g. = ½ σ_k)   if ρ_k > 0.9         (very successful)
        σ_{k+1} ∈ [σ_k, γ_1 σ_k]      (e.g. = σ_k)     if 0.1 ≤ ρ_k ≤ 0.9   (successful)
        σ_{k+1} ∈ [γ_1 σ_k, γ_2 σ_k]  (e.g. = 2 σ_k)   if ρ_k < 0.1         (unsuccessful)
Philippe Toint (Namur)

April 2009

286 / 323
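A minimal runnable sketch of this loop, using only the Cauchy point as the step (so the requirement m_k(s_k) ≤ m_k(s_k^C) holds trivially); the concrete σ updates follow the slide, everything else is my own illustration.

import numpy as np

def cauchy_step(g, B, sigma):
    """Exact minimizer of m(-alpha g) over alpha > 0 for the cubic model."""
    gnorm = np.linalg.norm(g)
    a, b, c = sigma * gnorm ** 3, g @ B @ g, gnorm ** 2
    alpha = (-b + np.sqrt(b ** 2 + 4 * a * c)) / (2 * a)   # positive root of m'(alpha) = 0
    return -alpha * g

def arc(f, grad, hess, x, sigma=1.0, tol=1e-6, max_iter=500):
    """Adaptive Regularization with Cubics (sketch of Algorithm 6.1, Cauchy steps only)."""
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        s = cauchy_step(g, B, sigma)               # any s with m(s) <= m(s^C) would do
        model_decrease = -(g @ s + 0.5 * s @ B @ s
                           + sigma / 3.0 * np.linalg.norm(s) ** 3)
        rho = (f(x) - f(x + s)) / max(model_decrease, 1e-16)
        if rho > 0.1:                              # accept the trial point
            x = x + s
        sigma = 0.5 * sigma if rho > 0.9 else (2.0 * sigma if rho < 0.1 else sigma)
    return x

# tiny usage example
f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
hess = lambda x: np.diag([2.0, 20.0])
print(arc(f, grad, hess, np.zeros(2)))             # should approach (1, -2)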

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Local convergence theory for cubic regularization (1)


The Cauchy condition:

m_k(x_k) − m_k(x_k + s_k) ≥ κ_CR ||g_k|| min( ||g_k|| / (1 + ||H_k||), (||g_k|| / σ_k)^{1/2} )

The bound on the stepsize:

||s_k|| ≤ 3 max( ||H_k|| / σ_k, (||g_k|| / σ_k)^{1/2} )

(Cartis/Gould/T)

Philippe Toint (Namur)

April 2009

287 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Local convergence theory for cubic regularization (2)


And therefore. . .

lim_{k→∞} ||g_k|| = 0        (first-order global convergence)

Under stronger assumptions one can show that,
if s_k minimizes m_k over a subspace with orthogonal basis Q_k,

lim Q_k^T H_k Q_k ⪰ 0        (second-order global convergence)


Philippe Toint (Namur)

April 2009

288 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Fast convergence
For fast asymptotic convergence ⇒ need to improve on the Cauchy point:
minimize over Krylov subspaces
g stopping-rule: ||∇_s m_k(s_k)|| ≤ min(1, ||g_k||^{1/2}) ||g_k||
s stopping-rule: ||∇_s m_k(s_k)|| ≤ min(1, ||s_k||) ||g_k||
If B_k satisfies the Dennis-Moré condition
||(B_k − H_k) s_k|| / ||s_k|| → 0 whenever ||g_k|| → 0
and x_k → x* with positive definite H(x*)
⇒ Q-superlinear convergence of x_k under the g- and s-rules
If additionally H(x) is locally Lipschitz around x* and ||(B_k − H_k) s_k|| = O(||s_k||^2)
⇒ Q-quadratic convergence of x_k under the s-rule

Philippe Toint (Namur)

April 2009

289 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Function-evaluation complexity
How many function evaluations (iterations) are needed to ensure that ||g_k|| ≤ ε?

So long as, for very successful iterations, σ_{k+1} ≥ γ_3 σ_k for some γ_3 < 1,
⇒ the basic ARC algorithm requires at most
⌈ κ_C / ε^2 ⌉ function evaluations
for some κ_C independent of ε        (c.f. steepest descent)

If H is globally Lipschitz, the s-rule is applied and additionally s_k is
the global (line) minimizer of m_k(α s_k) as a function of α,
⇒ the ARC algorithm requires at most
⌈ κ_S / ε^{3/2} ⌉ function evaluations
for some κ_S independent of ε        (c.f. Nesterov & Polyak)
Philippe Toint (Namur)


April 2009

290 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Minimizing the model


m(s) ≡ f + s^T g + ½ s^T B s + ⅓ σ ||s||_2^3

Derivatives (with λ = σ ||s||_2):

∇_s m(s) = g + B s + λ s

∇_ss m(s) = B + λ I + λ (s/||s||)(s/||s||)^T

Optimality: any global minimizer s* of m satisfies

(B + λ* I) s* = −g with λ* = σ ||s*||_2
and B + λ* I positive semi-definite
Philippe Toint (Namur)

April 2009

291 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The (adapted) secular equation


Require

(B + λI) s = −g    and    λ = σ ||s||_2

Define s(λ) by
(B + λI) s(λ) = −g
and find the scalar λ as the root of a secular equation, e.g.

||s(λ)||_2 − λ/σ = 0    or    1/||s(λ)||_2 − σ/λ = 0

(or another equivalent rootfinding reformulation)

values and derivatives of s(λ) satisfy linear systems with the symmetric
positive definite matrix B + λI
⇒ need to be able to factorize B + λI

Philippe Toint (Namur)

April 2009

292 / 323
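A sketch of one convenient way to solve such a secular equation: Newton's method on φ(λ) = 1/||s(λ)|| − σ/λ, factorizing B + λI at each trial λ. This is an illustration under the assumption that B + λI stays positive definite along the iteration; the safeguards of a production solver are omitted.

import numpy as np

def cubic_subproblem(B, g, sigma, lam=1.0, tol=1e-10, max_iter=50):
    """Solve (B + lam I) s = -g with lam = sigma * ||s|| by Newton's method on
    phi(lam) = 1/||s(lam)|| - sigma/lam."""
    n = len(g)
    for _ in range(max_iter):
        L = np.linalg.cholesky(B + lam * np.eye(n))   # factorize B + lam I
        s = np.linalg.solve(B + lam * np.eye(n), -g)  # s(lam)
        ns = np.linalg.norm(s)
        phi = 1.0 / ns - sigma / lam
        if abs(phi) <= tol:
            break
        # d||s||/dlam = - s^T (B + lam I)^{-1} s / ||s||, computed via the factor L
        w = np.linalg.solve(L, s)                     # L w = s
        dns = -(w @ w) / ns
        dphi = -dns / ns ** 2 + sigma / lam ** 2
        lam -= phi / dphi                             # Newton step on phi
        lam = max(lam, 1e-12)                         # keep lam positive
    return s, lam

# small positive definite example
B = np.diag([1.0, 10.0])
g = np.array([1.0, 1.0])
print(cubic_subproblem(B, g, sigma=1.0))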

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Plots of secular functions against λ

Example: g = (0.25, 1)^T, H = diag(1, 1) and σ = 2
(plots of the alternative secular functions against λ)

Philippe Toint (Namur)

April 2009

293 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Large problems approximate solutions


Seek instead the global minimizer of m(s) in a j-dimensional (j ≪ n) subspace S ⊆ IR^n
g ∈ S ⇒ ARC algorithm globally convergent
Q orthogonal basis for S

s = Q u where u = arg min_{u ∈ IR^j} f + u^T (Q^T g) + ½ u^T (Q^T B Q) u + ⅓ σ ||u||_2^3

⇒ use the secular equation to find u
if S is the Krylov space generated by {B^i g}_{i=0}^{j−1}
⇒ Q^T B Q = T, tridiagonal
⇒ can factor T + λI to solve the secular equation even if j is large
using the g- or s-stopping rule ⇒ fast asymptotic convergence for ARC
using the s-stopping rule ⇒ good function-evaluation complexity for ARC
Philippe Toint (Namur)

April 2009

294 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The main features of adaptive cubic regularization

And the result is. . .


longer steps on ill-conditioned problems
similar (very satisfactory) convergence analysis
best function-evaluation complexity for nonconvex problems
excellent performance and reliability

Philippe Toint (Namur)

April 2009

295 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Numerical experience small problems using Matlab


Performance Profile: iteration count, 131 CUTEr problems
(y-axis: fraction of problems for which a method is within a factor of the best;
curves: ACO with the g stopping rule (3 failures), ACO with the s stopping rule (3 failures),
trust-region (8 failures))

Philippe Toint (Namur)

April 2009

296 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The quadratic regularization for NLS (ARQ)


Consider the Gauss-Newton method for nonlinear least-squares problems.
Change from

min_s  ½ ||c(x)||^2 + ⟨s, J(x)^T c(x)⟩ + ½ ⟨s, J(x)^T J(x) s⟩   s.t. ||s|| ≤ ∆

to

min_s  ||c(x) + J(x) s|| + ½ σ ||s||^2

σ is the (adaptive) regularization parameter

(idea by Nesterov)
Philippe Toint (Namur)

April 2009

297 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Quadratic regularization: reformulation


Note that

min_s  ||c(x) + J(x) s|| + ½ σ ||s||^2
⇔
min_{μ,s}  μ + ½ σ ||s||^2   such that   ||c(x) + J(x) s||^2 = μ^2

an exact penalty function for the problem of minimizing ||s|| subject to c(x) + J(x) s = 0.
Iterative techniques. . . as for the cubic case (Cartis, Gould, T.):
solve the problem in nested Krylov subspaces
Lanczos factorization of tridiagonal matrices
different scalar secular equation (solution by Newton's method)
Philippe Toint (Namur)

April 2009

298 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

The keys to convergence theory for quadratic regularization

The Cauchy condition:

m(x_k) − m(x_k + s_k) ≥ κ_QR (||J_k^T c_k|| / ||c_k||) min( ||J_k^T c_k|| / (1 + ||J_k^T J_k||), ||J_k^T c_k|| / (σ_k ||c_k||) )

The bound on the stepsize:

||s_k|| ≤ ||J_k^T c_k|| / (2 σ_k ||c_k||)

Philippe Toint (Namur)

April 2009

299 / 323

Regularization methods and nonlinear step control

Regularization methods for unconstrained problems

Convergence theory for the quadratic regularization


Convergence results:
Global convergence to first-order critical points

Quadratic convergence to roots


Valid for
general values of m and n,
exact/approximate subproblem solution
(Bellavia/Cartis/Gould/Morini/T.)

Philippe Toint (Namur)

April 2009

300 / 323

Regularization methods and nonlinear step control

Nonlinear stepsize control

6.2: A unifying concept:


nonlinear stepsize control

Philippe Toint (Namur)

April 2009

301 / 323

Regularization methods and nonlinear step control

Nonlinear stepsize control

Towards a unified global convergence theory

Objectives:
recover a unified global convergence theory
possibly open the door for new algorithms
Idea:
cast all three methods into a generalized TR framework
allow this TR to be updated nonlinearly

Philippe Toint (Namur)

April 2009

302 / 323

Regularization methods and nonlinear step control

Nonlinear stepsize control

Towards a unified global convergence theory (2)


Given
three continuous first-order criticality measures of x,
an adaptive stepsize parameter,
define a generalized radius ∆(·, ·) such that
∆(·, ·) is C^1, strictly increasing and concave in its first argument,
∆(0, ·) = 0,
∆(·, ·) is non-increasing in its second argument,
...

Philippe Toint (Namur)

April 2009

303 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

6.3: Cubic regularization


for constrained problems

Philippe Toint (Namur)

April 2009

304 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

The constrained case


Can we apply regularization to the constrained case?
Consider the constrained nonlinear programming problem:
minimize f (x)
x F
for x IRn and f : IRn IR smooth, and where
F is convex.
Main ideas:
exploit (cheap) projections on convex sets
define using the generalized Cauchy point idea
prove global convergence + function-evaluation complexity
Philippe Toint (Namur)

April 2009

305 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Constrained step computation (1)

min_s  f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩ + ⅓ σ ||s||^3    subject to   x + s ∈ F

σ is the (adaptive) regularization parameter

Criticality measure: (as before)

χ(x) := | min_{x+d ∈ F, ||d|| ≤ 1} ⟨∇_x f(x), d⟩ |

Philippe Toint (Namur)

April 2009

306 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

The generalized Cauchy point for ARC


Cauchy step: Goldstein-like piecewise linear search on m_k along the
gradient path projected onto F
Find

x_k^GC = P_F[x_k − t_k^GC g_k] = x_k + s_k^GC    (t_k^GC > 0)

such that

m_k(x_k^GC) ≤ f(x_k) + κ_ubs ⟨g_k, s_k^GC⟩        (below linear approximation)

and either

m_k(x_k^GC) ≥ f(x_k) + κ_lbs ⟨g_k, s_k^GC⟩        (above linear approximation)

or

||P_{T(x_k^GC)}[g_k]|| ≤ κ_epp |⟨g_k, s_k^GC⟩|    (close to the path's end)

no trust-region condition!
Philippe Toint (Namur)

April 2009

307 / 323
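The search itself is a simple bracketing loop on the step length t. A hedged sketch follows: the projection operator, the model and the κ constants are assumed inputs, and the tangent-cone test of the last condition is not implemented, only the Goldstein-like bracketing.

import numpy as np

def generalized_cauchy_point(x, g, model, project, kappa_ubs=0.1, kappa_lbs=0.9,
                             t0=1.0, max_expand=30):
    """Goldstein-like piecewise search for x_GC = P_F[x - t g] on a cubic model.
    model(s) returns m_k(x_k + s); project(y) returns P_F[y]."""
    f0 = model(np.zeros_like(x))                  # m_k(x_k) = f(x_k)
    t, t_lo, t_hi = t0, 0.0, np.inf
    for _ in range(max_expand):
        s = project(x - t * g) - x                # projected-gradient step
        m = model(s)
        slope = g @ s                             # (negative) linear decrease predicted along s
        if m > f0 + kappa_ubs * slope:            # not below the linear approximation: shrink t
            t_hi = t
        elif m < f0 + kappa_lbs * slope:          # too far below it: the path can be followed further
            t_lo = t
        else:
            break                                 # both Goldstein-like conditions hold
        t = 2.0 * t if t_hi == np.inf else 0.5 * (t_lo + t_hi)
    return x + s, t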

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Searching for the ARC-GCP


(figure: the piecewise linear search for the generalized Cauchy point of a
three-dimensional cubic-regularized model along the projected gradient path,
with the constraint s ≤ 1.5)

Philippe Toint (Namur)

April 2009

308 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Remember the same for a quadratic model?


(figure: the analogous piecewise search for a quadratic model within a trust region,
with s ≤ 1.5 and radius 2.8)

Philippe Toint (Namur)

April 2009

309 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

A constrained regularized algorithm


Algorithm 6.2: ARC for Convex Constraints (COCARC)
Step 0: Initialization. x_0 ∈ F, σ_0 given. Compute f(x_0), set k = 0.
Step 1: Generalized Cauchy point. If x_k is not critical, find the
generalized Cauchy point x_k^GC by piecewise linear search on the
regularized cubic model.
Step 2: Step calculation. Compute s_k and x_k^+ = x_k + s_k ∈ F such
that m_k(x_k^+) ≤ m_k(x_k^GC).
Step 3: Acceptance of the trial point. Compute f(x_k^+) and ρ_k.
If ρ_k ≥ η_1, then x_{k+1} = x_k + s_k; otherwise x_{k+1} = x_k.
Step 4: Regularisation parameter update. Set
σ_{k+1} ∈ (0, σ_k]            if ρ_k ≥ η_2,
σ_{k+1} ∈ [σ_k, γ_1 σ_k]      if ρ_k ∈ [η_1, η_2),
σ_{k+1} ∈ [γ_1 σ_k, γ_2 σ_k]  if ρ_k < η_1.
Philippe Toint (Namur)

April 2009

310 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Local convergence theory for COCARC


The Cauchy condition:

m_k(x_k) − m_k(x_k + s_k) ≥ κ_CR χ_k min( χ_k / (1 + ||H_k||), (χ_k / σ_k)^{1/2}, 1 )

The bound on the stepsize:

||s_k|| ≤ 3 max( ||H_k|| / σ_k, (χ_k / σ_k)^{1/2}, (χ_k / σ_k)^{1/3} )

And therefore. . .

lim_{k→∞} χ_k = 0

(Cartis/Gould/T)
Philippe Toint (Namur)

April 2009

311 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Function-Evaluation Complexity for COCARC (1)


But what about function-evaluation complexity?
If, for very successful iterations, σ_{k+1} ≥ γ_3 σ_k for some γ_3 < 1,
the COCARC algorithm requires at most
⌈ κ_C / ε^2 ⌉ function evaluations
(for some κ_C independent of ε) to achieve χ_k ≤ ε
c.f. steepest descent
Do the nicer bounds for unconstrained optimization extend to the constrained case?
Philippe Toint (Namur)

April 2009

312 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Function-evaluation complexity for COCARC (2)

As for unconstrained, impose a termination rule on the subproblem solution:
Do not terminate solving min_{x_k + s ∈ F} m_k(x_k + s) before

χ_k^m(x_k^+) ≤ min( κ_stop, ||s_k|| ) χ_k

where

χ_k^m(x) := | min_{x+d ∈ F, ||d|| ≤ 1} ⟨∇_x m_k(x), d⟩ |

c.f. the s-rule for unconstrained

Note: OK at local constrained model minimizers

Philippe Toint (Namur)

April 2009

313 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Walking through the pass...


A "beyond the pass" constrained problem:
(contour plot of a nonconvex cubic-regularized model m(x, y) over the feasible region,
showing x_k, the projected gradient direction and the constrained minimizer x^+)

Philippe Toint (Namur)
April 2009

314 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Walking through the pass...with a sherpa


A piecewise descent path from x_k to x_k^+ on the same model
(figure: the path passes through the intermediate points x_{k,a} and x_{k,c})

Philippe Toint (Namur)
April 2009

315 / 323

Regularization methods and nonlinear step control

Regularization techniques for constrained problems

Function-Evaluation Complexity for COCARC (2)


Assume also
x_k → x_k^+ in a bounded number of feasible descent substeps
||H_k − ∇_xx f(x_k)|| ≤ κ ||s_k||^2
∇_xx f(·) is globally Lipschitz continuous
{x_k} bounded

The COCARC algorithm requires at most
⌈ κ_C / ε^{3/2} ⌉ function evaluations
(for some κ_C independent of ε) to achieve χ_k ≤ ε

Caveat: cost of solving the subproblem
c.f. unconstrained case!!!
Philippe Toint (Namur)


April 2009

316 / 323

Regularization methods and nonlinear step control

Conclusions

Conclusions for lesson 6

Much left to do. . . but very interesting


Unconstrained nonlinear stepsize control could lead to very untypical
methods (for instance, with all three criticality measures taken equal to ||g_k||
and a square-root radius function).
Meaningful numerical evaluation still needed for many of these
algorithms
Many issues regarding regularizations still unresolved

Philippe Toint (Namur)

April 2009

317 / 323

Regularization methods and nonlinear step control

Bibliography

Bibliography for lesson 6


1 R. H. Byrd, R. B. Schnabel and G. A. Shultz,
A trust region algorithm for nonlinearly constrained optimization,
SIAM Journal on Numerical Analysis, 24:1152-1170, 1987.
2 Ph. L. Toint,
Global convergence of a class of trust region methods for nonconvex minimization in Hilbert space,
IMA Journal of Numerical Analysis, 8(2):231-252, 1988.
3 A. Griewank,
The modification of Newton's method for unconstrained optimization by bounding cubic terms,
Department of Applied Mathematics and Theoretical Physics, University of Cambridge (UK), Report NA/12, 1981.
4 M. Weiser, P. Deuflhard and B. Erdmann,
Affine conjugate adaptive Newton methods for nonlinear elastomechanics,
Optimization Methods and Software, 22(3):413-431, 2007.
5 Yu. Nesterov and B. T. Polyak,
Cubic regularization of Newton method and its global performance,
Mathematical Programming, 108(1):177-205, 2006.
6 C. Cartis, N. I. M. Gould and Ph. L. Toint,
Adaptive cubic overestimation methods for unconstrained optimization,
FUNDP, Namur, Report 07/05, 2007.
7 C. Cartis, N. I. M. Gould and Ph. L. Toint,
Trust-region and other regularisation of linear least-squares problems,
BIT, to appear, 2009.
8 Yu. Nesterov,
Modified Gauss-Newton scheme with worst-case guarantees for global performance,
Optimization Methods and Software, 22(3):469-483, 2007.
9 S. Bellavia, C. Cartis, N. I. M. Gould, B. Morini and Ph. L. Toint,
Convergence of a Regularized Euclidean Residual Algorithm for Nonlinear Least-Squares,
FUNDP, Namur, Report 08/11, 2008.
10 C. Cartis, N. I. M. Gould and Ph. L. Toint,
An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its
function-evaluation complexity,
FUNDP, Namur, Report 08/05R, 2009.
Philippe Toint (Namur)

April 2009

318 / 323

Conclusions

Not covered in the course

non-smooth techniques
specifically convex problems
penalty functions
augmented Lagrangians
affine scaling methods
general sequential quadratic programming (SQP)
systems of nonlinear equations
...

Many thanks to you all for your patience!

Philippe Toint (Namur)

April 2009

319 / 323
