
MA3513 Elementary Numerical Methods

Ya Yan Lu
Department of Mathematics
City University of Hong Kong
Preface
Numerical methods are mathematical methods that give approximate solutions to mathematical problems based on arithmetic operations involving numbers with a finite number of digits. Numerical methods are usually implemented on computers using floating point numbers. Numerical methods are the foundation of computational science and computational engineering. Computational sciences, including computational physics, computational chemistry and computational biology, perform numerical simulations to understand scientific phenomena, test theories and make predictions. Engineers (electrical engineers, mechanical engineers, civil engineers, etc.) perform numerical simulations in their computer-aided design processes. Numerical methods are also needed in the financial business, where people want to maximize their profits and minimize their risks.

While numerical methods give approximate solutions, they are different from analytic approximations. In principle, a numerical method should be able to produce a solution with an arbitrarily small error (i.e., the accuracy can be as high as desired) if sufficient computer resources (CPU time and memory) are available. On the other hand, the error of an analytic approximation is fixed. Numerical computations are also different from exact calculations in symbolic algebra software, such as Maple, where the errors are zero. While zero error is nice, exact solutions are only possible for a very limited number of problems. Even when analytic solutions are available, sometimes they are so complicated that you would prefer a simpler approximate solution. In any case, numerical methods have a much wider scope and are extremely important.

Numerical analysis is the subject that studies the mathematical properties of numerical methods. Such mathematical properties include convergence, stability, etc. In reality, it is impossible to separate numerical methods from numerical analysis. When a numerical method is introduced, we certainly want to understand its basic mathematical properties. Scientific computation (or scientific computing) emphasizes the applications of numerical methods. Computational mathematics is a broad concept that includes numerical methods, their analysis and applications.

In this course, we cover the most important numerical methods. The first chapter introduces floating point numbers and their arithmetic. The second chapter presents some numerical methods for solving f(x) = 0, where f is an arbitrary function. Chapter 3 deals with approximating a function (or a set of points from a given function) by polynomials or piecewise polynomials. Chapter 4 is about evaluating definite integrals of arbitrary functions. Chapter 5 is a little special: it is related to the Fourier transform and Fourier series. Chapters 6, 7 and 8 are about numerical methods for linear algebra. We present numerical methods for solving linear systems, least squares problems and eigenvalue problems. The last chapter is about initial value problems of ordinary differential equations, but it covers only a few simple numerical methods. This course is followed by a course on Numerical Methods for Differential Equations, where more numerical methods for initial value problems are presented, together with boundary value problems and numerical methods for partial differential equations.
Contents
1 Computer arithmetic
1.1 Floating point numbers
1.2 Loss of precision
1.3 Numerical instability
2 Nonlinear Equations
2.1 Bisection method
2.2 Newton's method
2.3 Secant method
2.4 Nine additional methods
3 Polynomial and Spline Interpolation
3.1 Polynomial interpolation, Lagrange formula
3.2 Cubic spline interpolations
4 Numerical Integration
4.1 Trapezoidal rule
4.2 Simpson's method
4.3 Adaptive Simpson's method
4.4 Gauss-Legendre quadrature
5 Fast Fourier Transform
5.1 Fourier series
5.2 Discrete Fourier Transform
5.3 Fast Fourier Transform
5.4 Matrix Factorization for FFT
5.5 Derivative based on Discrete Fourier Transform
6 Linear Equations
6.1 Triangular system
6.2 LU decomposition
6.3 LU decomposition with partial pivoting
6.4 Matrix norm and condition number
7 QR factorization and Least Squares Problems
7.1 The least squares problem
7.2 Householder reflection
7.3 QR factorization
7.4 Solving least squares problems
8 Matrix Eigenvalue Problem
8.1 Introduction
8.2 Power, inverse power and Rayleigh quotient iterations
8.3 Orthogonal reduction
8.4 The QR algorithm
9 Runge-Kutta Methods
9.1 Introduction
9.2 Euler and Runge-Kutta Methods
9.3 Local truncation error and order
9.4 An example
Chapter 1
Computer arithmetic
In this chapter, we introduce the concept of floating point numbers. A number of examples are used to illustrate possible wrong answers computed from correct mathematical formulas. These examples are related to the fact that floating point numbers have only a finite number of digits. This implies that subtraction between nearly equal numbers leads to loss of significance. If a calculation involves many steps, numerical instability may occur. This means that unavoidable small errors may increase rapidly as the steps proceed.
1.1 Floating point numbers
The floating point numbers used in a programming language such as C, C++ or FORTRAN usually have single or double precision. A single precision floating point number (i.e. float in C) has 4 bytes or 32 bits, arranged in the following order:

s  e_7 e_6 ... e_1 e_0  m_1 m_2 ... m_23,

where s, e_0, ..., e_7, m_1, ..., m_23 are 1 or 0. This is defined (in the so-called IEEE standard 754) as

x = (-1)^s [ 1 + m_1/2 + m_2/2^2 + ... + m_23/2^23 ] × 2^q,    (1.1)

where q is an integer. More precisely,

q = (e_7 e_6 ... e_0)_2 - 127 = e_0 + 2 e_1 + 2^2 e_2 + ... + 2^7 e_7 - 127.

We can also use the binary number notation:

(1.m_1 m_2 ... m_23)_2 = 1 + m_1/2 + m_2/2^2 + ... + m_23/2^23.

Thus,

x = (-1)^s (1.m_1 m_2 ... m_23)_2 × 2^q.    (1.2)

This is the usual floating point number. The number defined above has 24 meaningful bits (including the leading 1). There are also some unusual floating point numbers given at the end of this section.
Clearly, most real numbers are not floating point numbers as defined above. For example, 0.1 is not a floating point number. In fact,

0.1 = 1.6 × 2^{-4} ≠ (1.m_1 m_2 ... m_23)_2 × 2^{-4}

for any choice of m_1, m_2, ..., m_23. This is so because

(0.m_1 m_2 ... m_23)_2 = p / 2^{23}

for some integer p. If p/2^{23} = 0.6 = 3/5, we would have 5p = 3 × 2^{23}. But that is impossible, because the prime factor 5 does not appear on the right hand side.

In a programming language like C, if we try to input a number like 0.1, what we actually get is a nearby floating point number. Let x be a real number; we denote by fl(x) the nearest floating point number. The process from x to fl(x) is called rounding. When fl(x) is used as an approximation to x, we have some small error. This error is called the round-off error. We define

absolute error = |x - fl(x)|,    relative error = |x - fl(x)| / |x|.
Theorem 1 For the single precision floating point number system defined in (1.1), if x is a real number and fl(x) is the nearest floating point number, then

|x - fl(x)| / |x| ≤ 1/2^{24} = ε_M.    (1.3)

Note: We usually call ε_M = 1/2^{24} the machine epsilon, or unit round-off. An alternative form of the above theorem is:

fl(x) = x(1 + δ) for some |δ| ≤ ε_M.    (1.4)
When we try to add two floating point numbers together, the exact result is usually not a floating point number. More precisely, assume x and y are floating point numbers, so that x = fl(x) and y = fl(y); then x + y usually is not a floating point number. We assume that our computer will do its best. That is, it will give us fl(x + y). We have the following Fundamental Assumption on Floating Point Arithmetic:

Assumption: Let x and y be floating point numbers, and let ∘ denote +, -, × or ÷; then

Computer Result of x ∘ y = fl(x ∘ y).    (1.5)

In other words, the computer result is the nearest floating point number to the exact answer. Therefore, after every arithmetic operation, we have a round-off error. When the round-off errors are large compared with the final answer, the computer results are often meaningless. This happens when e^{-21} is calculated by its Taylor expansion. It also happens when we solve the quadratic equation ax^2 + bx + c = 0 using the standard formula, if b^2 is much larger than 4ac.
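To see the second effect concretely, here is a small MATLAB sketch (the coefficient values are made up for illustration; the second formula is the standard rewriting that avoids subtracting nearly equal numbers):

% Loss of significance in the quadratic formula when b^2 >> 4ac.
% The values of a, b, c below are made-up illustration values.
a = 1;  b = 1e8;  c = 1;
s = sqrt(b^2 - 4*a*c);
x1 = (-b + s) / (2*a);       % subtracts two nearly equal numbers
x2 = (2*c) / (-b - s);       % same root mathematically, no cancellation
[a*x1^2 + b*x1 + c,  a*x2^2 + b*x2 + c]   % residuals: large vs tiny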
Appendix 1 Unusual floating point numbers:

If e_7 = e_6 = ... = e_0 = m_1 = m_2 = ... = m_23 = 0, the number is defined as 0.

If e_7 = e_6 = ... = e_0 = 0, but not all m_j are zero, then the number is defined as

x = (-1)^s (0.m_1 m_2 ... m_23)_2 × 2^{-126}.

Notice that there is no 1 in front of .m_1 m_2 ... m_23 and that the exponent -126 is not 0 - 127. These numbers are small, but less accurate (fewer than 24 meaningful bits). Furthermore, the definition here is also consistent with the definition of 0.

If e_7 = e_6 = ... = e_0 = 1 and m_1 = m_2 = ... = m_23 = 0, we have

x = (-1)^s × ∞.

If e_7 = e_6 = ... = e_0 = 1 and not all m_j are zero, then the number is not defined. You will get NaN, "not a number".

The smallest positive floating point number is 2^{-149} and the largest positive floating point number is (2 - 2^{-23}) × 2^{127}.
Appendix 2 The precise rounding process: We start with the binary expansion of x:

x = (1.m_1 m_2 m_3 ... m_23 m_24 m_25 ...)_2 × 2^q,

where q is some integer and m_j = 0 or 1. Now, fl(x) is the floating point number closest to x. We have

fl(x) = (1.m_1 m_2 m_3 ... m_23)_2 × 2^q

if m_24 = 0, and

fl(x) = [ (1.m_1 m_2 m_3 ... m_23)_2 + 2^{-23} ] × 2^q

if m_24 = 1 and m_k = 1 for some k > 24. The case when m_24 = 1 and m_25 = m_26 = m_27 = ... = 0 needs special consideration. The details depend on m_23. For m_23 = 0, if

x = (1.m_1 m_2 m_3 ... m_22 0 1)_2 × 2^q,

then

fl(x) = (1.m_1 m_2 m_3 ... m_22 0)_2 × 2^q.

For m_23 = 1, if

x = (1.m_1 m_2 m_3 ... m_22 1 1)_2 × 2^q,

then

fl(x) = [ (1.m_1 m_2 m_3 ... m_22 1)_2 + 2^{-23} ] × 2^q.
1.2 Loss of precision
In this section, we consider two examples which both indicate loss of precision due to the use of floating point numbers. In the first example, we calculate

x = 1 + 1/2^n,   n = 0, 1, 2, 3, 4, ...

and check the condition x > 1. It is found that when n ≥ 53, we have x = 1 in MATLAB. This is so because MATLAB uses double precision floating point numbers, which can only keep about 15 or 16 decimal digits. Here is a simple MATLAB program for this:
% This function calculate the smallest integer n,
% such that 1 + 1/2^n = 1 in MATLAB.
function n = onepesp
y = 1;
n = 0;
while 1 + y > 1
y = y/2;
n = n+1;
end
Next, we consider a program for evaluating e^x by its Taylor expansion:

e^x = 1 + x + x^2/2! + x^3/3! + x^4/4! + ...
We have the following MATLAB program:
% This function calculates exp(x) by its Taylor expansion
% exp(x) = 1 + x + x^2/2! + x^3/3! + x^4/4! + ...
function z = exptay(x)
z = 1;
zold = 0;
y = 1;
n=0;
while z ~= zold
n = n+1;
y = y *x/n
zold = z;
z = z + y;
end
For x = -21, the exact value is

e^{-21} ≈ 7.582560427911907 × 10^{-10},

but the above program gives 3.1649 × 10^{-9}. The variable y represents x^n/n!. If we list y for various n, we can see that for 19 ≤ n ≤ 22 we have |y| > 10^8. Since we only have about 16 digits of accuracy, the error in these values of y can be very significant compared with the exact value of e^{-21} (which is less than 10^{-9}). Thus, the error in y ruins the answer.
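A simple remedy for this example, sketched below (it reuses the exptay function above), is to sum the series for the positive argument, where all the terms have the same sign, and then take the reciprocal:

% Avoid the cancellation for negative x: exp(-21) = 1/exp(21).
x = -21;
z_bad  = exptay(x);        % severe cancellation, wrong answer
z_good = 1 / exptay(-x);   % accurate to about machine precision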
1.3 Numerical instability
If you have to do a sequence of calculations with floating point numbers, the rounding errors will accumulate. If the total number of operations is N and the error in the final answer is O(N^p ε_M) for a small integer p, then it is usually acceptable. However, there are cases where the errors grow like a^N for some |a| > 1. This type of exponential growth of the error will make the computed results meaningless even for quite small N. We call this type of strong growth of errors numerical instability.

We consider the sequence

a_n = 1/3^n.

It is easy to verify that the sequence satisfies the linear recurrence relationship

a_n = (2/3) a_{n-2} - (5/3) a_{n-1},   n = 3, 4, ...

This leads to the following program:
% This function calculates a(n) = 1/3^n by the
% recurrence a(n) = (2/3) * a(n-2) - (5/3)*a(n-1)
function [a, a1] = unstab(nmax)
a(1) = 1/3;
a(2) = 1/9;
for n=3:nmax
a(n) = (2/3) * a(n-2) - (5/3)*a(n-1);
end
% We also calculate a(n) directly, call it a1.
for n=1:nmax
a1(n) = 1/3^n;
end
For small n, the numerical results obtained from the recurrence formula are accurate. But for larger n, the answer is completely wrong. For example, we have

1/3^{67} = 1.0786... × 10^{-32},   but a(67) = 3.27...

To understand this, we can solve the linear recurrence relation. The general solution is

a_n = A (1/3)^n + B (-2)^n.

In the MATLAB program, we set a(1) = 1/3 and a(2) = 1/9; thus, we should get A = 1 and B = 0. But since the numerical computations use finite precision, we always get some small round-off error. This implies that B cannot be exactly zero. The term B(-2)^n will then dominate for large n.
Chapter 2
Nonlinear Equations
In this chapter, we study some numerical methods for solving

f(x) = 0,

where f is a function of x. We will present three basic methods:

the bisection method,
Newton's method,
the secant method.

In MATLAB, there is a function called fzero which can be used to solve f(x) = 0.
2.1 Bisection method
To find x* such that f(x*) = 0, the bisection method starts with two numbers a and b, assuming f(a) and f(b) have opposite signs. Under the assumption that f is continuous, we know that the interval (a, b) must contain a zero of f. Now, we consider the mid-point c of the interval (a, b),

c = (a + b)/2,

and find f(c). If f(c) = 0, we have found the solution. Otherwise, either (a, c) or (c, b) must contain a zero of f. This can be decided by comparing the sign of f(c) with that of f(a) or f(b). If f(c) has the same sign as f(a), then f(c) must have the opposite sign to f(b); therefore, the interval (c, b) contains a zero of f. In this case, we define the new a as c (and keep b unchanged), then repeat the process. Similarly, if f(c) has the same sign as f(b), we replace the original interval (a, b) by (a, c), because (a, c) contains a zero of f. In this case, we define the new b as c and repeat the process.

In each step, we have an interval containing a zero of f, then determine an interval of half the size that still contains a zero of f. The process is repeated until the length of the interval is small enough. The mid-points of the intervals serve as our approximate solutions.

If x_1 is the mid-point of [a, b], x_2 is the mid-point of the next half interval, etc., then

|x_n - x*| ≤ (b - a)/2^n.

The following is a simple MATLAB program (saved in the file bisection.m) for the bisection method:
function c = bisection(a,b,tiny)
% you need a MATLAB function for f(x). This program
% finds a zero of f, with given a and b, such that
% f(a) and f(b) have different signs.
% tiny is a small number to control the accuracy.
fa = f(a);
fb = f(b);
c = (b+a)/2;
while abs(a-b) > tiny,
c = (b+a)/2
if c==a | c==b
return;
end
fc = f(c);
if fc==0
return;
end
if sign(fc) ~= sign(fb)
a=c;
fa = fc;
else
b=c;
fb = fc;
end
end
Now, suppose we want to solve

f(x) = x - cos x = 0.

We must first write a MATLAB function f.m:
function z = f(x)
z = x - cos(x);
Since f(0) = -1 < 0 and f(1) = 1 - cos(1) ≈ 0.45970 > 0, we can let

a = 0,  b = 1.

If we choose tiny = 1.0E-9, the MATLAB program gives the following output:
c = 0.50000000000000
c = 0.75000000000000
c = 0.62500000000000
c = 0.68750000000000
c = 0.71875000000000
c = 0.73437500000000
c = 0.74218750000000
c = 0.73828125000000
c = 0.74023437500000
c = 0.73925781250000
c = 0.73876953125000
c = 0.73901367187500
c = 0.73913574218750
c = 0.73907470703125
c = 0.73910522460938
c = 0.73908996582031
c = 0.73908233642578
c = 0.73908615112305
c = 0.73908424377441
c = 0.73908519744873
c = 0.73908472061157
c = 0.73908495903015
c = 0.73908507823944
c = 0.73908513784409
c = 0.73908510804176
c = 0.73908512294292
c = 0.73908513039351
c = 0.73908513411880
c = 0.73908513225615
c = 0.73908513318747
The bisection method is very reliable, but it is not very efficient. When we set tiny = 10^{-9}, we need 30 steps.
2.2 Newton's method
Newton's method is a very efficient method for solving f(x) = 0, but it is necessary to know the derivative of f(x). You also need to provide an initial guess of the exact solution. If the initial guess is not good, the method may fail. Therefore, Newton's method is not as reliable as the bisection method, but it is much more efficient than bisection.
Starting from an initial guess x_0, Newton's method obtains x_1, x_2, ..., recursively by the formula

x_{k+1} = x_k - f(x_k) / f'(x_k).    (2.1)

This gives rise to a sequence of approximate solutions, and we hope lim x_k = x* as k → ∞. In the xy-plane, we notice that x_{k+1} is the intersection of the tangent line of y = f(x) at (x_k, f(x_k)) with the x-axis.

If we assume that x* is the exact solution, such that f(x*) = 0 and f'(x*) ≠ 0, and that f has a continuous second order derivative, then we can show that

x_{k+1} - x* ≈ [ f''(x*) / (2 f'(x*)) ] (x_k - x*)^2

for large k. More precisely, we have

lim_{k→∞} (x_{k+1} - x*) / (x_k - x*)^2 = f''(x*) / (2 f'(x*)).

This means that Newton's method has quadratic convergence in general. As a result, if x_k has 10 correct digits, we expect x_{k+1} to have about 20 correct digits, x_{k+2} to have about 40 correct digits, etc.
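A minimal MATLAB sketch of the iteration (2.1) is given below; the function name, the handles fun and dfun for f and f', and the stopping criterion are chosen here for illustration only.

% Newton's method for f(x) = 0, a minimal sketch.
% fun and dfun are function handles for f and f'; x is the initial guess.
function x = newton(fun, dfun, x, tol, maxit)
for k = 1:maxit
    dx = fun(x) / dfun(x);
    x  = x - dx;                        % formula (2.1)
    if abs(dx) < tol * abs(x), return; end
end

For example, newton(@(x) x - cos(x), @(x) 1 + sin(x), 0.5, 1e-12, 20) converges to 0.739085133215161 in a handful of iterations.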
We apply Newton's method to find an approximation of √2 by solving

f(x) = x^2 - 2 = 0.

Since f'(x) = 2x, we have

x_{k+1} = x_k - f(x_k)/f'(x_k) = x_k - (x_k^2 - 2)/(2 x_k) = x_k/2 + 1/x_k.

We do this calculation in Maple with 45 digits and the initial guess x_0 = 1:
> Digits:=45;
Digits := 45
> x:=1.0;
x := 1.0
> x := x/2 + 1/x;
x := 1.50000000000000000000000000000000000000000000
^^
> x := x/2 + 1/x;
x := 1.41666666666666666666666666666666666666666667
^^^^
> x := x/2 + 1/x;
x := 1.41421568627450980392156862745098039215686275
^^^^^^^
> x := x/2 + 1/x;
x := 1.41421356237468991062629557889013491011655962
^^^^^^^^^^^^^
> x := x/2 + 1/x;
x := 1.41421356237309504880168962350253024361498193
^^^^^^^^^^^^^^^^^^^^^^^^^
> x := x/2 + 1/x;
x := 1.41421356237309504880168872420969807856967188
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> sqrt(2.0);
1.41421356237309504880168872420969807856967188
The fast convergence of Newton's method is a result of its quadratic convergence. That is,

e_{k+1} ≈ C e_k^2,   where   e_k = |x_k - x*|,  e_{k+1} = |x_{k+1} - x*|,  C = | f''(x*) / (2 f'(x*)) |.

In general, if e_n tends to zero as n tends to infinity, we are interested in how e_{n+1} varies as a function of e_n. Therefore, we want to find some constants C and α, such that

e_{n+1} ≈ C e_n^α.

Here, α is the order of convergence. For α = 1, 2 and 3, we call this linear, quadratic and cubic convergence, respectively. If α = 1, you need the extra condition |C| < 1 for convergence. But if α > 1, we can still have convergence (i.e., e_n → 0 as n → ∞) even if |C| > 1, provided the initial error is small enough.
2.3 Secant method
The secant method can be applied when the derivative of f(x) is not available. However, it needs two initial guesses. If the initial guesses are not good, the method may fail. Therefore, the secant method (like Newton's method) is not as reliable as the bisection method, but it is also much more efficient than the bisection method.

Starting from x_0 and x_1, to find x* such that f(x*) = 0, we obtain the sequence (of approximate solutions) by

x_{k+1} = x_k - (x_k - x_{k-1}) / ( f(x_k) - f(x_{k-1}) ) · f(x_k).    (2.2)

In the xy-plane, x_{k+1} is the intersection of the straight line connecting (x_k, f(x_k)) and (x_{k-1}, f(x_{k-1})) with the x-axis. Alternatively, we can think of the secant method as replacing f'(x_k) by

( f(x_k) - f(x_{k-1}) ) / ( x_k - x_{k-1} )

in Newton's method.

For the secant method, using a Taylor expansion at x*, the following relationship can be obtained:

x_{k+1} - x* ≈ [ f''(x*) / (2 f'(x*)) ] (x_k - x*)(x_{k-1} - x*).

This leads to

|x_{k+1} - x*| ≈ C |x_k - x*|^α,

where α = (1 + √5)/2 ≈ 1.618 and C is some constant related to f'(x*) and f''(x*). This implies that the order of convergence of the secant method is 1.618. Since the order of convergence of Newton's method is 2, the secant method is a little slower than Newton's method.
Here is a short MATLAB program (saved in secant.m) for the secant method:
function x1 = secant(x0,x1,tiny)
f0 = f(x0);
diff = x1-x0;
while abs(diff/x1) > tiny
f1 = f(x1);
diff = (x1-x0)*f1/(f1-f0);
x0 = x1;
f0 = f1;
x1 = x1 - diff
end
Assuming we are solving f(x) = x - cos(x) = 0, we set x_0 = 0.5, x_1 = 0.6 and tiny = 10^{-12}. The results are
x1 = 0.74800665588273
x1 = 0.73879196796329
x1 = 0.73908455831284
x1 = 0.73908513325238
x1 = 0.73908513321516
x1 = 0.73908513321516
Notice that x1 above actually represents x_2, x_3, ..., x_7. It is clear that within five iterations the program terminates, because we have reached the condition

| x_7 - x_6 | / | x_7 | ≤ 10^{-12}.
2.4 Nine additional methods
In the previous sections, we have used two conditions, namely f(x_n) and f'(x_n) for Newton's method, or f(x_{n-1}) and f(x_n) for the secant method, to determine a straight line that approximates f(x), and then find the next iterate x_{n+1}. In this section, we introduce methods that use three conditions.

First, we have to decide what the three conditions are. There are three possibilities:

(C1): f(x_n), f'(x_n) and f''(x_n);
(C2): f(x_{n-1}), f(x_n) and f'(x_n);
(C3): f(x_{n-2}), f(x_{n-1}) and f(x_n).

The case (C1) involves just one point, but it requires the first and second order derivatives. The case (C3) involves three points and no derivatives. The case (C2) involves two points and the first order derivative. Next, we need to choose a simple function to approximate f(x). For Newton's method and the secant method, the approximating function is a polynomial of x with degree 1. Therefore, we can naturally use a polynomial of degree 2. However, there are other possibilities. We have

(A1): y = a + b(x - x_n) + c(x - x_n)^2;
(A2): x = a + b[y - f(x_n)] + c[y - f(x_n)]^2;
(A3): y = [ a + b(x - x_n) ] / [ 1 + c(x - x_n) ].

The case (A1) is a quadratic polynomial of x. We have shifted the variable, so that the coefficient a is simple to calculate; in fact, a = f(x_n). The case (A2) is an inverse quadratic polynomial. Here, the quadratic polynomial is a polynomial of y and we consider x as a function of y. It is also shifted, such that a = x_n. The case (A3) is a rational function where the degrees of the numerator and denominator are both 1. It is written in the form where y is regarded as a function of x. If we solve for x and consider x as a function of y, then it is also a rational function of the same degrees. The coefficient a is easily obtained: a = f(x_n).

Each combination of (Ci) and (Aj), for 1 ≤ i, j ≤ 3, gives us a method. The function given in (Aj) is used to satisfy the three conditions in (Ci). This allows us to determine the coefficients a, b and c in (Aj). Setting y = 0 in (Aj), we can find x_{n+1}. Therefore, we obtain nine different methods from all these combinations. The case of (C1) and (A1) is the Taylor expansion method. The case of (C3) and (A1) is the relatively popular Müller's method. When the quadratic polynomial (A1) is used, we have to solve the equation

0 = f(x_n) + b(x - x_n) + c(x - x_n)^2

for x_{n+1}. However, the quadratic equation has two solutions. We choose x_{n+1} to be the one that is closer to x_n. The inverse quadratic polynomial (A2) is quite convenient to use. Once a, b and c are obtained (a = x_n), then

x_{n+1} = x_n - b f(x_n) + c [f(x_n)]^2.

For each of these methods, we can carry out some perturbation analysis assuming x_n is close to the exact solution x*. For Müller's method, we obtain

x_{n+1} - x* ≈ C (x_n - x*)(x_{n-1} - x*)(x_{n-2} - x*)

for some C related to f and its derivatives at x*. This leads to

|x_{n+1} - x*| ≈ C' |x_n - x*|^α,

where α is a solution of α^3 - α^2 - α - 1 = 0, i.e., α ≈ 1.8393.
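A minimal MATLAB sketch of the (C3) + (A1) combination (Müller's method) is given below; the function name, the handle fun and the stopping criterion are illustration choices, not part of the original notes.

% Muller's method: fit a quadratic through (x0,f(x0)), (x1,f(x1)), (x2,f(x2))
% and take the root of the quadratic that is closer to x2.
function x2 = muller(fun, x0, x1, x2, tol, maxit)
for it = 1:maxit
    h1 = x1 - x0;  h2 = x2 - x1;
    d1 = (fun(x1) - fun(x0)) / h1;
    d2 = (fun(x2) - fun(x1)) / h2;
    c  = (d2 - d1) / (h2 + h1);          % coefficient of (x - x2)^2
    b  = d2 + h2*c;                      % coefficient of (x - x2)
    a  = fun(x2);                        % constant term f(x2)
    s  = sqrt(b^2 - 4*a*c);              % may be complex
    if abs(b + s) > abs(b - s)           % larger denominator => root closer to x2
        dx = -2*a / (b + s);
    else
        dx = -2*a / (b - s);
    end
    x0 = x1;  x1 = x2;  x2 = x2 + dx;
    if abs(dx) < tol * abs(x2), return; end
end

For example, muller(@(x) x - cos(x), 0.4, 0.5, 0.6, 1e-12, 50) converges to the same root 0.739085... found earlier.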
Chapter 3
Polynomial and Spline Interpolation
If I give you two points in the xy-plane, you can find a straight line connecting the two points. Similarly, if three points are given, you can find a quadratic polynomial passing through the three points. In the first section of this chapter, we consider the general case of a polynomial interpolating n + 1 points. In the second section, we consider piecewise cubic polynomials.
3.1 Polynomial interpolation, Lagrange formula
The problem of polynomial interpolation is:

Given n + 1 points (x_j, y_j), j = 0, 1, ..., n, find a polynomial P_n(x), with the degree of P_n ≤ n, such that

P_n(x_j) = y_j,   for j = 0, 1, ..., n.

We must require that the x_j values are different for different j; otherwise, there would be two points on a vertical line and the polynomial P_n(x) could not be defined.

Since we require that the degree of P_n is n or less, there is only one polynomial satisfying the conditions. This is the so-called uniqueness. Suppose you have another polynomial Q_n(x) with a degree less than or equal to n and

Q_n(x_j) = y_j.

Then we can consider the polynomial R_n(x) = P_n(x) - Q_n(x). Since P_n and Q_n are both polynomials of degree ≤ n, so is R_n. However,

R_n(x_j) = 0,   j = 0, 1, ..., n.

A nonzero polynomial of degree ≤ n can have at most n zeros, but R_n has n + 1 zeros. This means R_n ≡ 0. Therefore, P_n = Q_n.
The polynomial P_n has an explicit formula called the Lagrange interpolation formula. For n = 1, we are given two points (x_0, y_0) and (x_1, y_1); then

P_1(x) = y_0 (x - x_1)/(x_0 - x_1) + y_1 (x - x_0)/(x_1 - x_0).

For n = 2, we have one more point (x_2, y_2); then

P_2(x) = y_0 (x - x_1)(x - x_2) / [(x_0 - x_1)(x_0 - x_2)] + y_1 (x - x_0)(x - x_2) / [(x_1 - x_0)(x_1 - x_2)] + y_2 (x - x_0)(x - x_1) / [(x_2 - x_0)(x_2 - x_1)].
In general,

P_n(x) = Σ_{j=0}^{n} y_j Π_{k=0..n, k≠j} (x - x_k)/(x_j - x_k).

Let us take a closer look at

L_j(x) = Π_{k=0..n, k≠j} (x - x_k)/(x_j - x_k).

The index k goes through 0, ..., j-1, j+1, ..., n; thus, L_j is a polynomial of degree n. Meanwhile,

L_j(x_j) = 1,   L_j(x_k) = 0 if k ≠ j.

Now,

P_n(x) = y_0 L_0(x) + y_1 L_1(x) + ... + y_n L_n(x).

Thus,

P_n(x_j) = y_0 L_0(x_j) + ... + y_j L_j(x_j) + ... + y_n L_n(x_j) = y_j.
To summarize, we have the following theorem:

Theorem 2 Given n + 1 points (x_j, y_j) for j = 0, 1, ..., n (with distinct x_j), there is a unique polynomial P_n(x) with degree ≤ n, such that

P_n(x_j) = y_j,   j = 0, 1, ..., n,

and P_n can be written as

P_n(x) = Σ_{j=0}^{n} y_j Π_{k=0..n, k≠j} (x - x_k)/(x_j - x_k).

Interesting results can be obtained from the uniqueness. For example, if we choose y_j = 1 for all j, then the Lagrange formula gives

Σ_{j=0}^{n} Π_{k=0..n, k≠j} (x - x_k)/(x_j - x_k) = 1.

We can be sure that the above is true for all x because of the uniqueness: since 1 is a polynomial (of degree 0) that interpolates all these n + 1 points, it must be P_n.
We have stated the interpolation problem with n + 1 given points. It is often the case that these points are taken from a function. That is, we have a function f(x), and we take n + 1 points from the function:

y_j = f(x_j),   j = 0, 1, ..., n.

Now, we have the polynomial P_n(x) interpolating these points. This can be useful if f(x) is very difficult to calculate. With this interpolation, we can use P_n(x) to approximate f(x). In the next chapter, we will use the integral of P_n(x) to approximate the integral of f(x).

Although f and P_n are exactly the same at the points x_0, x_1, ..., x_n, for a general x they are usually different. The question is how close P_n(x) is to f(x). We have the following result on the error of polynomial interpolation:
Theorem 3 Let f be a function of x with continuous (n + 1)-th derivative, and let P_n(x) be the polynomial (degree ≤ n) interpolating the n + 1 points

(x_j, f(x_j)),   j = 0, 1, ..., n.

Then for each x, there is a ξ such that

f(x) - P_n(x) = [ f^{(n+1)}(ξ) / (n + 1)! ] Π_{j=0}^{n} (x - x_j).
To prove the above theorem, let us fix x so that x ≠ x_j for any j, then consider a function of t defined as

g(t) = f(t) - P_n(t) - [f(x) - P_n(x)] (t - x_0)(t - x_1)...(t - x_n) / [ (x - x_0)(x - x_1)...(x - x_n) ].

For this function, we can easily see that

g(x) = 0,   g(x_j) = 0,   j = 0, 1, 2, ..., n.

Therefore, g has n + 2 distinct zeros. From Rolle's theorem, if a function has two zeros, then there must be a zero of its derivative. A simple generalization says: if the function has three distinct zeros, then its first derivative has two zeros and its second derivative has one zero. In general, since g has n + 2 zeros, its (n + 1)-th derivative must have one zero. Therefore, there is some ξ such that

g^{(n+1)}(ξ) = 0.

Now, when we take the derivatives of g, we can use

d^{n+1} P_n(t) / dt^{n+1} = 0,   d^{n+1} [ (t - x_0)(t - x_1)...(t - x_n) ] / dt^{n+1} = (n + 1)!.

Therefore,

f^{(n+1)}(ξ) - 0 - [f(x) - P_n(x)] (n + 1)! / [ (x - x_0)(x - x_1)...(x - x_n) ] = 0.

This leads to the theorem.
Since ξ is not known, we cannot control f^{(n+1)}(ξ). However, very often we have good bounds for f^{(n+1)}(x). On the other hand, we may be interested in approximating f(x) on an interval [a, b]. In that case, in order to reach a small error, it is important to choose the nodes x_0, x_1, x_2, ..., x_{n-1}, x_n carefully. The trivial choice of equally spaced nodes

x_0 = a,   x_n = b,   x_j = a + (j/n)(b - a),

may lead to divergence. That means the error

max_{a ≤ x ≤ b} |f(x) - P_n(x)|

can tend to infinity as n increases. For an example, we let

f(x) = 1 / (1 + 25 x^2).

Now, we consider n + 1 points given by

x_j = -1 + 2j/n,   j = 0, 1, ..., n.

Let P_n(x) be the polynomial interpolating (x_j, f(x_j)) for j = 0, 1, ..., n. The polynomials P_6, P_8, P_10 and the exact function f are shown in the following figure. We can see that when n increases, the maximum of |f(x) - P_n(x)| actually increases. The maximum is attained near the ends of the interval. This is related to the fact that the function Π_{j=0}^{n} (x - x_j) has a very large magnitude near the two ends of the interval compared with its values near the middle of the interval.

To reduce the error, it is necessary to make the function Π_{j=0}^{n} (x - x_j) have more or less the same magnitude everywhere. This can be achieved by putting more nodes near the two ends. For the case of a = -1 and b = 1, the best choice is the zeros of the so-called Chebyshev polynomial T_{n+1}(x):

x_j = cos( (j + 1/2) π / (n + 1) ).
Figure 3.1: Interpolation of f(x) = 1/(1 + 25x^2) by polynomials using equally spaced points, for n = 6, 8 and 10.
The Chebyshev polynomials are defined as T_n(x) = cos(nθ), where x = cos(θ). They can be generated by the relationships:

T_0(x) = 1,
T_1(x) = x,
T_2(x) = 2x^2 - 1,
T_3(x) = 4x^3 - 3x,
T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x).

The above choice of {x_j} is applied to f(x) = 1/(1 + 25x^2) again. We obtain much improved results, as shown in the following figure. Since T_n(x) = cos(nθ) where x = cos(θ), we have |T_n(x)| ≤ 1 for -1 ≤ x ≤ 1. Notice that the coefficient of x^n in T_n(x) is 2^{n-1}. Therefore, 2^{1-n} T_n(x) = x^n + ... is a monic polynomial (the coefficient of the term of highest degree is 1). If p is another monic polynomial of degree n, then it can be proved that

max_{-1 ≤ x ≤ 1} |p(x)| ≥ 2^{1-n}.

Therefore, if we want to minimize the maximum absolute value of (x - x_0)(x - x_1)...(x - x_n), we should choose the monic polynomial 2^{-n} T_{n+1}(x) (which is of degree n + 1); that is, we should choose x_0, x_1, ..., x_n as the zeros of T_{n+1}.
Theorem 4 If the nodes x_0, x_1, ..., x_n are the zeros of the Chebyshev polynomial T_{n+1}, then the error formula of polynomial interpolation (for |x| ≤ 1) yields

|f(x) - P_n(x)| ≤ [ 1 / (2^n (n + 1)!) ] max_{|t| ≤ 1} |f^{(n+1)}(t)|.

Notice that we should also get very good results if we choose the extrema of the Chebyshev polynomial T_n(x):

x_j = cos( jπ/n ).
Figure 3.2: Polynomial interpolation of f(x) = 1/(1 + 25x^2) based on the zeros of the Chebyshev polynomial, for n = 6, 8 and 10.
A MATLAB implementation of the Lagrange formula is given below:
% This function evaluates the polynomial interpolating
% x(j), y(j), for j = 1: length(x).
% This is a polynomial of variable t.
function p = lagrange(t, x, y)
np1 = length(x);
p = zeros(size(t));
for j = 1 : np1
z = ones(size(t));
for k = 1 : np1
if k ~= j
z = z .* (t - x(k));
z = z/(x(j) - x(k));
end
end
p = p + y(j) * z ;
end
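As a short illustration (the grid size and n below are example choices, not from the notes), lagrange.m can be used to reproduce the comparison between equally spaced nodes and Chebyshev zeros for the Runge example:

% Compare equally spaced nodes with Chebyshev zeros for f(x) = 1/(1+25x^2).
f  = @(x) 1 ./ (1 + 25*x.^2);
t  = -1 : 0.001 : 1;                   % fine grid for measuring the error
n  = 10;
xe = -1 + 2*(0:n)/n;                   % equally spaced nodes
xc = cos(((0:n) + 0.5)*pi/(n+1));      % zeros of T_{n+1}
max(abs(f(t) - lagrange(t, xe, f(xe))))   % large error near the ends
max(abs(f(t) - lagrange(t, xc, f(xc))))   % much smaller error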
3.2 Cubic spline interpolations
As we have seen in the previous section, if we want to have a good approximation of a function on an interval [a, b], we must choose the nodes x_0, x_1, ..., x_n carefully. However, the nodes are often given, and we cannot choose them. Meanwhile, if we have some data from an experiment, we really do not know the function f. All we have are the given n + 1 points (x_j, y_j) for j = 0, 1, ..., n (assuming x_0 < x_1 < x_2 < ... < x_n). In that case, the polynomial interpolation described in Section 3.1 is not good if n is large. We prefer to use a piecewise polynomial. A cubic spline is a piecewise cubic polynomial.

For the given n + 1 points, we are looking for a function S(x) which is a cubic polynomial of x on each interval [x_j, x_{j+1}] for j = 0, 1, ..., n - 1. In other words,

S(x) = S_1(x),  x_0 ≤ x < x_1,
       S_2(x),  x_1 ≤ x < x_2,
       ...
       S_n(x),  x_{n-1} ≤ x ≤ x_n,    (3.1)

where each S_j(x) is a cubic polynomial. The function S and its derivatives S' and S'' are required to be continuous.
If we write down a general polynomial of degree 3, there are four coefficients. Therefore, the n cubic polynomials S_1, S_2, ..., S_n have 4n coefficients. To determine these 4n coefficients, we should find 4n conditions. First, we should have

S(x_j) = y_j,   j = 0, 1, ..., n.

Since S is piecewise, we actually have 2n conditions (each piece has two conditions):

S_j(x_{j-1}) = y_{j-1},   S_j(x_j) = y_j,   j = 1, 2, ..., n.

With the above, S is already continuous. Next, we must have the continuity of S' and S''. Actually, we do not know the values S'(x_j) and S''(x_j). Besides, the continuity condition does not give anything at x_0 and x_n. Therefore, we have the following 2n - 2 conditions:

S'_j(x_j) = S'_{j+1}(x_j),   S''_j(x_j) = S''_{j+1}(x_j),   j = 1, 2, ..., n - 1.

The total number of conditions is thus 4n - 2, but we have 4n coefficients to determine. Two extra conditions are needed. For the so-called natural cubic spline, we set

S''(x_0) = S''(x_n) = 0.

One possible approach is to set up 4n equations for the 4n unknowns directly. A better approach is to use the available conditions for each cubic polynomial first, to reduce the number of unknowns. For the cubic polynomial S_j, we already know its values at x_{j-1} and x_j. If we also knew its second derivative at x_{j-1} and x_j, then we could determine S_j completely. In reality, we do not know the second derivatives. Thus, we take the second derivatives at x_1, x_2, ..., x_{n-1} as the unknowns. We introduce the following notation:

S''(x_j) = y''_j,   j = 0, 1, ..., n,   with y''_0 = 0 = y''_n.

Thus y''_1, y''_2, ..., y''_{n-1} are the unknowns.
Theorem 5 Let S_j(x) be a cubic polynomial of x satisfying

S_j(x_{j-1}) = y_{j-1},   S_j(x_j) = y_j,   S''_j(x_{j-1}) = y''_{j-1},   S''_j(x_j) = y''_j;

then S_j can be written as

S_j(x) = A y_{j-1} + B y_j + (h_j^2 / 6) [ (A^3 - A) y''_{j-1} + (B^3 - B) y''_j ],    (3.2)

where

h_j = x_j - x_{j-1},
A = (x - x_j)/(x_{j-1} - x_j) = (x_j - x)/h_j,
B = (x - x_{j-1})/(x_j - x_{j-1}) = (x - x_{j-1})/h_j = 1 - A.

Proof: We notice that the first part A y_{j-1} + B y_j is the Lagrange formula for the polynomial interpolating two points. The terms A and B are polynomials of x of degree 1; thus A^3 and B^3 are polynomials of degree 3, and the formula for S_j gives a cubic polynomial. Meanwhile, we have

A = 1, B = 0  if x = x_{j-1};

thus S_j(x_{j-1}) = y_{j-1}. Similarly,

A = 0, B = 1  if x = x_j;

thus S_j(x_j) = y_j. For the second derivative, we can verify that

S''_j(x) = A y''_{j-1} + B y''_j.

After that we can easily verify the two conditions for S''_j.
Alternatively, we can use the following formula:

S_j(x) = y_{j-1} + C_j (x - x_{j-1}) + (y''_{j-1}/2)(x - x_{j-1})^2 + [ (y''_j - y''_{j-1}) / (6 h_j) ] (x - x_{j-1})^3,    (3.3)

where

C_j = (y_j - y_{j-1})/h_j - (h_j/6)(y''_j + 2 y''_{j-1}).

To use the above formula for S_j, we must solve for y''_1, ..., y''_{n-1} first. We can get the necessary equations from the continuity of S'. Using the above formula for the cubic polynomials, the condition

S'_j(x_j) = S'_{j+1}(x_j)

can be reduced to

h_j y''_{j-1} + 2(h_j + h_{j+1}) y''_j + h_{j+1} y''_{j+1} = 6 [ (y_{j+1} - y_j)/h_{j+1} - (y_j - y_{j-1})/h_j ]    (3.4)

for j = 1, 2, ..., n - 1. Since y''_0 = y''_n = 0, when the above n - 1 equations are written down, we have a linear system for the n - 1 unknowns y''_1, ..., y''_{n-1}. Once they are solved, we can use the earlier formula for the cubic polynomials.

Here is a simple MATLAB program for the natural cubic spline.
% For n+1 points given in the vectors x and y, this program
% calculates the natural spline function S(x). For each piece,
% the function is evaluated at m points.
function [S, xx] = ncubic(x,y, m)
n = length(x) -1;
ypp = ncubic2(x,y);
xx(1) = x(1);
S(1) = y(1);
for k=1:n
h = x(k+1) - x(k);
B = 1/m : 1/m : 1;
A = 1-B;
xadd = x(k) + h*B;
xx = [xx, xadd];
Sadd = y(k)*A + y(k+1) *B;
Sadd = Sadd + (h^2/6)*( (A.^3-A)*ypp(k) + (B.^3-B)*ypp(k+1));
S = [S, Sadd];
end
plot(x, y, 'o')
hold on
plot(xx, S)
hold off
% Given n+1 points: [x(i), y(i)], i=1, 2, ..., n+1,
% we determine the second derivative of the natural
% cubic spline function at these node points.
%
% ypp(i) = S''(x(i)), for i=1, 2, ..., n+1.
function ypp = ncubic2(x, y)
n = length(x) -1;
for j=1:n
h(j) = x(j+1) - x(j);
r(j) = (y(j+1) - y(j))/h(j);
end
A = zeros(n-1,n-1);
b = zeros(n-1,1);
for j=1:n-1
A(j,j) = 2*(h(j)+h(j+1));
b(j) = 6*(r(j+1)-r(j));
end
for j=1:n-2
A(j+1,j) = h(j);
A(j,j+1) = h(j+1);
end
ypp = zeros(n+1,1);
ypp(1) = 0;
ypp(2:n,1) = A\b;
ypp(n+1) = 0;
For the following five points:

(0, 3), (1, 1), (2, 1), (3, 2), (4, 0.5),

the cubic spline function S(x) is shown in the following figure.
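A minimal call producing this spline (the choice of 20 evaluation points per piece is an example value, not from the notes) is:

% Evaluate and plot the natural cubic spline through the five points above.
x = [0 1 2 3 4];
y = [3 1 1 2 0.5];
[S, xx] = ncubic(x, y, 20);   % ncubic.m also plots the data and the spline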
If we know the values of the first derivative at x_0 and x_n, we should use the clamped cubic spline. In that case, S(x) is still a piecewise polynomial of degree 3 with a continuous second order derivative, but S(x) satisfies

S'(x_0) = α,   S'(x_n) = β,

where α and β are given constants. In this case, we can proceed as follows:

1. Write down a formula for S_j(x) assuming S_j and S'_j are known at x_{j-1} and x_j.
Figure 3.3: Cubic spline function S(x) for the given five points.
2. Find S''_j(x) at x_{j-1} and x_j.

3. Establish an equation for the first order derivatives based on S''_j(x_j) = S''_{j+1}(x_j).

To apply the clamped cubic spline, you first solve for the first order derivatives at the interior points, i.e., S'(x_j) for 1 ≤ j < n, then use the formula obtained in Step 1 above.
Chapter 4
Numerical Integration
As you know from calculus, many functions cannot be integrated analytically. On the other hand, numerical methods can find approximate values of an integral for almost any function. In this chapter, we will consider the definite integral

I = ∫_a^b f(x) dx,

where a and b are constants and f is a given (but arbitrary) function. Our numerical methods can always be written as

∫_a^b f(x) dx ≈ Σ_k c_k f(x_k),

for some coefficients {c_k} and nodes {x_k}.

In the first two sections, the nodes are simply equally spaced points in the interval [a, b]. The third section presents an adaptive method. The fourth section gives more advanced formulas where the nodes {x_k} are zeros of the so-called Legendre polynomial.
4.1 Trapezoidal rule
We first consider the integral ∫_{x_0}^{x_1} f(x) dx, where x_0, x_1 are given constants and f is some arbitrary function. The elementary trapezoidal rule is useful when x_1 - x_0 = h is small. In this case, we approximate f on (x_0, x_1) by the polynomial of degree 1 interpolating the two points (x_0, f(x_0)) and (x_1, f(x_1)). That is,

f(x) ≈ P_1(x) = f(x_0) (x - x_1)/(x_0 - x_1) + f(x_1) (x - x_0)/(x_1 - x_0).

Then, we approximate the integral of f by the integral of P_1:

∫_{x_0}^{x_1} f(x) dx ≈ ∫_{x_0}^{x_1} P_1(x) dx = h [ f(x_0)/2 + f(x_1)/2 ].
For the integral of f on (a, b), we need the composite trapezoidal rule. It divides the interval (a, b) into many small ones, then applies the elementary trapezoidal rule to each small interval. We choose an integer n, then define

h = (b - a)/n,   x_0 = a,   x_j = a + j h.

Thus,

∫_a^b f(x) dx = ∫_{x_0}^{x_n} f(x) dx = Σ_{j=1}^{n} ∫_{x_{j-1}}^{x_j} f(x) dx ≈ Σ_{j=1}^{n} h [ f(x_{j-1})/2 + f(x_j)/2 ].

After a little simplification, we obtain the following composite trapezoidal rule:

∫_a^b f(x) dx ≈ h [ f(x_0)/2 + f(x_1) + ... + f(x_{n-1}) + f(x_n)/2 ].

A simple MATLAB program for the composite trapezoidal rule and the function f(x) = sin(x) e^{-x^2} is given below:
% The following function gives an approximation
% to the intergral of f(x) on (a,b) based on the
% trapezoidal rule. The integer n
% is the number of small intervals.
% a = x_0, x_1, x_2, ..., x_n = b.
% h = (b-a)/n.
% This program calls function f(x).
function z = ctrape(a,b,n)
h = (b-a)/n;
xin = a+h*(1:n-1);
y = f(xin);
z = (sum(y) + (f(a)+f(b))/2) * h;
function y = f(x)
y = sin(x) .* exp(-x.^2);
The exact value is

∫_0^1 sin(x) e^{-x^2} dx = 0.294698182...

For n = 200, the numerical result is 0.294695...
So far, we do not know how to choose n. If we just pick some n, we do not know whether our answer is accurate enough. If we choose a very large n, the solution will be more accurate, but it also takes more computing effort. For this purpose, a theory on the error of the trapezoidal rule is helpful. We start with the elementary trapezoidal rule. If f has a continuous second order derivative, we can use the error formula for polynomial interpolation (with n = 1) to derive an error formula for the trapezoidal rule. First, for the elementary formula on (x_0, x_1), there is a number ξ ∈ (x_0, x_1), such that

∫_{x_0}^{x_1} f(x) dx - (h/2) [ f(x_0) + f(x_1) ] = -(h^3/12) f''(ξ).

This can be used to obtain the error formula for the composite trapezoidal rule.

Theorem 6 If f has a continuous second order derivative on [a, b], then there is a number ξ ∈ (a, b), such that

∫_a^b f(x) dx - h [ f(x_0)/2 + f(x_1) + ... + f(x_{n-1}) + f(x_n)/2 ] = -[ (b - a)^3 / (12 n^2) ] f''(ξ),

where h = (b - a)/n and x_j = a + j h.

Although we do not know the value of ξ, if we have some bound for f'', we will have some idea about the error. For example, if we know some constant C such that

|f''(x)| ≤ C,   a ≤ x ≤ b,

and if we want the error to be less than 10^{-8}, then we can solve for n from

(b - a)^3 C / (12 n^2) ≤ 10^{-8}.
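As a small sketch of this procedure (the bound C = 3 for |f''| on [0, 1] is an assumed rough value, not derived in the notes):

% Choose n from the error bound of Theorem 6, then compute the integral.
a = 0;  b = 1;  tol = 1e-8;
C = 3;                                        % assumed bound for |f''| on [a,b]
n = ceil( sqrt( (b-a)^3 * C / (12*tol) ) );   % smallest n making the bound <= tol
I = ctrape(a, b, n)                           % uses ctrape.m and f.m above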
4.2 Simpson's method
In the last section, a polynomial of degree 1 was used to derive the elementary trapezoidal rule. It is natural to use polynomials of higher degree to derive approximate integration formulas. Simpson's method is derived from a quadratic polynomial interpolating three points. The idea of deriving a composite formula from an elementary formula is the same.

The elementary Simpson's rule is derived on the interval (x_0, x_2), where x_1 - x_0 = x_2 - x_1 = h. For the three points

(x_0, f(x_0)),  (x_1, f(x_1)),  (x_2, f(x_2)),

we have a quadratic polynomial P_2(x) which interpolates these three points and approximates f(x). Then,

∫_{x_0}^{x_2} f(x) dx ≈ ∫_{x_0}^{x_2} P_2(x) dx.

We can write down P_2(x) by the Lagrange interpolation formula and integrate P_2(x). This gives rise to the following elementary Simpson's rule:

∫_{x_0}^{x_2} f(x) dx ≈ h [ f(x_0)/3 + 4 f(x_1)/3 + f(x_2)/3 ].
When we want to integrate f on a larger interval (a, b), we can use the composite Simpson's rule. This is obtained by applying the elementary formula on smaller intervals. Starting with an even integer n, we define

h = (b - a)/n,   x_j = a + j h.

Obviously, x_0 = a and x_n = b. Thus,

∫_a^b f(x) dx = ∫_{x_0}^{x_2} f(x) dx + ∫_{x_2}^{x_4} f(x) dx + ... + ∫_{x_{n-2}}^{x_n} f(x) dx.

Now, we apply the elementary formula to each of the above integrals. After some simplification, we obtain the following composite Simpson's rule:

∫_a^b f(x) dx ≈ h [ f(x_0)/3 + 4f(x_1)/3 + 2f(x_2)/3 + ... + 2f(x_{n-2})/3 + 4f(x_{n-1})/3 + f(x_n)/3 ].
Similar to the trapezoidal rule, an error formula is useful to determine a proper value of n when a required accuracy is specified. If f has a continuous fourth order derivative, we can prove that there is some ξ ∈ (x_0, x_2), such that

∫_{x_0}^{x_2} f(x) dx - (h/3) [ f(x_0) + 4f(x_1) + f(x_2) ] = -(h^5/90) f^{(4)}(ξ).

More generally, we have the following theorem on the error of the composite Simpson's rule.

Theorem 7 If f has a continuous fourth order derivative on (a, b), then there is a number ξ ∈ (a, b), such that

∫_a^b f(x) dx - (h/3) [ f(x_0) + 4f(x_1) + 2f(x_2) + ... + 2f(x_{n-2}) + 4f(x_{n-1}) + f(x_n) ] = -[ (b - a)^5 / (180 n^4) ] f^{(4)}(ξ),

where n is even, h = (b - a)/n and x_j = a + j h for j = 0, 1, ..., n.

We should notice that Simpson's method has a fourth order of accuracy. That is, the error decreases like 1/n^4 as n increases. The trapezoidal rule has a second order of accuracy.
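A short sketch of the composite rule, written in the same style as ctrape.m above (the function name csimpson is an illustration choice, not from the notes):

% Composite Simpson's rule on (a,b) with an even number n of subintervals.
% This program calls the user-supplied function f(x), as ctrape.m does.
function z = csimpson(a, b, n)
h = (b - a)/n;
x = a + h*(0:n);                           % nodes x_0, ..., x_n
w = [1, repmat([4 2], 1, n/2-1), 4, 1];    % weights 1,4,2,...,2,4,1
z = (h/3) * sum(w .* f(x));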
4.3 Adaptive Simpson's method
When we use Simpson's method to integrate a function on the interval (a, b), we have to choose a number n to divide the interval into n small intervals. Although we have some idea about this from the error formula, it is still quite difficult to decide, since usually we do not know the 4th derivative of the function. The situation with Gauss-Legendre quadrature is similar: we still have to choose the number of points n. Meanwhile, some functions may be very smooth in some regions and have sharp changes in other regions, so it seems wasteful to use the same mesh size h in all these different regions. In recent years, adaptive numerical methods have been developed. For numerical integration, one of the simplest adaptive methods is the adaptive Simpson's method. This method has two features:

1. We do not have to choose an integer n. Instead, we need to specify an error tolerance ε. For the integral of f(x) on (a, b), the numerical method returns a value S such that

| ∫_a^b f(x) dx - S | ≤ ε

is usually true. But this cannot be guaranteed.

2. The numerical method evaluates f at some points. These points are chosen automatically and they are not uniformly distributed. More points are used in regions where the function f varies rapidly.
We will present the adaptive Simpson's method as a recursive method. Starting from the original interval (a, b), we somehow end up working on a sub-interval (a', b') which has a length of (b - a)/2^p for an integer p. Now, we wish to find a value S', such that

| ∫_{a'}^{b'} f(x) dx - S' | ≤ ε / 2^p.    (4.1)

First, we can calculate an approximate integral on (a', b') by the elementary Simpson's rule:

∫_{a'}^{b'} f(x) dx ≈ S_0 = [ (b' - a')/6 ] [ f(a') + 4f(c') + f(b') ] = S(a', b'),   c' = (a' + b')/2.

To simplify the notation, we use S(a', b') to denote the elementary Simpson's rule on the interval (a', b'). Now, on the two half intervals (a', c') and (c', b'), we can also evaluate the integrals by the elementary Simpson's rule:

∫_{a'}^{c'} f(x) dx ≈ S_10 = S(a', c'),   ∫_{c'}^{b'} f(x) dx ≈ S_11 = S(c', b').

Therefore, we have

∫_{a'}^{b'} f(x) dx ≈ S_1 = S_10 + S_11 = S(a', c') + S(c', b').

That is, we now have two approximations, S_0 and S_1, for the integral of f on (a', b'). Assuming the 4th derivative of f is continuous and (a', b') is a small interval, then

∫_{a'}^{b'} f(x) dx - S_0 = C = -(h^5/90) f^{(4)}(ξ),   ∫_{a'}^{b'} f(x) dx - S_1 ≈ C/16,

where ξ ∈ (a', b') and h = (b' - a')/2. The first equation is just the error formula of the elementary Simpson's rule. The second equation is obtained from the same error formula used on (a', c') and (c', b'), where h becomes h/2, but we have to add the two errors. It is only an approximation, since we have two ξ's in the second equation and they are different from the ξ in the first equation. Now, if

|S_1 - S_0| ≈ (15/16) |C| ≤ 15 ε / 2^p,

then

| ∫_{a'}^{b'} f(x) dx - S_1 | ≈ |C|/16 ≤ ε / 2^p,

and we can thus return S' = S_1 as the approximation. If |S_1 - S_0| is still too large, we can work on the two sub-intervals (a', c') and (c', b') with exactly the same algorithm. Suppose the method finds the approximations T_10 and T_11, such that

| ∫_{a'}^{c'} f(x) dx - T_10 | ≤ ε / 2^{p+1},   | ∫_{c'}^{b'} f(x) dx - T_11 | ≤ ε / 2^{p+1};

we can then return S' = T_10 + T_11. In practice, the condition |S_1 - S_0| ≤ 15ε/2^p may be changed to |S_1 - S_0| ≤ mε/2^p, where m < 15, for example m = 8. This gives us more confidence that (4.1) is satisfied.

Here is a MATLAB program for the adaptive Simpson's method:
function S=asimpson(a,b,ep0)
%
% Adaptive Simpsons method. You need to supply a function in f.m.
%
dx = (b-a)/300; % can be removed
x = a:dx:b; % can be removed
plot(x, f(x)) % can be removed
hold on % can be removed
c=(a+b)/2;
fa=f(a);
plot(a,fa,'o') % can be removed
fb=f(b);
plot(b,fb,'o') % can be removed
fc=f(c);
plot(c,fc,'o') % can be removed
p=0;
h3 = (b-a)/6;
S0 = (fa+4*fc+fb)*h3;
S = rsimpson(a,b,c,fa,fb,fc,S0,ep0,p)
hold off % can be removed
function S=rsimpson(a,b,c,fa,fb,fc,S0,ep0,p)
%
% the real adaptive Simpsons method.
%
m = 8; % m should be less than 15.
d0 = (a+c)/2;
d1 = (c+b)/2;
fd0=f(d0);
plot(d0,fd0,'o') % can be removed
fd1=f(d1);
plot(d1,fd1,'o') % can be removed
h3=(b-a)/12;
S10 = (fa+4*fd0+fc)*h3;
S11 = (fc+4*fd1+fb)*h3;
S1 = S10+S11;
if abs(S1-S0) < m*ep0/2^p
S=S1;
else
p1 = p+1;
T10 = rsimpson(a,c,d0,fa,fc,fd0,S10,ep0,p1);
T11 = rsimpson(c,b,d1,fc,fb,fd1,S11,ep0,p1);
S=T10+T11;
end
Now, for the function f(x) = sin(x^4)/(1 + x), we have
>> asimpson(0,2,0.0001)
ans =
0.1927
Meanwhile, we have the following figure. We can see that the points where f is evaluated (shown as the little circles in the figure) are not uniformly distributed on the x axis.

Figure 4.1: Adaptive Simpson's method applied to the integral of sin(x^4)/(1 + x) on (0, 2).
4.4 Gauss-Legendre quadrature
The two methods (the trapezoidal rule and Simpson's rule) in the first and second sections use equally spaced nodes, that is, x_{k+1} - x_k = h for all k. This seems natural, but actually, if we choose the nodes in a different way, we can get more accurate formulas.

The n-point Gauss-Legendre quadrature formula is

∫_{-1}^{1} f(x) dx ≈ Σ_{k=1}^{n} c_k f(x_k),    (4.2)

where x_1, x_2, ..., x_n are the zeros of the Legendre polynomial of degree n, and c_k is given by

c_k = ∫_{-1}^{1} Π_{j=1..n, j≠k} (x - x_j)/(x_k - x_j) dx.

For n = 1, we have

∫_{-1}^{1} f(x) dx ≈ 2 f(0).

For n = 2, we have

∫_{-1}^{1} f(x) dx ≈ f(-1/√3) + f(1/√3).

The Gauss-Legendre quadrature formula above is only for integrals on [-1, 1], i.e. ∫_{-1}^{1} f(x) dx. In general, we want the integral on [a, b]. This can be achieved by a transformation. We will use t for the variable on [a, b]; thus we let

t = a + [(b - a)/2] (x + 1),

and then

∫_a^b f(t) dt = [(b - a)/2] ∫_{-1}^{1} f( a + [(b - a)/2](x + 1) ) dx ≈ [(b - a)/2] Σ_{k=1}^{n} c_k f( a + [(b - a)/2](x_k + 1) ).
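As a small illustration of this change of variable (the interval [0, 1] and the reuse of f.m from Section 4.1 are example choices), the 3-point rule given later in this section can be applied to a general interval as follows:

% 3-point Gauss-Legendre rule on [a,b] via t = a + (b-a)(x+1)/2.
a = 0;  b = 1;
xk = [-sqrt(0.6), 0, sqrt(0.6)];   % zeros of L_3 on [-1,1]
ck = [5/9, 8/9, 5/9];              % corresponding weights
t  = a + (b-a)*(xk + 1)/2;
I  = (b-a)/2 * sum(ck .* f(t))     % compare with 0.294698182... from Section 4.1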
The Legendre polynomials are defined by:

L_0 = 1,   L_1 = x,   (n + 1) L_{n+1} = (2n + 1) x L_n - n L_{n-1},   for n ≥ 1.

The definition above uses the scaling such that L_n(1) = 1 for all n. Here are some more Legendre polynomials:
L_2 = (3 x^2)/2 - 1/2
L_3 = (5 x^3)/2 - (3 x)/2
L_4 = (35 x^4)/8 - (15 x^2)/4 + 3/8
L_5 = (63 x^5)/8 - (35 x^3)/4 + (15 x)/8
L_6 = (231 x^6)/16 - (315 x^4)/16 + (105 x^2)/16 - 5/16
L_7 = (429 x^7)/16 - (693 x^5)/16 + (315 x^3)/16 - (35 x)/16
L_8 = (6435 x^8)/128 - (3003 x^6)/32 + (3465 x^4)/64 - (315 x^2)/32 + 35/128
L_9 = (12155 x^9)/128 - (6435 x^7)/32 + (9009 x^5)/64 - (1155 x^3)/32 + (315 x)/128
L_10 = (46189 x^10)/256 - (109395 x^8)/256 + (45045 x^6)/128 - (15015 x^4)/128 + (3465 x^2)/256 - 63/256
These Legendre polynomials are orthogonal to each other in the following sense:

∫_{-1}^{1} L_i(x) L_j(x) dx = 0,   when i ≠ j.
The next question is why the Gauss-Legendre quadrature formulas are more accurate. For this purpose, we compare the methods requiring 3 evaluations of f. The first is a composite trapezoidal rule:

∫_{-1}^{1} f(x) dx ≈ (1/2) f(-1) + f(0) + (1/2) f(1).    (4.3)

This formula has no error if f is a polynomial of degree ≤ 1. Now, the elementary Simpson's rule is

∫_{-1}^{1} f(x) dx ≈ (1/3) f(-1) + (4/3) f(0) + (1/3) f(1).    (4.4)

This formula has no error if f is any polynomial of degree ≤ 3 (note: not just 2). The 3-point Gauss-Legendre formula is

∫_{-1}^{1} f(x) dx ≈ (5/9) f(-√0.6) + (8/9) f(0) + (5/9) f(√0.6).    (4.5)

It turns out that this formula has no error if f is any polynomial of degree ≤ 5. These three methods all require evaluating the function f three times; thus, they require the same amount of computation. Clearly, the 3-point Gauss-Legendre formula is the most accurate. In general, we have

Theorem 8 The n-point Gauss-Legendre formula is exact when f(x) is any polynomial of degree less than or equal to 2n - 1.
Instead of a general proof, we will outline the main ideas through an example. We consider n = 3 and

f(x) = x^5 + x^4 - x^2 + x + 2.
1. The cubic Legendre polynomial is L_3(x) = (5x^3 - 3x)/2. It is easy to verify that

∫_{-1}^{1} L_3(x) dx = ∫_{-1}^{1} L_3(x) x dx = ∫_{-1}^{1} L_3(x) x^2 dx = 0.

Hence, if h(x) = a_0 + a_1 x + a_2 x^2 is any quadratic polynomial, we have

∫_{-1}^{1} L_3(x) h(x) dx = a_0 ∫_{-1}^{1} L_3(x) dx + a_1 ∫_{-1}^{1} L_3(x) x dx + a_2 ∫_{-1}^{1} L_3(x) x^2 dx = 0.
2. For the function f(x) given above, we can divide it by L_3(x) and obtain the quotient h(x) and remainder r(x):

f(x) = x^5 + x^4 - x^2 + x + 2
     = (2/5) x^2 [ (5/2) x^3 - (3/2) x ] + (3/5) x^3 + x^4 - x^2 + x + 2
     = (2/5) x^2 L_3(x) + x^4 + (3/5) x^3 - x^2 + x + 2
     = (2/5) x^2 L_3(x) + (2/5) x [ (5/2) x^3 - (3/2) x ] + (3/5) x^2 + (3/5) x^3 - x^2 + x + 2
     = [ (2/5) x^2 + (2/5) x ] L_3(x) + (3/5) x^3 - (2/5) x^2 + x + 2
     = ...
     = [ (2/5) x^2 + (2/5) x + 6/25 ] L_3(x) - (2/5) x^2 + (34/25) x + 2.

Therefore, we have f(x) = h(x) L_3(x) + r(x), where

h(x) = (2/5) x^2 + (2/5) x + 6/25,   r(x) = -(2/5) x^2 + (34/25) x + 2.
3. The three zeros of L_3(x) are

x_1 = -√0.6,   x_2 = 0,   x_3 = √0.6.

Let P_2(x) be the polynomial (of degree ≤ 2) interpolating the 3 points (x_j, f(x_j)) for j = 1, 2, 3. By the Lagrange interpolation formula,

P_2(x) = Σ_{j=1}^{3} f(x_j) Π_{k=1..3, k≠j} (x - x_k)/(x_j - x_k).

On the other hand, r(x) = P_2(x). This is so because

f(x_j) = h(x_j) L_3(x_j) + r(x_j) = h(x_j) · 0 + r(x_j) = r(x_j)

for j = 1, 2, 3. P_2(x) is now considered as the polynomial interpolating (x_j, r(x_j)). But r(x) itself is a quadratic polynomial, so by the uniqueness of polynomial interpolation, P_2(x) must be the same as r(x).
4. If we integrate the equation f(x) = h(x) L_3(x) + r(x), we have

∫_{-1}^{1} f(x) dx = ∫_{-1}^{1} h(x) L_3(x) dx + ∫_{-1}^{1} r(x) dx
                   = 0 + ∫_{-1}^{1} P_2(x) dx
                   = ∫_{-1}^{1} Σ_{j=1}^{3} f(x_j) Π_{k=1..3, k≠j} (x - x_k)/(x_j - x_k) dx
                   = Σ_{j=1}^{3} [ ∫_{-1}^{1} Π_{k=1..3, k≠j} (x - x_k)/(x_j - x_k) dx ] f(x_j)
                   = Σ_{j=1}^{3} c_j f(x_j).
Since the n-point Gauss-Legendre formula is exact for any polynomial of degree ≤ 2n - 1, as a special case we have

∫_{-1}^{1} x^m dx = Σ_{j=1}^{n} c_j x_j^m,   for m = 0, 1, ..., 2n - 1.    (4.6)

These equations give an alternative approach to obtain the nodes {x_k} and coefficients {c_k}: they are 2n equations from which we can solve for the 2n parameters {x_k, c_k}.
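A quick numerical check of (4.6) for the 3-point rule, and of the exactness claim for the example f above, can be done in a few MATLAB lines (the variable names are illustration choices):

% Verify (4.6) for n = 3 and check the example f(x) = x^5 + x^4 - x^2 + x + 2.
xk = [-sqrt(0.6), 0, sqrt(0.6)];  ck = [5/9, 8/9, 5/9];
for m = 0:5
    lhs = (1 - (-1)^(m+1)) / (m+1);        % integral of x^m over [-1,1]
    rhs = sum(ck .* xk.^m);
    fprintf('m = %d:  %.15f  %.15f\n', m, lhs, rhs);
end
g = @(x) x.^5 + x.^4 - x.^2 + x + 2;
exact  = 2/5 - 2/3 + 4;                    % exact integral of g over [-1,1]
approx = sum(ck .* g(xk))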
Chapter 5
Fast Fourier Transform
In this chapter, we first introduce the Discrete Fourier Transform (DFT). It can be regarded as the multiplication of a vector by a square matrix F. In the first section, we give a proof of a formula for F^{-1}. Usually, the number of operations needed to multiply an N × N matrix with a vector is about 2N^2. The FFT (Fast Fourier Transform) is a special technique for multiplying by the matrix F; the required number of operations is only about 5N log_2 N.
5.1 Fourier series
Let f be defined on (-π, π); then, under suitable conditions,

f(x) = a_0/2 + Σ_{j=1}^{∞} [ a_j cos(jx) + b_j sin(jx) ],

where

a_j = (1/π) ∫_{-π}^{π} f(x) cos(jx) dx,   j ≥ 0,
b_j = (1/π) ∫_{-π}^{π} f(x) sin(jx) dx,   j > 0.

If f is even on (-π, π), then b_j = 0 and

f(x) = a_0/2 + Σ_{j=1}^{∞} a_j cos(jx),

and the formula for a_j can be reduced to

a_j = (2/π) ∫_{0}^{π} f(x) cos(jx) dx.

If f is odd on (-π, π), then a_j = 0 and

f(x) = Σ_{j=1}^{∞} b_j sin(jx),

and the formula for b_j can be reduced to

b_j = (2/π) ∫_{0}^{π} f(x) sin(jx) dx.

The complex version of the standard Fourier series on (-π, π) is

f(x) = Σ_{j=-∞}^{∞} f̂_j e^{ijx},   where   f̂_j = (1/(2π)) ∫_{-π}^{π} f(x) e^{-ijx} dx.

If the function f is defined on (0, 2π), the above is still true if the integration interval for the coefficients a_j, b_j and f̂_j is changed from (-π, π) to (0, 2π). We notice that

a_j = f̂_j + f̂_{-j},   b_j = i ( f̂_j - f̂_{-j} ).
5.2 Discrete Fourier Transform
For a given integer N > 0, we consider the relationship

    f_k = \sum_{j=0}^{N-1} \hat{f}_j e^{i 2\pi jk/N},   for k = 0, 1, 2, ..., N-1.   (5.1)

Then, we will prove that the following is true:

    \hat{f}_j = \frac{1}{N} \sum_{k=0}^{N-1} f_k e^{-i 2\pi jk/N}.   (5.2)

To obtain the above coefficient formula for \hat{f}_j, we first re-write condition (5.1) as a system of N equations for the N coefficients {\hat{f}_j}, assuming f_k are given for k = 0, 1, ..., N-1. More precisely, let

    \omega = e^{i 2\pi/N};

then e^{i 2\pi jk/N} = \omega^{jk} and

    \begin{pmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_{N-1} \end{pmatrix} = F \begin{pmatrix} \hat{f}_0 \\ \hat{f}_1 \\ \hat{f}_2 \\ \vdots \\ \hat{f}_{N-1} \end{pmatrix}

where

    F = \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega & \omega^2 & \cdots & \omega^{N-1} \\ 1 & \omega^2 & \omega^4 & \cdots & \omega^{2(N-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & \omega^{N-1} & \omega^{2(N-1)} & \cdots & \omega^{(N-1)(N-1)} \end{pmatrix}.

This implies that

    \begin{pmatrix} \hat{f}_0 \\ \hat{f}_1 \\ \hat{f}_2 \\ \vdots \\ \hat{f}_{N-1} \end{pmatrix} = F^{-1} \begin{pmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_{N-1} \end{pmatrix}.

However, we can prove that the following is true:

    \frac{1}{N} F \bar{F} = I,   or   F^{-1} = \frac{1}{N} \bar{F},   (5.3)

where \bar{F} is the complex conjugate of F. The (j, k) entry of F is simply \omega^{(j-1)(k-1)}, where \omega = e^{i 2\pi/N}. With this formula for F^{-1}, equation (5.2) should be clear.

A matrix Q is called unitary if

    Q^{-1} = \bar{Q}^T,

that is, the inverse of Q is its transpose and complex conjugate. If Q is real and unitary, so that Q^{-1} = Q^T, then Q is an orthogonal matrix. We notice that \frac{1}{\sqrt{N}} F is a unitary matrix.

To prove (5.3), we notice that the (j+1, k+1) entry of F\bar{F} is

    1 + \omega^j \bar{\omega}^k + \omega^{2j} \bar{\omega}^{2k} + \omega^{3j} \bar{\omega}^{3k} + \cdots + \omega^{(N-1)j} \bar{\omega}^{(N-1)k} = 1 + t + t^2 + \cdots + t^{N-1},

where

    t = e^{i 2\pi (j-k)/N}.

If j = k, then t = 1 and the above sum is N. This gives the correct diagonal entries. If j \ne k, then t \ne 1 and

    1 + t + t^2 + \cdots + t^{N-1} = \frac{1 - t^N}{1 - t} = \frac{1 - 1}{1 - t} = 0,

since t^N = e^{i 2\pi (j-k)} = 1. Therefore, the off-diagonal entries of F\bar{F} are all zero.

Since (5.1) and (5.2) are simply matrix vector multiplications (with the matrix F or \frac{1}{N}\bar{F}), they can be easily evaluated with about 2N^2 operations. FFT is an algorithm for evaluating (5.1) or (5.2) with only O(N \log N) operations. Because N is very large in many practical applications, this makes a major difference. FFT is extremely useful for signal processing and numerical solution of PDEs.
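As a sketch (not from the notes), the following MATLAB lines form F explicitly and check (5.1)-(5.3) numerically for a small N; the variable names are arbitrary:

N = 8;
omega = exp(1i*2*pi/N);
F = omega .^ ((0:N-1)' * (0:N-1));    % (j+1,k+1) entry is omega^(j*k)
fhat = randn(N,1);                    % some coefficients
f = F * fhat;                         % equation (5.1)
fhat2 = (1/N) * conj(F) * f;          % equation (5.2)
disp(norm(fhat - fhat2))              % should be at the level of rounding errors
disp(norm(F*conj(F)/N - eye(N)))      % checks (5.3)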
5.3 Fast Fourier Transform
FFT (Fast Fourier Transform) is a fast method to evaluate the discrete Fourier transform (5.1) or (5.2), discovered by Gauss in 1805 at the age of 28, but unpublished. This idea was lost until it was re-discovered in 1965 by Cooley and Tukey. The standard procedure for evaluating (5.1) or (5.2) multiplies the matrix F or \frac{1}{N}\bar{F} with a vector. This requires about 2N^2 operations. FFT requires O(N \log N) operations.

The basic idea of FFT is divide-and-conquer. The sequence of N numbers is divided into two sequences of N/2 numbers. The discrete Fourier transforms of these two shorter sequences are first calculated (recursively), then one combines the results to obtain the final result.

We will concentrate on (5.1). The objective here is to quickly calculate {f_k} assuming {\hat{f}_j} is given. The treatment for (5.2) is similar.

The main idea is: to calculate the discrete Fourier transform of a sequence of N numbers, we first calculate the discrete Fourier transforms of two sequences of length N/2, then combine them to obtain the result. Let us consider the case of N = 8. For

    \omega = \omega_8 = e^{i 2\pi/8} = \cos(\pi/4) + i \sin(\pi/4) = \frac{1}{\sqrt{2}} + i \frac{1}{\sqrt{2}}

we need to calculate f_0, f_1, ..., f_7 if \hat{f}_0, \hat{f}_1, ..., \hat{f}_7 are given, by formula (5.1), or

    \begin{pmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_7 \end{pmatrix} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & \omega & \cdots & \omega^7 \\ 1 & \omega^2 & \cdots & \omega^{14} \\ \vdots & \vdots & & \vdots \\ 1 & \omega^7 & \cdots & \omega^{49} \end{pmatrix} \begin{pmatrix} \hat{f}_0 \\ \hat{f}_1 \\ \hat{f}_2 \\ \vdots \\ \hat{f}_7 \end{pmatrix}.

Alternatively, we can explicitly write down

    f_j = \hat{f}_0 + \hat{f}_1 \omega^j + \hat{f}_2 \omega^{2j} + \cdots + \hat{f}_7 \omega^{7j}
        = \hat{f}_0 + \hat{f}_2 \omega^{2j} + \hat{f}_4 \omega^{4j} + \hat{f}_6 \omega^{6j} + \omega^j [\, \hat{f}_1 + \hat{f}_3 \omega^{2j} + \hat{f}_5 \omega^{4j} + \hat{f}_7 \omega^{6j} \,].

Let

    \omega_4 = e^{i 2\pi/4} = \omega_8^2 = i;

we have

    f_j = \hat{f}_0 + \hat{f}_2 \omega_4^j + \hat{f}_4 \omega_4^{2j} + \hat{f}_6 \omega_4^{3j} + \omega^j [\, \hat{f}_1 + \hat{f}_3 \omega_4^j + \hat{f}_5 \omega_4^{2j} + \hat{f}_7 \omega_4^{3j} \,].   (5.4)

We can define the discrete Fourier transforms of length 4, for {\hat{f}_0, \hat{f}_2, \hat{f}_4, \hat{f}_6} and {\hat{f}_1, \hat{f}_3, \hat{f}_5, \hat{f}_7}, say

    \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4 & \omega_4^2 & \omega_4^3 \\ 1 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\ 1 & \omega_4^3 & \omega_4^6 & \omega_4^9 \end{pmatrix} \begin{pmatrix} \hat{f}_0 \\ \hat{f}_2 \\ \hat{f}_4 \\ \hat{f}_6 \end{pmatrix}

and

    \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4 & \omega_4^2 & \omega_4^3 \\ 1 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\ 1 & \omega_4^3 & \omega_4^6 & \omega_4^9 \end{pmatrix} \begin{pmatrix} \hat{f}_1 \\ \hat{f}_3 \\ \hat{f}_5 \\ \hat{f}_7 \end{pmatrix}.

Now, for j = 0, 1, 2, 3, equation (5.4) can be written as

    f_j = \alpha_j + \omega^j \beta_j.

But \alpha_j and \beta_j are not defined for j > 3. However, we notice that

    \omega_4^j = \omega_4^{j-4}  and  \omega_8^j = -\omega_8^{j-4}.

Therefore, we have

    f_{j+4} = \alpha_j - \omega^j \beta_j   for j = 0, 1, 2, 3.

In general, for a given N-vector \hat{f}, we first calculate the discrete Fourier transforms of the two vectors

    (\hat{f}_0, \hat{f}_2, ..., \hat{f}_{N-2})  and  (\hat{f}_1, \hat{f}_3, ..., \hat{f}_{N-1}).

This is a recursive process. Namely, the computation of the discrete Fourier transform of the N/2-vectors should follow the same procedure and reduce to discrete Fourier transforms of N/4-vectors. When the results for the N/2-vectors are available, say \alpha_j, \beta_j for j = 0, 1, ..., N/2 - 1, we proceed with

    f_j = \alpha_j + \omega^j \beta_j,
    f_{N/2+j} = \alpha_j - \omega^j \beta_j,

where \omega = \omega_N = e^{i 2\pi/N}. Assume the powers of \omega are already calculated. The above combination step requires N/2 complex multiplications and N complex additions (or subtractions). Since each complex multiplication requires 6 real operations and each complex addition requires 2 real operations, the total required number of operations for the combination step is 5N. If we denote by T_N the total required number of (real) operations for the FFT of an N-vector, we have

    T_N = 2 T_{N/2} + 5N

and T_1 = 0. Let us assume that N is an integer power of 2; from the equation above, we also have

    T_{N/2} = 2 T_{N/4} + 5N/2.

This leads to

    T_N = 2^2 T_{N/4} + 2 \cdot 5N.

Similarly,

    T_N = 2^3 T_{N/2^3} + 3 \cdot 5N.

Eventually, if N = 2^p or p = \log_2 N,

    T_N = 2^p T_{N/2^p} + p \cdot 5N = N T_1 + 5N \log_2 N = 5N \log_2 N.
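A minimal recursive sketch of this divide-and-conquer idea is given below (not from the notes; myfft is a hypothetical name, fhat is assumed to be a column vector whose length is a power of 2, and the sign convention follows (5.1), which is opposite to MATLAB's built-in fft):

function f = myfft(fhat)
% Evaluate (5.1) recursively by splitting into even- and odd-indexed coefficients.
N = length(fhat);
if N == 1
    f = fhat;
    return
end
alpha = myfft(fhat(1:2:N-1));            % DFT of fhat_0, fhat_2, ...
beta  = myfft(fhat(2:2:N));              % DFT of fhat_1, fhat_3, ...
w = exp(1i*2*pi/N) .^ (0:N/2-1).';       % omega^j for j = 0, ..., N/2-1
f = [alpha + w.*beta; alpha - w.*beta];  % the combination step
end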
5.4 Matrix Factorization for FFT
In the previous section, we presented FFT as a recursive algorithm. In this section, we describe a sparse matrix factorization for FFT. This leads to a simple explanation for the efficiency of the FFT algorithm, and it can also be used for a non-recursive implementation of FFT.

We start with the DFT for N = 8. We have

    (f_0, f_1, ..., f_7)^T = F_8 (\hat{f}_0, \hat{f}_1, ..., \hat{f}_7)^T = D_8 (\alpha_0, \alpha_1, \alpha_2, \alpha_3, \beta_0, \beta_1, \beta_2, \beta_3)^T,

    D_8 = \begin{pmatrix} I & W \\ I & -W \end{pmatrix},   W = \begin{pmatrix} 1 & & & \\ & \omega_8 & & \\ & & \omega_8^2 & \\ & & & \omega_8^3 \end{pmatrix}.

In the above, F_8 is the matrix F given in section 5.2 for the DFT at N = 8, \alpha_j and \beta_j are given by the DFTs of the even and odd coefficients as in section 5.3, D_8 is the 8 x 8 matrix in the middle of the above equation, and it is related to \omega_8 = e^{i\pi/4}. If we use the matrix F_4 for the DFT at N = 4, we can get rid of \alpha_j and \beta_j:

    (f_0, f_1, ..., f_7)^T = F_8 (\hat{f}_0, ..., \hat{f}_7)^T = D_8 \begin{pmatrix} F_4 & \\ & F_4 \end{pmatrix} (\hat{f}_0, \hat{f}_2, \hat{f}_4, \hat{f}_6, \hat{f}_1, \hat{f}_3, \hat{f}_5, \hat{f}_7)^T.

We can see that there is a similar (and simpler) relation for F_4. This gives rise to

    (f_0, f_1, ..., f_7)^T = F_8 (\hat{f}_0, ..., \hat{f}_7)^T = D_8 \begin{pmatrix} D_4 & \\ & D_4 \end{pmatrix} \begin{pmatrix} F_2 & & & \\ & F_2 & & \\ & & F_2 & \\ & & & F_2 \end{pmatrix} (\hat{f}_0, \hat{f}_4, \hat{f}_2, \hat{f}_6, \hat{f}_1, \hat{f}_5, \hat{f}_3, \hat{f}_7)^T,

where

    D_4 = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & i \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -i \end{pmatrix},   F_2 = D_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.

If we introduce a matrix P_8 such that

    P_8 (\hat{f}_0, \hat{f}_1, \hat{f}_2, \hat{f}_3, \hat{f}_4, \hat{f}_5, \hat{f}_6, \hat{f}_7)^T = (\hat{f}_0, \hat{f}_4, \hat{f}_2, \hat{f}_6, \hat{f}_1, \hat{f}_5, \hat{f}_3, \hat{f}_7)^T,

then

    F_8 = D_8 \begin{pmatrix} D_4 & \\ & D_4 \end{pmatrix} \begin{pmatrix} D_2 & & & \\ & D_2 & & \\ & & D_2 & \\ & & & D_2 \end{pmatrix} P_8.

This is the sparse matrix factorization for F_8. The matrix P_8 is the matrix representing the bit-reversal operation. For an integer 0 \le n < 8, we can write it in binary form:

    n = (abc)_2 = 4a + 2b + c,

where a, b, c \in {0, 1}. The bit-reversal operation gives (cba)_2 = 4c + 2b + a. The sequence (0, 4, 2, 6, 1, 5, 3, 7) can be obtained from (0, 1, 2, 3, 4, 5, 6, 7) by the bit-reversal operation.

In general, if N is an integer power of 2, we have the following sparse matrix factorization for F_N:

    F_N = D_N \begin{pmatrix} D_{N/2} & \\ & D_{N/2} \end{pmatrix} \cdots \begin{pmatrix} D_2 & & & \\ & D_2 & & \\ & & \ddots & \\ & & & D_2 \end{pmatrix} P_N,

where P_N is the N x N matrix for the bit-reversal operation, and D_N is defined based on \omega = e^{i 2\pi/N} as

    D_N = \begin{pmatrix} I & W \\ I & -W \end{pmatrix},   W = \begin{pmatrix} 1 & & & \\ & \omega & & \\ & & \ddots & \\ & & & \omega^{N/2-1} \end{pmatrix},

where I is the (N/2) x (N/2) identity matrix.
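The bit-reversal ordering behind P_N can be generated with a short loop. The sketch below (not from the notes) assumes N is a power of 2 and only illustrates the (abc)_2 -> (cba)_2 rule; the Signal Processing Toolbox function bitrevorder does the same job:

N = 8;
p = log2(N);
perm = zeros(1, N);
for n = 0:N-1
    bits = bitget(n, 1:p);                   % binary digits of n, least significant first
    perm(n+1) = sum(bits .* 2.^(p-1:-1:0));  % reversed digits give the new index
end
disp(perm)    % for N = 8 this prints 0 4 2 6 1 5 3 7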
5.5 Derivative based on Discrete Fourier Transform
Let us assume that f(x) is a periodic function with period L, but we only know N points of this function:

    f(x_j)  for  x_j = \frac{jL}{N},  j = 0, 1, ..., N-1.

How do we find an approximation of f' at these N points?

1. We first approximate f by g, which is a sum of sines and cosines, given in the complex form:

    g(x) = \sum_{k=-N/2}^{N/2-1} \hat{f}_k e^{i 2\pi k x/L}.

To calculate the coefficients \hat{f}_k, we require that

    f(x_j) = g(x_j) = \sum_{k=-N/2}^{N/2-1} \hat{f}_k e^{i 2\pi k x_j/L} = \sum_{k=-N/2}^{N/2-1} \hat{f}_k e^{i 2\pi jk/N}.

Some of these k are negative, so we separate them:

    f(x_j) = f_j = \sum_{k=0}^{N/2-1} \hat{f}_k e^{i 2\pi jk/N} + \sum_{k=-N/2}^{-1} \hat{f}_k e^{i 2\pi jk/N}.

We let k := k + N in the second term, so that

    f_j = \sum_{k=0}^{N/2-1} \hat{f}_k e^{i 2\pi jk/N} + \sum_{k=N/2}^{N-1} \hat{f}_{k-N} e^{i 2\pi j(k-N)/N} = \sum_{k=0}^{N/2-1} \hat{f}_k e^{i 2\pi jk/N} + \sum_{k=N/2}^{N-1} \hat{f}_{k-N} e^{i 2\pi jk/N}.

If we define \tilde{f}_k by

    \tilde{f}_k = \hat{f}_k  for  k = 0, 1, ..., \frac{N}{2} - 1,    \tilde{f}_k = \hat{f}_{k-N}  for  k = \frac{N}{2}, \frac{N}{2} + 1, ..., N - 1,

then the two terms can be combined:

    f_j = \sum_{k=0}^{N-1} \tilde{f}_k e^{i 2\pi jk/N}.

We use the matrix F for the discrete Fourier transform, where the (j+1, k+1) entry of F is \omega^{jk} = e^{i 2\pi jk/N}. The above is

    ( f(x_0), f(x_1), ..., f(x_{N-1}) )^T = F ( \tilde{f}_0, \tilde{f}_1, ..., \tilde{f}_{N-1} )^T.

The inverse of F is \frac{1}{N} \bar{F}. Thus,

    \tilde{f}_k = \frac{1}{N} \sum_{j=0}^{N-1} f_j e^{-i 2\pi jk/N}.

Once \tilde{f}_k is found, we can find \hat{f}_k.

2. We approximate the derivatives of f by derivatives of g. Namely,

    f'(x_j) \approx g'(x_j) = \sum_{k=-N/2}^{N/2-1} \frac{i 2\pi k}{L} \hat{f}_k e^{i 2\pi k x_j/L} = \sum_{k=-N/2}^{N/2-1} \frac{i 2\pi k}{L} \hat{f}_k e^{i 2\pi jk/N}.

Once again, we separate the negative and positive values of k and obtain

    f'(x_j) \approx \sum_{k=0}^{N/2-1} \frac{i 2\pi k}{L} \hat{f}_k e^{i 2\pi jk/N} + \sum_{k=N/2}^{N-1} \frac{i 2\pi (k-N)}{L} \hat{f}_{k-N} e^{i 2\pi jk/N},

where for the second sum we have reset k as k - N. If we use \tilde{f}_k instead, we have

    f'(x_j) \approx \sum_{k=0}^{N/2-1} \frac{i 2\pi k}{L} \tilde{f}_k e^{i 2\pi jk/N} + \sum_{k=N/2}^{N-1} \frac{i 2\pi (k-N)}{L} \tilde{f}_k e^{i 2\pi jk/N}.

Let us define the following matrix

    D = \frac{i 2\pi}{L} \, \mathrm{diag}\left( 0, 1, ..., \frac{N}{2} - 1, -\frac{N}{2}, -\frac{N}{2} + 1, ..., -1 \right);

then

    ( f'(x_0), ..., f'(x_{N-1}) )^T \approx F D ( \tilde{f}_0, ..., \tilde{f}_{N-1} )^T = F D F^{-1} ( f(x_0), ..., f(x_{N-1}) )^T = \frac{1}{N} F D \bar{F} ( f(x_0), ..., f(x_{N-1}) )^T.

Similarly, we have

    ( f''(x_0), ..., f''(x_{N-1}) )^T \approx \frac{1}{N} F D^2 \bar{F} ( f(x_0), ..., f(x_{N-1}) )^T,
    ( f'''(x_0), ..., f'''(x_{N-1}) )^T \approx \frac{1}{N} F D^3 \bar{F} ( f(x_0), ..., f(x_{N-1}) )^T,

and so on.
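In practice the matrices F and D are never formed; one applies the transforms directly. The sketch below is not from the notes; it uses MATLAB's built-in fft and ifft, and since MATLAB's fft computes sums with e^{-i 2\pi jk/N}, fft(f) equals N times the coefficients \tilde{f}_k of the text while ifft performs the reconstruction:

L = 2*pi;  N = 32;
x = (0:N-1)' * L / N;
f = exp(sin(x));                            % a smooth periodic test function
fhatN = fft(f);                             % N times the coefficients tilde-f_k
k = [0:N/2-1, -N/2:-1]';                    % diagonal of D, up to the factor i*2*pi/L
df = real(ifft(1i*2*pi/L * k .* fhatN));    % approximation of f'(x_j)
disp(max(abs(df - cos(x).*exp(sin(x)))))    % error against the exact derivative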
Chapter 6
Linear Equations
In this chapter, we study algorithms for linear system of equations:
Ax = b
where A is an invertible square matrix, b is a given vector and x is the unknown vector. Our motivation
is to design efficient and reliable algorithms for computers to solve the problem. Because we are using a finite precision floating point number system, some algorithms may produce wrong answers, particularly
when the matrix A is near singular. The method we describe in this chapter is a variant of Gaussian
elimination. It corresponds to
PA = LU
where P is a permutation matrix, L is a lower triangular matrix and U is an upper triangular matrix.
6.1 Triangular system
Let A be a non-singular lower triangular matrix. That is,

    a_{ij} = 0  if  i < j.

The linear system Ax = b can be easily solved. We can solve x_1 from the first equation

    a_{11} x_1 = b_1,

then solve x_2 from the second equation

    a_{21} x_1 + a_{22} x_2 = b_2.

In general, after x_1, ..., x_{k-1} are solved, we use the k-th equation to solve x_k. Namely,

    a_{k1} x_1 + a_{k2} x_2 + \cdots + a_{kk} x_k = b_k.

This gives rise to

    x_k = \frac{1}{a_{kk}} ( b_k - a_{k1} x_1 - a_{k2} x_2 - \cdots - a_{k,k-1} x_{k-1} ).

This method is called forward substitution.

Similarly, if A is upper triangular, we can use backward substitution. Namely, we calculate x_n, then x_{n-1}, ..., and finally x_1.
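A minimal MATLAB sketch of forward substitution (not from the notes; forwsub is a hypothetical name and A is assumed to be non-singular lower triangular):

function x = forwsub(A, b)
n = length(b);
x = zeros(n, 1);
x(1) = b(1) / A(1,1);
for k = 2:n
    % use the already computed x_1, ..., x_{k-1}
    x(k) = (b(k) - A(k,1:k-1) * x(1:k-1)) / A(k,k);
end
end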
6.2 LU decomposition
When A is a general non-singular matrix, we can solve Ax = b by Gaussian elimination. This is related to the LU decomposition of the matrix A. The LU decomposition is

    A = LU

where L is a lower triangular matrix with 1s on the diagonal and U is an upper triangular matrix.

To understand the LU decomposition, let us write down the matrix A in the following form:

    A = \begin{pmatrix} a & c^T \\ b & D \end{pmatrix}.

If the (i, j) entry of the n x n matrix A is denoted as a_{ij}, then

    a = a_{11},   b = \begin{pmatrix} a_{21} \\ \vdots \\ a_{n1} \end{pmatrix},   c^T = [a_{12}, ..., a_{1n}],   D = \begin{pmatrix} a_{22} & \cdots & a_{2n} \\ \vdots & & \vdots \\ a_{n2} & \cdots & a_{nn} \end{pmatrix}.

If a \ne 0, we can verify that

    A = \begin{pmatrix} 1 & 0 \\ b/a & I \end{pmatrix} \begin{pmatrix} a & c^T \\ 0 & D - bc^T/a \end{pmatrix}.

In the above, I is the (n-1) x (n-1) identity matrix and 0 denotes a zero row or column vector. Now, if D - bc^T/a has an LU decomposition, namely,

    D - \frac{bc^T}{a} = L_1 U_1

where L_1 is a lower triangular matrix with 1s on the diagonal (called a unit lower triangular matrix) and U_1 is an upper triangular matrix, then

    A = \begin{pmatrix} 1 & 0 \\ b/a & I \end{pmatrix} \begin{pmatrix} a & c^T \\ 0 & L_1 U_1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ b/a & L_1 \end{pmatrix} \begin{pmatrix} a & c^T \\ 0 & U_1 \end{pmatrix} = LU

where

    L = \begin{pmatrix} 1 & 0 \\ b/a & L_1 \end{pmatrix},   U = \begin{pmatrix} a & c^T \\ 0 & U_1 \end{pmatrix}.

Therefore, the LU decomposition can be calculated in a recursive way:

- calculate b/a and D - (b/a)c^T;
- calculate the LU decomposition of the smaller matrix D - (b/a)c^T.

Finally, the LU decomposition algorithm also tries to save memory space by putting L and U in one matrix. After the first step above is completed, we reset A as

    A := \begin{pmatrix} a & c^T \\ b/a & D - bc^T/a \end{pmatrix}.

When the LU decomposition algorithm is completed, the memory space of the original matrix A is replaced by L and U, where U occupies the memory space of A for the diagonal and above. The entries of A below the main diagonal store the entries of L. Since the diagonal entries of L are 1s, we do not need to store them.
As an example, we start with the matrix A given below:

    A = \begin{pmatrix} 2 & 1 & 1 \\ 4 & 1 & 4 \\ 1 & -5/2 & 11/2 \end{pmatrix}.

In the first step, we calculate b/a and D - bc^T/a and replace A by

    A := \begin{pmatrix} 2 & 1 & 1 \\ 2 & -1 & 2 \\ 1/2 & -3 & 5 \end{pmatrix}.

In the second step, we find the LU decomposition of the 2 x 2 block of the new A above (i.e. D - bc^T/a). This leads to

    A := \begin{pmatrix} 2 & 1 & 1 \\ 2 & -1 & 2 \\ 1/2 & 3 & -1 \end{pmatrix}.

The above matrix contains both L and U, which are

    L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 1/2 & 3 & 1 \end{pmatrix},   U = \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & 2 \\ 0 & 0 & -1 \end{pmatrix}.

For an n x n matrix A, the LU decomposition algorithm produces the matrices L and U stored in the memory space of A as follows:

For k = 1, 2, ..., n-1,
  For i = k+1, ..., n,
    a_{ik} := a_{ik} / a_{kk}.   (6.1)
  For i = k+1, ..., n and for j = k+1, ..., n,
    a_{ij} := a_{ij} - a_{ik} a_{kj}.   (6.2)

If we consider k = 1, then we calculate b/a in (6.1) and calculate D - (b/a)c^T in (6.2).
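The two update formulas translate directly into MATLAB. The sketch below (not from the notes; lunopivot is a hypothetical name and all pivots a_{kk} are assumed non-zero) overwrites A with L below the diagonal and U on and above the diagonal:

function A = lunopivot(A)
n = size(A, 1);
for k = 1:n-1
    A(k+1:n, k) = A(k+1:n, k) / A(k, k);                            % (6.1)
    A(k+1:n, k+1:n) = A(k+1:n, k+1:n) - A(k+1:n, k) * A(k, k+1:n);  % (6.2)
end
end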
To solve Ax = b, we first compute the LU decomposition of A, then solve Ly = b, and finally solve Ux = y. To find A^{-1}, we let A^{-1} = (k_1, k_2, ..., k_n), then solve

    A k_1 = \begin{pmatrix} 1 \\ \vdots \\ 0 \end{pmatrix}, ..., A k_n = \begin{pmatrix} 0 \\ \vdots \\ 1 \end{pmatrix}.

The LU decomposition of A is carried out first, and this is done only once.

The determinant of A can be easily obtained:

    \det(A) = \prod_{j=1}^{n} u_{jj},

where u_{jj} is the j-th diagonal element of U.

If A is symmetric and its LU decomposition is A = LU, then we can prove that

    U = \begin{pmatrix} u_{11} & & & \\ & u_{22} & & \\ & & \ddots & \\ & & & u_{nn} \end{pmatrix} L^T = D L^T,

where u_{11}, u_{22}, ..., u_{nn} are the diagonal entries of U. Therefore, we have

    A = L D L^T.

If the matrix A is symmetric positive definite and it has the decomposition A = L D L^T, then the diagonal entries of D are all positive. We have

    A = L \begin{pmatrix} u_{11} & & \\ & \ddots & \\ & & u_{nn} \end{pmatrix} L^T = L \begin{pmatrix} \sqrt{u_{11}} & & \\ & \ddots & \\ & & \sqrt{u_{nn}} \end{pmatrix}^2 L^T = S S^T,

where

    S = L \begin{pmatrix} \sqrt{u_{11}} & & \\ & \ddots & \\ & & \sqrt{u_{nn}} \end{pmatrix}.

This is the Cholesky decomposition.
If A is tridiagonal, its LU decomposition takes a specially simple form. We consider the case when A is also symmetric. Let

    A = \begin{pmatrix} a_1 & b_1 & & \\ b_1 & a_2 & \ddots & \\ & \ddots & \ddots & b_{n-1} \\ & & b_{n-1} & a_n \end{pmatrix} = L D L^T = \begin{pmatrix} 1 & & & \\ l_1 & 1 & & \\ & \ddots & \ddots & \\ & & l_{n-1} & 1 \end{pmatrix} \begin{pmatrix} d_1 & & & \\ & d_2 & & \\ & & \ddots & \\ & & & d_n \end{pmatrix} \begin{pmatrix} 1 & l_1 & & \\ & 1 & \ddots & \\ & & \ddots & l_{n-1} \\ & & & 1 \end{pmatrix}.

We obtain the formulas:

    d_1 = a_1,
    d_{j+1} = a_{j+1} - \frac{b_j^2}{d_j},
    l_j = \frac{b_j}{d_j},

for j = 1, 2, ..., n - 1.
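These formulas can be coded in a few lines. The sketch below is not from the notes and uses made-up data for the diagonal a and off-diagonal b:

a = [4 4 4 4];  b = [1 1 1];      % hypothetical symmetric tridiagonal matrix
n = length(a);
d = zeros(1, n);  l = zeros(1, n-1);
d(1) = a(1);
for j = 1:n-1
    l(j)   = b(j) / d(j);             % subdiagonal entry of L
    d(j+1) = a(j+1) - b(j)^2 / d(j);  % diagonal entry of D
end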
6.3 LU decomposition with partial pivoting
The problem with the LU decomposition is that some matrices do not have an LU decomposition. For example,

    A = \begin{pmatrix} 0 & 1 \\ 2 & 3 \end{pmatrix}.

Another problem is that even when A has an LU decomposition, a small pivot (the diagonal element used for creating zeros below) can lead to inaccurate solutions. This is the reason for partial pivoting. It is simply a row exchange, so that the largest element (in absolute value) in the column (the diagonal element and all elements below it) is used as the pivot.

For a given n x n matrix A, the LU decomposition with partial pivoting is

    PA = LU

where L is a unit lower triangular matrix, U is an upper triangular matrix, and P is a permutation matrix. The following algorithm overwrites A with L and U. Since L is a unit lower triangular matrix, the diagonal entries of L always equal 1 and they do not need to be stored. Thus, we use the lower triangular part of A (without the diagonal) to store the matrix L and use the upper triangular part of A (including the diagonal) to store the matrix U. The matrix P is not explicitly formed. Instead, we use an integer vector p (of length n-1) for the pivoting index. If, for the 3rd column, we exchanged the 3rd row with the 5th row, then p(3) = 5. The algorithm is as follows:

For k = 1, 2, ..., n-1,
  Find p_k such that |a_{p_k, k}| = max{ |a_{kk}|, |a_{k+1,k}|, ..., |a_{nk}| }.
  If p_k > k, swap the k-th row with the p_k-th row.
  Reset the k-th column for the matrix L:
    a_{ik} := a_{ik} / a_{kk}  for  i = k+1, ..., n.
  Update the trailing (n-k) x (n-k) matrix:
    a_{ij} := a_{ij} - a_{ik} a_{kj}  for  i, j = k+1, ..., n.
Once the decomposition PA = LU is found, the equation Ax = b can be easily solved. The following algorithm overwrites b with the solution x:

Overwrite b by Pb:
  If p(i) > i, swap b_i with b_{p_i}, for i = 1, 2, ..., n-1.
Overwrite Pb by y from Ly = Pb using forward substitution:
  b_i := b_i - a_{i1} b_1 - a_{i2} b_2 - \cdots - a_{i,i-1} b_{i-1},  for i = 2, 3, ..., n.
Overwrite y by x from Ux = y using backward substitution:
  b_i := [\, b_i - a_{i,i+1} b_{i+1} - \cdots - a_{in} b_n \,] / a_{ii}  for i = n, n-1, ..., 1.
Here is an example. The successive stages of the algorithm (the original matrix, the row exchange for k = 1, the elimination for k = 1, the row exchange for k = 2, and the elimination for k = 2) are

    A = \begin{pmatrix} 1 & 5 & -3 \\ 2 & 4 & 2 \\ 3 & 0 & 1 \end{pmatrix} \to \begin{pmatrix} 3 & 0 & 1 \\ 2 & 4 & 2 \\ 1 & 5 & -3 \end{pmatrix} \to \begin{pmatrix} 3 & 0 & 1 \\ 2/3 & 4 & 4/3 \\ 1/3 & 5 & -10/3 \end{pmatrix} \to \begin{pmatrix} 3 & 0 & 1 \\ 1/3 & 5 & -10/3 \\ 2/3 & 4 & 4/3 \end{pmatrix} \to \begin{pmatrix} 3 & 0 & 1 \\ 1/3 & 5 & -10/3 \\ 2/3 & 4/5 & 4 \end{pmatrix}.

Finally,

    L = \begin{pmatrix} 1 & & \\ 1/3 & 1 & \\ 2/3 & 4/5 & 1 \end{pmatrix},   U = \begin{pmatrix} 3 & 0 & 1 \\ & 5 & -10/3 \\ & & 4 \end{pmatrix}.

Also, we found p_1 = 3 and p_2 = 3. We can obtain the matrix P as follows: P = P_2 P_1, where P_1 is the permutation matrix that exchanges the 1st and 3rd rows, and P_2 is the permutation matrix that exchanges the 2nd and 3rd rows:

    P_1 = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix},   P_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix},   P = P_2 P_1 = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}.

If we start with the 3 x 3 identity matrix, exchange the 1st row with the 3rd row, then exchange the 2nd and 3rd rows, we also get the right P.

To solve a system of equations Ax = b, we can use the decomposition PA = LU. First we find Pb, then we solve Ly = Pb, and finally we solve Ux = y. Let us consider:

    \begin{pmatrix} 1 & 5 & -3 \\ 2 & 4 & 2 \\ 3 & 0 & 1 \end{pmatrix} x = \begin{pmatrix} 4 \\ 10 \\ 7 \end{pmatrix}.

We have

    Pb = \begin{pmatrix} 7 \\ 4 \\ 10 \end{pmatrix}.

Now solve y from

    \begin{pmatrix} 1 & & \\ 1/3 & 1 & \\ 2/3 & 4/5 & 1 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 7 \\ 4 \\ 10 \end{pmatrix}.

We get

    \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 7 \\ 5/3 \\ 4 \end{pmatrix}.

Finally, we solve

    \begin{pmatrix} 3 & 0 & 1 \\ & 5 & -10/3 \\ & & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 7 \\ 5/3 \\ 4 \end{pmatrix}.

We get

    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix}.
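This example can be checked with MATLAB's built-in lu, which computes the same kind of PA = LU factorization (a sketch, not from the notes; the entry A(1,3) = -3 is the sign-reconstructed value used above):

A = [1 5 -3; 2 4 2; 3 0 1];
b = [4; 10; 7];
[L, U, P] = lu(A);     % P*A = L*U with partial pivoting
y = L \ (P*b);         % forward substitution
x = U \ y;             % backward substitution
disp(x)                % should print 2, 1, 1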
6.4 Matrix norm and condition number
Let us consider the two matrices below:

    A = \begin{pmatrix} 1000 & 0.0001 \\ 5000 & 4567 \end{pmatrix},   B = \begin{pmatrix} 0.002 & 0.005 \\ 0.0003 & 0.005 \end{pmatrix}.

They are both 2 x 2 matrices; what we see is that the entries of A are mostly larger (in absolute value) than those of B. We would like to say that A is larger than B, even though the (1, 2) entry of A is actually smaller than the entries of B. For that purpose, we need a concept called the matrix norm. We first have the vector norm:

    ||x|| = \sqrt{ |x_1|^2 + |x_2|^2 + \cdots + |x_n|^2 },   for x = (x_1, x_2, ..., x_n)^T.

The matrix norm of a matrix A is defined as

    ||A|| = \max_{x \ne 0} \frac{||Ax||}{||x||}.

If the size of A is m x n, then the vector x above is a column vector of length n, and Ax is a vector of length m. The matrix norm is defined as the maximum of the ratio between these two vector norms, assuming x is a non-zero vector. Meanwhile, if c is a scalar, we have

    ||cx|| = |c|\,||x||,   ||A(cx)|| = |c|\,||Ax||.

That is,

    \frac{||A(cx)||}{||cx||} = \frac{||Ax||}{||x||}.

This implies that the matrix norm can be defined with the restriction ||x|| = 1. That is,

    ||A|| = \max_{||x||=1} ||Ax||.

Since ||A|| is the maximum of the ratio of ||Ax|| and ||x||, we clearly have

    ||Ax|| \le ||A||\,||x||.   (6.3)
If A is a real symmetric (of course, square) matrix, and the eigenvalues of A are \lambda_j for j = 1, 2, ..., n, satisfying

    |\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n|,

then we can prove that

    ||A|| = |\lambda_1|.

If A is a real m x n matrix, then A^T A and A A^T are real and symmetric. The largest eigenvalues of A^T A and A A^T are actually the same. If we denote the largest eigenvalue of A^T A or A A^T by \lambda_1, then we can show that

    ||A|| = \sqrt{\lambda_1}.
When we solve linear system Ax = b, we sometimes have inaccurate solutions no matter what
method we use. This can be explained by the concept of condition number which is related to the
matrix norm. First we illustrate the problem by the following MATLAB code:
A = hilb(13);
x = ones(13,1);
b = A*x;
A\b
The first line generates a 13 x 13 matrix called the Hilbert matrix. The (i, j) entry of A is

    a_{ij} = \frac{1}{i + j - 1}.

The second line generates a column vector with all elements equal to 1. The third line calculates b = Ax, and the last line solves Ax = b. Of course the exact answer should be x, i.e. a column vector of 1s. But what we get is:
>> A\b
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 1.158544e-19.
ans =
1.0000
0.9997
1.0117
0.8071
2.7244
-8.3249
33.4625
-74.1604
117.9244
-119.7725
80.4201
-29.0922
6.0000
What happens is that the matrix A is very nearly singular. The inverse of A exists, but has very large entries. That is, ||A^{-1}|| is very large. It turns out that the accuracy of the solution has something to do with the condition number of A, defined as

    \kappa(A) = ||A||\,||A^{-1}||.

When \kappa(A) is large, the numerical solution of Ax = b using finite precision floating point operations will not be accurate.

To explain the relationship between accuracy and condition number, we need to compare Ax = b with

    (A + \Delta A) y = b,

where \Delta A (which is not \Delta times A) is a small matrix (i.e., \Delta A has a small matrix norm). In other words, we change the coefficient matrix A a little bit, then try to compare the solutions y and x. We expect y to be close to x, so let

    y = x + \Delta x,

where \Delta x is a vector (with small norm) to be determined. We can write down the perturbed equation as

    Ax + (\Delta A)x + A(\Delta x) + (\Delta A)(\Delta x) = b.

That gives

    (\Delta A)x + A(\Delta x) + (\Delta A)(\Delta x) = 0.

If we consider the three terms on the left hand side, we can see that the third term is much smaller than the other two. Therefore,

    (\Delta A)x + A(\Delta x) \approx 0.

This gives rise to

    \Delta x \approx -A^{-1} (\Delta A) x.

Therefore,

    ||\Delta x|| \approx ||A^{-1} (\Delta A) x||.

From (6.3), we have

    ||A^{-1} (\Delta A) x|| \le ||A^{-1}||\,||(\Delta A)x|| \le ||A^{-1}||\,||\Delta A||\,||x||.

This implies

    \frac{||\Delta x||}{||x||} \le \kappa(A) \frac{||\Delta A||}{||A||}   approximately,

where \kappa(A) = ||A||\,||A^{-1}||. The left hand side above is the relative error of the solution (measured by the vector norm). The right hand side is the relative error of the coefficient matrix (measured by the matrix norm). Therefore, \kappa(A) amplifies the relative error. If \kappa(A) = 10^{12}, then we expect to lose 12 digits in the solution. For double precision calculations, we expect ||\Delta A||/||A|| to be about 10^{-15}; then, if \kappa(A) = 10^{12}, the solution may still have three accurate digits. If \kappa(A) is larger than 10^{15}, the solution will be completely wrong. Now, for the 13 x 13 Hilbert matrix, we have \kappa(A) \approx 4.74 \times 10^{18}. Therefore, the solution of Ax = b in MATLAB is completely wrong.
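The claim can be checked directly in MATLAB (a sketch, not from the notes): cond(A) returns ||A|| ||A^{-1}|| in the 2-norm, and the relative error of the computed solution is indeed of the order of cond(A) times the unit roundoff:

A = hilb(13);
kappa = cond(A);                       % condition number in the 2-norm
x = ones(13,1);
relerr = norm(A\(A*x) - x) / norm(x);  % relative error of the computed solution
fprintf('cond(A) = %.2e, relative error = %.2e\n', kappa, relerr)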
Chapter 7
QR factorization and Least Squares
Problems
If A is an m x n matrix where m > n, the linear system Ax = b usually has no solution, because there are more equations (m equations) than unknowns (n unknowns). However, we can try to determine the minimum of ||Ax - b||. This is the so-called least squares problem. In this chapter, we describe a method to solve the least squares problem. It uses the QR factorization of A, that is A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix. To find the QR factorization, we need Householder reflections.
7.1 The least squares problem
We consider a simple data fitting problem. Suppose an experiment produces some data:

    (s_i, t_i),  i = 1, 2, ..., n.

The relationship between s and t is supposed to be linear,

    t = \alpha s + \beta

for some constants \alpha and \beta. Since the experiment has small perturbations, the data do not perfectly fit the linear formula. What we can do is to find \alpha and \beta such that

    F(\alpha, \beta) = \sum_{i=1}^{n} ( \alpha s_i + \beta - t_i )^2

is as small as possible. This is a very simple example of the least squares problem. Notice that the above is a multi-variable polynomial of degree 2 in \alpha and \beta. Furthermore, we have

    \begin{pmatrix} \alpha s_1 + \beta - t_1 \\ \alpha s_2 + \beta - t_2 \\ \vdots \\ \alpha s_n + \beta - t_n \end{pmatrix} = Ax - b,

where

    A = \begin{pmatrix} s_1 & 1 \\ s_2 & 1 \\ \vdots & \vdots \\ s_n & 1 \end{pmatrix},   x = \begin{pmatrix} \alpha \\ \beta \end{pmatrix},   b = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_n \end{pmatrix}.

Therefore,

    F(\alpha, \beta) = ||Ax - b||^2,

where the norm of a vector is just the Euclidean distance. That is, if z = (z_1, z_2, ..., z_m)^T, then the norm ||z|| is

    ||z|| = \sqrt{ |z_1|^2 + |z_2|^2 + \cdots + |z_m|^2 }.

Usually, we will use column vectors. The vector z above is a column vector, and the superscript T is for the transpose. We also call the above the 2-norm and denote it by ||z||_2, because in general we have the so-called p-norm (for p \ge 1):

    ||z||_p = ( |z_1|^p + |z_2|^p + \cdots + |z_m|^p )^{1/p}.

Therefore, our data fitting problem is

    \min_x ||Ax - b||_2.

This is in fact the general least squares problem in linear algebra.
Next, we consider an approximation problem for functions. For a given function b(t) defined on the interval [c, d], we can try to approximate b(t) by a sum of simpler functions:

    b(t) \approx x_1 a_1(t) + x_2 a_2(t) + \cdots + x_n a_n(t),

where a_1(t), a_2(t), ..., a_n(t) are given simple functions and x_1, x_2, ..., x_n are unknown coefficients. One way to determine the coefficients is to find the minimum

    \min_{x_1, x_2, ..., x_n} \int_c^d [\, x_1 a_1(t) + x_2 a_2(t) + \cdots + x_n a_n(t) - b(t) \,]^2 dt.

This is the least squares problem for function approximation. The integral above is actually a quadratic polynomial (multi-variable) in x_1, x_2, ..., x_n, since b(t), a_1(t), ..., a_n(t) are given and you can expand the square and integrate. To solve the minimization problem, you can use the result from calculus: the partial derivative of the integral with respect to each x_j is zero, and you can solve x_1, x_2, ..., x_n from these conditions.

If we let x = [x_1, x_2, ..., x_n]^T and A(t) = [a_1(t), a_2(t), ..., a_n(t)], then the problem can be written as

    \min_x \int_c^d [\, A(t) x - b(t) \,]^2 dt.

If we imagine that t is discretized with m points, a_j(t) is replaced by a vector of length m, and the integral is approximated by a numerical integration scheme, then the problem is related to

    \min_x ||Ax - b||^2

where b is a column vector of length m, A is an m x n matrix (usually m > n), and x is the unknown vector of length n. This is again the least squares problem in linear algebra.

Notice that we can remove the square in ||Ax - b||^2 and consider

    \min_x ||Ax - b||.   (7.1)

The result should be the same. Here, we assume A is m x n, where m \ge n, b is m x 1 and x is n x 1. If we assume that the columns of A are linearly independent (so the rank of A is n), then the least squares problem can be solved from

    (A^T A) x = A^T b.   (7.2)
I will explain (7.2) by a geometry problem. In 3-D space, consider a plane passing through the origin and the vectors a_1 and a_2 (which are column vectors with three components). An arbitrary point W in the plane can be written as

    x_1 a_1 + x_2 a_2 = [a_1, a_2] \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = Ax,

where x_1 and x_2 are some coefficients. Now, consider a point B specified by the vector b. Then the distance between B and the point W in the plane is

    ||Ax - b||.

To find the shortest distance between B and the plane, we need to solve

    \min_x ||Ax - b||.

This is exactly the least squares problem. On the other hand, we know that if W is the point which gives the shortest distance, then the vector WB = b - Ax must be perpendicular to the plane. Therefore, b - Ax must be perpendicular to both a_1 and a_2. Thus,

    a_1^T (b - Ax) = 0,   a_2^T (b - Ax) = 0.

This is exactly

    A^T (b - Ax) = 0,

and it is the same as (7.2).

In conclusion, the least squares problem is (7.1) and it can be solved from (7.2). What we do in the remainder of this chapter is to find a more efficient method to solve (7.1). Our method also gives more accurate results for a very large A, when floating point operations are used. The method uses the so-called QR factorization of A.
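For the data fitting problem at the beginning of this section, the normal equations (7.2) can be solved in a few lines of MATLAB (a sketch, not from the notes; the data values are made up):

s = [0; 1; 2; 3; 4];
t = [1.1; 2.9; 5.2; 6.8; 9.1];
A = [s, ones(size(s))];
x = (A'*A) \ (A'*t);   % normal equations: x = [alpha; beta]
disp(x)
disp(A\t)              % MATLAB's backslash solves the same least squares problem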
7.2 Householder reection
An orthogonal matrix Q is a real matrix (no complex entries) such that

    Q^{-1} = Q^T.

Of course, this implies that Q^T Q = Q Q^T = I. If Q is m x m, we can write down the columns of Q as q_1, q_2, ..., q_m, that is,

    Q = [q_1, q_2, ..., q_m];

then, from the condition Q^T Q = I, we get

    q_i^T q_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}.

Since ||q_i|| = \sqrt{q_i^T q_i}, we conclude that the column vectors are all unit vectors, i.e. ||q_i|| = 1, and they are orthogonal to each other.
A Householder reflection is a special orthogonal matrix. It allows us to find the so-called QR factorization of a matrix, i.e. A = QR, where Q is orthogonal and R is upper triangular. The QR factorization will be used to solve the least squares problem min ||Ax - b||.

For a given column vector x = (x_1, x_2, ..., x_m)^T, the Householder reflection is an orthogonal matrix H such that

    Hx = \begin{pmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{pmatrix}.

It is easy to find \alpha. Since

    \alpha^2 = ||Hx||^2 = (Hx)^T (Hx) = x^T H^T H x = x^T x = ||x||^2,

therefore

    \alpha = \pm ||x||.

The formula for H is

    H = I - \frac{2}{v^T v} v v^T,   where   v = \begin{pmatrix} x_1 - \alpha \\ x_2 \\ \vdots \\ x_m \end{pmatrix} = \begin{pmatrix} x_1 \mp ||x|| \\ x_2 \\ \vdots \\ x_m \end{pmatrix}.

Instead of giving a proof, we can try to derive the formula for H.

1. Let w be a unit column vector; find the scalar \gamma such that H = I - \gamma w w^T is an orthogonal matrix. Answer: \gamma = 2.

2. Determine w such that

    (I - 2 w w^T) x = \begin{pmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{pmatrix},

where \alpha = \pm ||x||. From

    x - 2 (w^T x) w = \begin{pmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{pmatrix}

we obtain

    \begin{pmatrix} x_1 - \alpha \\ x_2 \\ \vdots \\ x_m \end{pmatrix} = v = 2 (w^T x) w.

Therefore, w is parallel to v. But w is also a unit vector. Therefore,

    w = \frac{v}{||v||}.

This leads to

    H = I - 2 w w^T = I - \frac{2}{v^T v} v v^T.
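A minimal numerical check of the formula for H (not from the notes); the sign of alpha is taken opposite to x(1), which anticipates the remark in the next section about avoiding cancellation:

x = [3; 1; 5; 1];
alpha = -sign(x(1)) * norm(x);
v = x;  v(1) = x(1) - alpha;
H = eye(length(x)) - (2/(v'*v)) * (v*v');
disp(H*x)    % all components except the first should be zero up to rounding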
7.3 QR factorization
We outline the procedure for computing the QR factorization using Householder reflections here. The basic idea is to construct a sequence of Householder reflections such that

    H_1 A = \begin{pmatrix} * & * & \cdots & * \\ 0 & * & \cdots & * \\ \vdots & \vdots & & \vdots \\ 0 & * & \cdots & * \end{pmatrix},
    H_2 H_1 A = \begin{pmatrix} * & * & * & \cdots & * \\ 0 & * & * & \cdots & * \\ 0 & 0 & * & \cdots & * \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & * & \cdots & * \end{pmatrix},

and, if we assume A is m x n with m > n, then

    (H_n H_{n-1} \cdots H_2 H_1) A = R = \begin{pmatrix} * & * & \cdots & * \\ & * & \cdots & * \\ & & \ddots & \vdots \\ & & & * \\ & & & 0 \\ & & & \vdots \end{pmatrix}.

This can be written as A = QR, where

    Q = (H_n \cdots H_2 H_1)^{-1} = H_1 H_2 \cdots H_n,

since each H_i is symmetric and orthogonal. If the 1st column of A is [a_{11}, a_{21}, ..., a_{n1}]^T, we choose

    r_{11} = \pm \sqrt{ a_{11}^2 + a_{21}^2 + \cdots + a_{n1}^2 }

(choose the sign of r_{11} opposite to the sign of a_{11}, explained later), and let

    v_1 = \begin{pmatrix} a_{11} - r_{11} \\ a_{21} \\ \vdots \\ a_{n1} \end{pmatrix}   and   H_1 = I - \frac{2}{v_1^T v_1} v_1 v_1^T.

If the second column of H_1 A (not A!) is [a_{12}, a_{22}, ..., a_{n2}]^T, we choose

    r_{22} = \pm \sqrt{ a_{22}^2 + a_{32}^2 + \cdots + a_{n2}^2 }

(the sign of r_{22} is opposite to that of a_{22}) and let

    v_2 = \begin{pmatrix} 0 \\ a_{22} - r_{22} \\ a_{32} \\ \vdots \\ a_{n2} \end{pmatrix}   and   H_2 = I - \frac{2}{v_2^T v_2} v_2 v_2^T.

Remark: We want to avoid subtraction in a_{11} - r_{11} and a_{22} - r_{22}. Therefore, we choose the sign of r_{11} to be different from the sign of a_{11}, etc. Subtraction of nearly equal numbers leads to a loss of significant digits.

To evaluate H_1 A efficiently, we use the following procedure:

calculate \gamma_0 = 2/(v_1^T v_1);
for j = 2, 3, ..., n
  calculate \gamma_1 = v_1^T w, where w is the j-th column of A,
  set \gamma = \gamma_0 \gamma_1,
  set the j-th column of H_1 A as w - \gamma v_1.

Notice that

    H_1 w = \left( I - \frac{2}{v_1^T v_1} v_1 v_1^T \right) w = w - \gamma_0 v_1 (v_1^T w) = w - \gamma_0 \gamma_1 v_1 = w - \gamma v_1.

The evaluation of \gamma_1 and of w - \gamma v_1 requires 2n operations each. The total number of operations for the 1st step (i.e. H_1 A) is about 4n^2. For the second step, we notice that

    H_2 = \begin{pmatrix} 1 & \\ & \tilde{H}_2 \end{pmatrix}

where

    \tilde{H}_2 = I_{n-1} - \frac{2}{v_2^T v_2} v_2 v_2^T   for   v_2 = \begin{pmatrix} a_{22} - r_{22} \\ a_{32} \\ \vdots \\ a_{n2} \end{pmatrix}.

Therefore, we only need to multiply \tilde{H}_2 with the trailing (n-1) x (n-1) matrix of H_1 A. This requires about 4(n-1)^2 operations. The total number of operations for the QR factorization of an n x n matrix is thus

    4n^2 + 4(n-1)^2 + \cdots + 4 \cdot 2^2 \approx \frac{4}{3} n^3.
7.4 Solving least squares problems
The least squares problem can be solved with a QR factorization of A. Using Householder reflections, we can find

    A = QR

where Q is m x m orthogonal and R is m x n upper triangular. For example, if A is 4 x 3 and b is a vector of length 4, then Q is a 4 x 4 orthogonal matrix and

    R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix} = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ & r_{22} & r_{23} \\ & & r_{33} \\ 0 & 0 & 0 \end{pmatrix}

is a 4 x 3 upper triangular matrix. Now,

    ||Ax - b|| = ||QRx - b|| = ||Rx - Q^T b||

(see the note later). If we still assume A is 4 x 3, we can write the vector Q^T b as

    Q^T b = \begin{pmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix} = \begin{pmatrix} c \\ c_4 \end{pmatrix}   for   c = \begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix};

then

    ||Ax - b|| = \left\| \begin{pmatrix} R_1 x \\ 0 \end{pmatrix} - \begin{pmatrix} c \\ c_4 \end{pmatrix} \right\| = ( ||R_1 x - c||^2 + |c_4|^2 )^{1/2} \ge |c_4|.

Clearly, if R_1 is invertible, we can let

    R_1 x = c;

then ||Ax - b|| reaches the minimum value |c_4|. Since we need the condition that R_1 is invertible, we need to assume that rank(A) = n.

Note: If Q is orthogonal and y is a column vector, we have

    ||Qy|| = \sqrt{ (Qy)^T (Qy) } = \sqrt{ y^T Q^T Q y } = \sqrt{ y^T y } = ||y||.

Now, since QRx - b = Q(Rx - Q^T b), we have

    ||Ax - b|| = ||Rx - Q^T b||.

Remark: To solve the least squares problem, we do not explicitly generate the matrix Q; we have Q^T = H_n \cdots H_2 H_1. Therefore, we can just evaluate Q^T b by applying the sequence of Householder reflections.
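The whole procedure can be checked against MATLAB's built-in qr (a sketch, not from the notes; the 4 x 3 matrix and the vector b are made up):

A = [1 0 1; 1 1 0; 0 1 1; 1 1 1];
b = [1; 2; 3; 4];
[Q, R] = qr(A);            % A = Q*R, with Q 4x4 orthogonal and R 4x3
c = Q' * b;
x = R(1:3, 1:3) \ c(1:3);  % solve R_1 x = c
disp(norm(A*x - b) - abs(c(4)))   % the residual norm equals |c_4|, up to rounding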
Chapter 8
Matrix Eigenvalue Problem
In this chapter, we consider numerical methods for the eigenvalue problem. The eigenvalues are zeros of the polynomial

    p(\lambda) = \det(\lambda I - A).

However, it is a bad idea to calculate this polynomial in finite precision floating point arithmetic. This is explained in section 1. In section 2, we present elementary methods. The last two sections explain the most widely used numerical method for computing eigenvalues and eigenvectors.
8.1 Introduction
Let A be a square matrix; an eigenvalue \lambda is a solution of

    \det(A - \lambda I) = 0.

Corresponding to an eigenvalue \lambda, we have an eigenvector x which is a non-zero solution of

    (A - \lambda I) x = 0,   or   Ax = \lambda x.

The eigenvectors are determined up to an arbitrary constant. We will consider real and symmetric matrices. This chapter briefly describes an efficient algorithm for computing the eigenvalues and eigenvectors of a real symmetric matrix.

From the theoretical side, for a real symmetric matrix A, we have

1. The eigenvectors of two distinct eigenvalues are orthogonal. More precisely, if Ax = \lambda x, Ay = \mu y and \lambda \ne \mu, then

    x^T y = y^T x = 0.

2. If \lambda is an eigenvalue of A with multiplicity m, then there are m linearly independent eigenvectors corresponding to \lambda.

3. Let the n eigenvalues of the n x n real symmetric matrix A be

    \lambda_1, \lambda_2, ..., \lambda_n;

it is possible to choose the corresponding eigenvectors v_1, v_2, ..., v_n to satisfy:

- v_j is a unit eigenvector for all j;
- v_i \perp v_j if i \ne j.

It is important to realize that the matrix of the eigenvectors (as in item 3 above)

    V = [v_1, v_2, ..., v_n]

is orthogonal. You can easily verify that V^T V = I.

(Spectral decomposition) Let A be a real symmetric matrix, \lambda_1, \lambda_2, ..., \lambda_n be the eigenvalues of A, and V be the orthogonal matrix of the corresponding unit eigenvectors; then

    A = V \begin{pmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{pmatrix} V^T.

This is a very important result in linear algebra. The objective of this chapter is to compute the above decomposition.
The characteristic polynomial is

    p(\lambda) = \det(\lambda I - A).

Classical linear algebra textbooks usually suggest that you can find the polynomial

    p(\lambda) = \lambda^n + c_{n-1} \lambda^{n-1} + \cdots + c_1 \lambda + c_0,

then find the zeros of p(\lambda) for the eigenvalues. This works for small (say, 2 x 2 or 3 x 3) matrices. But remember that we are doing numerical computations. For large matrices, we only find approximate values for the polynomial coefficients c_{n-1}, ..., c_1, c_0. What is surprising is that a small error in the polynomial coefficients can lead to a very large error in the zeros of the polynomial.

Wilkinson's example: For n = 20, p(\lambda) = (\lambda - 1)(\lambda - 2) \cdots (\lambda - 20) can be regarded as the characteristic polynomial of the diagonal matrix

    A = diag{1, 2, ..., 20}.

We can write down c_0 and c_{19}:

    p(\lambda) = \lambda^{20} - 210 \lambda^{19} + \cdots + 20!.

Now, let us introduce a small error to the coefficient c_{19}, say \epsilon = 10^{-9}, and consider the polynomial

    \tilde{p}(\lambda) = \lambda^{20} - (210 + \epsilon) \lambda^{19} + \cdots + 20! = (\lambda - 1)(\lambda - 2) \cdots (\lambda - 20) - \epsilon \lambda^{19}.

Now, if we try to solve \tilde{p}(\lambda) = 0, we will get 3 complex conjugate pairs. In particular, the original zeros 15, 16 now become

    15.457790724 \pm 0.899341526 i.

The conclusion is that the zeros of a polynomial are very sensitive to its coefficients. If we only calculate the polynomial coefficients approximately, we cannot have accurate eigenvalues in general. In other words, the approach based on the characteristic polynomial is bad.
8.2 Power, inverse power and Rayleigh quotient iterations
Power iteration: Let A be an n x n matrix and \lambda_1, \lambda_2, ..., \lambda_n be its eigenvalues. Let \lambda_1 be the largest (actually dominant) eigenvalue in absolute value. That is,

    |\lambda_1| > |\lambda_j|  for  j = 2, 3, ..., n.

The power iteration method can be used to calculate the eigenvector corresponding to \lambda_1.

Set x_0 to an arbitrary vector (initial guess);
For k = 1, 2, ...,

    x_k = \frac{A x_{k-1}}{||A x_{k-1}||}.

The basic theory is that for almost any initial vector x_0, the sequence {x_k} converges to a unit eigenvector:

    \lim_{k \to \infty} [\, A x_k - \lambda_1 x_k \,] = 0.

Inverse power iteration: Let A be n x n and non-singular, and \lambda_1, \lambda_2, ..., \lambda_n be the eigenvalues of A. We assume

    0 < |\lambda_1| < |\lambda_j|  for  j \ne 1.

The inverse power iteration method can be used to find the eigenvector corresponding to \lambda_1. It is mathematically equivalent to the power method applied to A^{-1} (the eigenvalues of A^{-1} are 1/\lambda_1, 1/\lambda_2, ..., 1/\lambda_n, with the dominant eigenvalue being 1/\lambda_1). But we do not find A^{-1} first and then use the power iteration method.

Set x_0 to an arbitrary non-zero vector (initial guess).
For k = 1, 2, 3, ..., solve z from

    A z = x_{k-1}

and set x_k as

    x_k = \frac{z}{||z||}.

Notice that

    x_k = \frac{(A^{-1})^k x_0}{||(A^{-1})^k x_0||}.

Rayleigh quotient iteration: If x is an eigenvector of the matrix A, how do you calculate the eigenvalue \lambda (when x is given)? We have

    Ax = \lambda x,   thus   x^T A x = \lambda x^T x.

Therefore,

    \lambda = \frac{x^T A x}{x^T x}.

The term x^T A x / (x^T x) is the Rayleigh quotient. The Rayleigh quotient iteration method is a procedure to calculate an eigenvalue and an eigenvector of a matrix. The basic idea is to apply the inverse power method to A - \beta I, where \beta is the approximate eigenvalue.

Let x_0 = arbitrary initial guess.
For k = 1, 2, 3, ...,

    \beta = \frac{x_{k-1}^T A x_{k-1}}{x_{k-1}^T x_{k-1}},

solve z from

    (A - \beta I) z = x_{k-1}

and set x_k as

    x_k = \frac{z}{||z||}.

The method has very fast convergence. But it is not known which eigenvalue/eigenvector pair it will converge to for an arbitrary initial guess x_0. The method developed in the next few sections allows us to find all the eigenvalues (and eigenvectors) of a matrix in a more systematic way.
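A minimal sketch of the power iteration (not from the notes; the matrix is made up and the number of iterations is fixed rather than tested for convergence):

A = [4 1 0; 1 3 1; 0 1 2];
x = rand(3,1);                 % arbitrary initial guess
for k = 1:50
    x = A*x;  x = x/norm(x);   % one power iteration step
end
lambda = x'*A*x / (x'*x);      % Rayleigh quotient estimate of the eigenvalue
disp(norm(A*x - lambda*x))     % should be small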
8.3 Orthogonal reduction
The eigenvalue problem of a symmetric matrix can be reduced to the eigenvalue problem of a tridiagonal matrix. We use Householder reflections to find a matrix T (symmetric and tridiagonal) and an orthogonal matrix Q such that

    A = Q T Q^T.

We notice that

    \det(\lambda I - A) = \det( Q (\lambda I - T) Q^T ) = \det(Q) \det(\lambda I - T) \det(Q^T) = \det(\lambda I - T).

In fact, from Q Q^T = I, we obtain \det(Q) \det(Q^T) = (\det(Q))^2 = 1, or \det(Q) = \pm 1.

To find the decomposition A = Q T Q^T, we actually find

    Q^T A Q = T,

and Q is obtained as a product of Householder reflections. To illustrate this, we consider the case of a 4 x 4 matrix A:

    A = \begin{pmatrix} * & * & * & * \\ * & * & * & * \\ * & * & * & * \\ * & * & * & * \end{pmatrix}.
The main steps are:

1. Find H_1 (a Householder reflection) such that

    H_1 A H_1^T = \begin{pmatrix} * & * & 0 & 0 \\ * & * & * & * \\ 0 & * & * & * \\ 0 & * & * & * \end{pmatrix}.

2. Find H_2 (a Householder reflection) such that

    H_2 H_1 A H_1^T H_2^T = \begin{pmatrix} * & * & 0 & 0 \\ * & * & * & 0 \\ 0 & * & * & * \\ 0 & 0 & * & * \end{pmatrix} = T.

Now the matrix Q is given by

    Q^T = H_2 H_1,   or   Q = H_1^T H_2^T = H_1 H_2.

The matrix T is also symmetric, since

    T^T = Q^T A^T (Q^T)^T = Q^T A Q = T.

For the matrix H_1, we construct this matrix from a Householder reflection that works on rows 2, 3, 4 (keeping row 1 unchanged). This will produce two zeros (in row 3 and row 4 of the first column). What about H_1^T multiplied from the right side of the matrix? It will work on columns 2, 3 and 4, and it will not change column 1. Let us write down the matrix A as

    A = \begin{pmatrix} a_1 & a_2 & a_3 & a_4 \\ a_2 & * & * & * \\ a_3 & * & * & * \\ a_4 & * & * & * \end{pmatrix}.
We construct the 3 x 3 Householder reflection \tilde{H}_1 such that

    \tilde{H}_1 \begin{pmatrix} a_2 \\ a_3 \\ a_4 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ 0 \\ 0 \end{pmatrix}

where \beta_1 = \pm \sqrt{a_2^2 + a_3^2 + a_4^2} and

    \tilde{H}_1 = I - \frac{2}{v_1^T v_1} v_1 v_1^T,   for   v_1 = \begin{pmatrix} a_2 - \beta_1 \\ a_3 \\ a_4 \end{pmatrix} = \begin{pmatrix} a_2 \mp \sqrt{a_2^2 + a_3^2 + a_4^2} \\ a_3 \\ a_4 \end{pmatrix}.

Now the 4 x 4 matrix H_1 is given by

    H_1 = \begin{pmatrix} 1 & \\ & \tilde{H}_1 \end{pmatrix}.

We then have

    H_1 A H_1^T = \begin{pmatrix} a_1 & \beta_1 & 0 & 0 \\ \beta_1 & * & * & * \\ 0 & * & * & * \\ 0 & * & * & * \end{pmatrix}.

Let us assume that H_1 A H_1^T can be further written as

    H_1 A H_1^T = \begin{pmatrix} a_1 & \beta_1 & 0 & 0 \\ \beta_1 & b_2 & b_3 & b_4 \\ 0 & b_3 & * & * \\ 0 & b_4 & * & * \end{pmatrix}.

Now we find H_2 based on the Householder reflection for rows 3 and 4. We have

    H_2 H_1 A H_1^T H_2^T = \begin{pmatrix} a_1 & \beta_1 & 0 & 0 \\ \beta_1 & b_2 & \beta_2 & 0 \\ 0 & \beta_2 & * & * \\ 0 & 0 & * & * \end{pmatrix},

where \beta_2 = \pm \sqrt{b_3^2 + b_4^2} and

    H_2 = \begin{pmatrix} 1 & & \\ & 1 & \\ & & \tilde{H}_2 \end{pmatrix},   \tilde{H}_2 = I - \frac{2}{v_2^T v_2} v_2 v_2^T,   v_2 = \begin{pmatrix} b_3 - \beta_2 \\ b_4 \end{pmatrix} = \begin{pmatrix} b_3 \mp \sqrt{b_3^2 + b_4^2} \\ b_4 \end{pmatrix}.

Next, we give some details for an efficient implementation of this tridiagonalization process. Let us denote the matrix A by

    A = \begin{pmatrix} a_1 & c_1^T \\ c_1 & \tilde{A} \end{pmatrix}

where c_1^T = (a_2, a_3, a_4), and

    H_1 A H_1^T = \begin{pmatrix} a_1 & d_1^T \\ d_1 & A_1 \end{pmatrix}

where d_1^T = (\beta_1, 0, 0). The matrix \tilde{A} is the given 3 x 3 matrix obtained from A by retaining the last three rows and columns. The matrix A_1 is what we need to calculate. They are related by

    A_1 = \tilde{H}_1 \tilde{A} \tilde{H}_1^T.

Since \tilde{H}_1 has a special simple form, we can efficiently evaluate A_1. Let \gamma = 2/(v_1^T v_1); we have \tilde{H}_1 = I - \gamma v_1 v_1^T and

    A_1 = (I - \gamma v_1 v_1^T) \tilde{A} (I - \gamma v_1 v_1^T) = \tilde{A} - \gamma \tilde{A} v_1 v_1^T - \gamma v_1 v_1^T \tilde{A} + \gamma^2 v_1 v_1^T \tilde{A} v_1 v_1^T.

This can be written as

    A_1 = \tilde{A} + g v_1^T + v_1 g^T

for

    g = -\gamma u + \frac{\gamma^2 (v_1^T u)}{2} v_1,   where   u = \tilde{A} v_1.

Thus, the evaluation of A_1 follows these steps:

    u = \tilde{A} v_1,
    \alpha = v_1^T u,
    g = -\gamma u + \frac{\gamma^2 \alpha}{2} v_1,
    A_1 = \tilde{A} + g v_1^T + v_1 g^T.

For a general n x n matrix, the first step and the last step require about 2n^2 operations each. Each element of A_1 requires 4 operations to calculate, but A_1 is a symmetric matrix and only about n^2/2 entries need to be calculated. The 2nd and 3rd steps require O(n) operations. The total number of operations required for evaluating H_1 A H_1^T is around 4n^2. The whole process of reduction to a tridiagonal matrix requires about \frac{4}{3} n^3 operations.
8.4 The QR algorithm
The QR algorithm is a widely used method for calculating the eigenvalues and eigenvectors. Since a general real symmetric matrix A can be reduced to a symmetric tridiagonal matrix T, we concentrate on the eigenvalue problem of T here. The basic idea is the following transformation:

    QR = T - sI,   \tilde{T} = sI + RQ.

For a given tridiagonal matrix T, we choose a real number s, then find the QR factorization of the matrix T - sI. This gives T = sI + QR, but we calculate the new matrix \tilde{T} = sI + RQ. Since in general QR \ne RQ, we expect \tilde{T} \ne T. However, we have

- \tilde{T} is also symmetric tridiagonal;
- \tilde{T} has the same eigenvalues as T.

Since \tilde{T} = sI + Q^T QRQ = Q^T (sI + QR) Q = Q^T T Q, we obtain the symmetry \tilde{T}^T = \tilde{T} and the fact of the same eigenvalues (see the similar arguments in the previous section for A and T). The fact that \tilde{T} is also tridiagonal can be proved.

If we denote the original matrix T (obtained in the previous section through the reduction from A) by T_0 and \tilde{T} by T_1, then we can repeat the above transformation for T_1 to obtain T_2, etc. That is, for k = 0, 1, 2, ...,

find a real number s_k (somehow), and find the QR factorization of T_k - s_k I:

    QR = T_k - s_k I;

calculate T_{k+1}:

    T_{k+1} = s_k I + RQ.

The hope is that

    T_k \to \begin{pmatrix} \tilde{T}_0 & 0 \\ 0 \; \cdots \; 0 & \lambda_1 \end{pmatrix}

as k \to \infty, where \lambda_1 is an eigenvalue of A (also of T_0, T_1, ...), but it is not necessarily the largest or smallest eigenvalue. More precisely, we hope the (n, n) entry of T_k converges to an eigenvalue and the (n-1, n) entry (which is the same as the (n, n-1) entry) converges to 0. If this works out, we find one eigenvalue \lambda_1 and then we continue with the (n-1) x (n-1) matrix \tilde{T}_0.

The number s (or s_k) is called the shift. One possible choice is to choose s_k as the (n, n) entry of T_k. The Wilkinson shift is to choose s_k as the eigenvalue of the trailing 2 x 2 matrix of T_k which is closer to the (n, n) entry. More precisely, we denote

    T_k = \begin{pmatrix} \alpha_1^{(k)} & \beta_1^{(k)} & & \\ \beta_1^{(k)} & \ddots & \ddots & \\ & \ddots & \ddots & \beta_{n-1}^{(k)} \\ & & \beta_{n-1}^{(k)} & \alpha_n^{(k)} \end{pmatrix}.

Thus, s_k is an eigenvalue of \begin{pmatrix} \alpha_{n-1}^{(k)} & \beta_{n-1}^{(k)} \\ \beta_{n-1}^{(k)} & \alpha_n^{(k)} \end{pmatrix}. Since this matrix has two eigenvalues, we choose s_k to be the eigenvalue that is closer to \alpha_n^{(k)}.
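One shifted QR step is easy to try in MATLAB (a sketch, not from the notes; it uses the simple (n, n) shift rather than the Wilkinson shift, and the small tridiagonal matrix is made up):

T = diag([4 3 2 1]) + diag([1 1 1], 1) + diag([1 1 1], -1);
for k = 1:20
    s = T(end, end);                  % simple shift
    [Q, R] = qr(T - s*eye(size(T)));
    T = s*eye(size(T)) + R*Q;         % same eigenvalues, still tridiagonal
end
disp(T(end, end))       % approximates one eigenvalue of the original T
disp(T(end, end-1))     % this off-diagonal entry should be tiny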
Chapter 9
Runge-Kutta Methods
9.1 Introduction
In this chapter, we study numerical methods for initial value problems (IVP) of ordinary differential equations (ODE). The first step is to re-formulate your ODE as a system of first order ODEs:

    \frac{dy}{dt} = f(t, y)   for t > t_0   (9.1)

with the initial condition

    y(t_0) = y_0   (9.2)

where t is the independent variable, y = y(t) is the unknown function of t, y_0 is the given initial condition, and f is a given function of t and y which describes the differential equation. Higher order differential equations can also be written as a first order system by introducing the derivatives as new functions. Our numerical methods can be used to solve any ordinary differential equations. We only need to specify the function f.

The variable t is discretized, say t_j for j = 0, 1, 2, ...; then we determine y_j \approx y(t_j) for j = 1, 2, 3, .... We will only consider one-step methods. If y_j is calculated, then we construct y_{j+1} from y_j. Previous values such as y_{j-1} are not needed. Since this is an IVP, at the first step we have only y_0 at t_0, and we can then find y_1, y_2, ..., in a sequence. A higher order method gives a more accurate numerical solution than a lower order method for a fixed step size. But a higher order one-step method requires more evaluations of the f function. For example, the first order Euler's method requires only one evaluation of f, i.e., f(t_j, y_j), but a fourth order Runge-Kutta method requires four evaluations of f.
Consider the following example. We have the following differential equation for u = u(t):

    u''' + \sin(t) \sqrt{1 + (u'')^2}\, u' + \frac{u}{1 + e^{-t}} = t^2   (9.3)

for t > 0, with the initial conditions

    u(0) = 1,   u'(0) = 2,   u''(0) = 3.   (9.4)

We can introduce a vector y,

    y(t) = \begin{pmatrix} u(t) \\ u'(t) \\ u''(t) \end{pmatrix},

and write down the equation for y as

    y' = f(t, y) = \begin{pmatrix} u' \\ u'' \\ -\sin(t) \sqrt{1 + (u'')^2}\, u' - u/(1 + e^{-t}) + t^2 \end{pmatrix}.
The initial condition is y(0) = [1, 2, 3]^T. Here is a simple MATLAB program for the above function f.
function k = f(t, y)
% remember y is a column vector of three components.
k = zeros(3,1);
k(1) = y(2);
k(2) = y(3);
k(3) = -sin(t) * sqrt(1+y(3)^2) * y(2) - y(1)/(1 + exp(-t)) + t^2;
In the MATLAB program, y(1), y(2), y(3) are the three components of the vector y. They are u(t), u'(t) and u''(t), respectively. They are different from y_1, y_2 and y_3, which are the vectors y evaluated at the time levels t_1, t_2 and t_3. Notice that we also have y_0, which is the initial value of y; but we do not have y(0), since MATLAB indices start from 1. Anyway, the components of y are only used inside the MATLAB programs.

A numerical method is usually given for the general system (9.1)-(9.2). We specify the system of ODEs by writing a program for the function f; then the same numerical method can be easily used for solving many different differential equations.
9.2 Euler and Runge-Kutta Methods
Numerical methods start with a discretization of t by t_0, t_1, t_2, ..., say

    t_j = t_0 + jh,

where h is the step size. Numerical methods are formulas for y_1, y_2, y_3, ..., where y_j is the approximate solution at t_j. We use y(t_j) to denote the (unknown) exact solution, thus

    y_j \approx y(t_j).

Please notice that when y is a vector, y_1, y_2, ..., are also vectors. In particular, y_1 is not the first component of the y vector, and y_2 is not the second component of the y vector. The components of y are only explicitly given inside the MATLAB programs as y(1), y(2), etc.
Euler's method:

    y_{j+1} = y_j + h f(t_j, y_j).   (9.5)

Since y_0 is the known initial condition, the above formula allows us to find y_1, y_2, etc., in a sequence. Euler's method can be easily derived as follows. First, we assume h is small and consider the Taylor expansion:

    y(t_1) = y(t_0 + h) = y(t_0) + h y'(t_0) + ...

Now, we know that y'(t_0) = f(t_0, y(t_0)). If we keep only the first two terms of the Taylor series, we obtain the first step of Euler's method:

    y_1 = y_0 + h f(t_0, y_0),

where y(t_1) is replaced by the numerical solution y_1, etc. The general step from t_j to t_{j+1} is similar. Here is a MATLAB program for Euler's method:
function y1 = eulerstep(h, t0, y0)
% This is one step of Euler's method. It is
% given for the first step, but any other step
% is just the same. You need the MATLAB function
% f to specify the system of ODEs.
y1 = y0 + h * f(t0, y0);
Now, let us solve (9.3)-(9.4) from t = 0 to t = 1 with the step size h = 0.01. For this purpose, we need to write a main program. In the main program, we specify the initial conditions, the initial time t_0, the final time and the total number of steps. The step size can then be calculated. Here is the MATLAB program.

% The main program to solve (9.3)-(9.4) from t=0 to
% t = 1 by Euler's method.
% initial time
t0 = 0;
% final time
tfinal = 1;
% number of steps
nsteps = 100;
% step size
h = (tfinal - t0)/ nsteps;
% initial conditions (a column vector, consistent with f)
y = [1; 2; 3];
% set the variable t.
t = t0;
% go through the steps.
for j= 1 : nsteps
  y = eulerstep(h, t, y);
  t = t + h;
  % saved output for u(t) only, i.e. the first component of y.
  tout(j) = t;
  u(j) = y(1);
end
% draw a figure for the solution u.
plot(tout, u)

Now, inside MATLAB, in a folder containing the three programs f.m, eulerstep.m and eulermain.m, if we type eulermain, we will see a solution curve. That is the solid curve in Fig. 9.1. This is for the case of h = 0.01. We also want to see what happens if h is 0.2 or 0.1. For this purpose, we change nsteps to 5 and 10, then use plot(tout, u, 'o') and plot(tout, u, '+') to show the results. All three plots are shown in Fig. 9.1.
Euler's method is not very accurate. To obtain a numerical solution with an acceptable accuracy, we have to use a very small step size h. A small step size h implies a larger number of steps, thus more computing time. It is desirable to develop methods that are more accurate than Euler's method. If we look at the Taylor series again, we have

    y(t_1) = y(t_0 + h) = y(t_0) + h y'(t_0) + \frac{h^2}{2} y''(t_0) + \frac{h^3}{6} y'''(t_0) + ...

This can be written as

    \frac{y(t_1) - y(t_0)}{h} = y'(t_0) + \frac{h}{2} y''(t_0) + \frac{h^2}{6} y'''(t_0) + ...   (9.6)

Actually, the right hand side is a more accurate approximation of y'(t_0 + h/2), since

    y'\!\left(t_0 + \frac{h}{2}\right) = y'(t_0) + \frac{h}{2} y''(t_0) + \frac{h^2}{8} y'''(t_0) + ...
[Figure: numerical solutions by Euler's method using h = 0.01, 0.1, 0.2; horizontal axis t from 0 to 1, vertical axis u(t) from 1 to 5.]

Figure 9.1: Numerical solutions of (9.3) and (9.4) by Euler's method. The solid curve is for h = 0.01. The '+' is for h = 0.1 and the 'o' is for h = 0.2.
The first two terms on the right hand sides of the above two equations are identical, although the third terms involving y'''(t_0) are different. Thus,

    \frac{y(t_1) - y(t_0)}{h} \approx y'\!\left(t_0 + \frac{h}{2}\right) = f\!\left( t_0 + \frac{h}{2},\, y\!\left(t_0 + \frac{h}{2}\right) \right).

The right hand side now involves y(t_0 + h/2). Of course, this is not known, because we only have y(t_0). The idea is that we can use Euler's method (with half step size h/2) to get an approximate y(t_0 + h/2), then use the above to get an approximation of y(t_1). The Euler approximation for y(t_0 + h/2) is y(t_0) + (h/2) f(t_0, y_0). Therefore, we have

    k_1 = f(t_0, y_0)   (9.7)
    k_2 = f\!\left( t_0 + \frac{h}{2},\, y_0 + \frac{h}{2} k_1 \right)   (9.8)
    y_1 = y_0 + h k_2.   (9.9)

This is the first step of the so-called midpoint method. The general step is obtained by simply replacing t_0, y_0 and y_1 by t_j, y_j and y_{j+1}, respectively.

The right hand side of (9.6) can also be approximated by (y'(t_0) + y'(t_1))/2, because

    \frac{y'(t_0) + y'(t_1)}{2} = y'(t_0) + \frac{h}{2} y''(t_0) + \frac{h^2}{4} y'''(t_0) + ...

Therefore, we have

    \frac{y(t_1) - y(t_0)}{h} \approx \frac{y'(t_0) + y'(t_1)}{2}.

We can replace y'(t_0) and y'(t_1) by f(t_0, y(t_0)) and f(t_1, y(t_1)); but of course, we do not know y(t_1), because that is what we are trying to solve. But we can use Euler's method to get the first approximation of y(t_1) and use it in f(t_1, y(t_1)), then use the above to get the second (and better) approximation of y(t_1). This can be summarized as

    k_1 = f(t_0, y_0)   (9.10)
    k_2 = f(t_0 + h, y_0 + h k_1)   (9.11)
    y_1 = y_0 + \frac{h}{2} (k_1 + k_2).   (9.12)

This is the first step of the so-called modified Euler's method. The general step from t_j to t_{j+1} is easily obtained by replacing the subscripts 0 and 1 by j and j + 1, respectively.
Similarly, the right hand side of (9.6) can be approximated by

    A y'(t_0) + B y'(t_0 + \theta h),

where \theta is a given constant, 0 < \theta \le 1, and the coefficients A and B can be determined such that the above matches the first two terms of the right hand side of (9.6). We obtain

    A = 1 - \frac{1}{2\theta},   B = \frac{1}{2\theta}.

Then y'(t_0 + \theta h) = f(t_0 + \theta h, y(t_0 + \theta h)), and we use Euler's method to approximate y(t_0 + \theta h). That is,

    y(t_0 + \theta h) \approx y(t_0) + \theta h f(t_0, y(t_0)).

Finally, we obtain the following general 2nd order Runge-Kutta methods:

    k_1 = f(t_j, y_j)   (9.13)
    k_2 = f(t_j + \theta h, y_j + \theta h k_1)   (9.14)
    y_{j+1} = y_j + h \left[ \left( 1 - \frac{1}{2\theta} \right) k_1 + \frac{1}{2\theta} k_2 \right]   (9.15)

Since \theta is an arbitrary parameter, there are infinitely many 2nd order Runge-Kutta methods. The midpoint method and the modified Euler's method correspond to \theta = 1/2 and \theta = 1, respectively. In this formula, k_1 and k_2 are temporary variables; they are different for different steps.
There are many other Runge-Kutta methods (3rd order, 4th order and higher order). The following classical 4th order Runge-Kutta method is widely used, because it is quite easy to remember:

    k_1 = f(t_j, y_j)   (9.16)
    k_2 = f\!\left( t_j + \frac{h}{2},\, y_j + \frac{h}{2} k_1 \right)   (9.17)
    k_3 = f\!\left( t_j + \frac{h}{2},\, y_j + \frac{h}{2} k_2 \right)   (9.18)
    k_4 = f(t_j + h, y_j + h k_3)   (9.19)
    y_{j+1} = y_j + \frac{h}{6} (k_1 + 2k_2 + 2k_3 + k_4)   (9.20)

We have mentioned the order of a method above. This concept will be explained in the next section.
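For completeness, one step of (9.16)-(9.20) can be written in the same style as eulerstep and the midptstep function given next (a sketch, not from the notes; rk4step is a hypothetical file name):

function y1 = rk4step(h, t0, y0)
% One step of the classical 4th order Runge-Kutta method (9.16)-(9.20).
% You need the MATLAB function f to specify the system of ODEs.
k1 = f(t0, y0);
k2 = f(t0 + h/2, y0 + (h/2)*k1);
k3 = f(t0 + h/2, y0 + (h/2)*k2);
k4 = f(t0 + h,   y0 + h*k3);
y1 = y0 + (h/6)*(k1 + 2*k2 + 2*k3 + k4);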
Next, we consider a MATLAB implementation of the midpoint method. For this purpose, we write the following function called midptstep, which is saved in the file midptstep.m.

function y1 = midptstep(h, t0, y0)
% This is the midpoint method (one of the second order Runge-Kutta methods).
% It is given for the first step, but any other step is just the same.
% You need the MATLAB function f to specify the system of ODEs.
k1 = f(t0, y0);
k2 = f(t0 + h/2, y0 + (h/2)*k1);
y1 = y0 + h*k2;

To solve the same differential equation (9.3)-(9.4), we need the earlier MATLAB function f and a main program. We can write a main program by copying the main program eulermain for Euler's method. The new main program midptmain is different from eulermain only in one line. The line y = eulerstep(h, t, y) is now replaced by

y = midptstep(h, t, y)

You can see that writing a program for a new method is very easy, since we have separated the differential equation (in f.m) and the numerical method (in eulerstep.m or midptstep.m) from the main program. In Fig. 9.2, we show the numerical solution u(t) for (9.3)-(9.4) calculated by the midpoint method with
[Figure: numerical solutions by the midpoint method for h = 0.01 and h = 0.2; horizontal axis t from 0 to 1, vertical axis u(t) from 1 to 5.]

Figure 9.2: Numerical solutions by the midpoint method. The solid curve is for h = 0.01. The 'o' is for h = 0.2.
h = 0.01 and h = 0.2. You can see that the midpoint solution obtained with h = 0.2 is much more accurate than the Euler solution with the same h.
9.3 Local truncation error and order
When a numerical method is used to solve a differential equation, we want to know how accurate the numerical solution is. We will denote the exact solution by y(t); thus y(t_j) is the exact solution at t_j. The numerical solution at t_j is denoted by y_j. Therefore, we are interested in the following error:

    e_j = | y(t_j) - y_j |.

We do not expect to be able to know e_j exactly, because we do not have the exact solution in general. Therefore, we will be happy to have some estimates (such as approximate formulas or inequalities) for e_j. However, even this is not so easy. The reason is that the error accumulates. Let us look at the steps. We start with y_0 = y(t_0), which is exact; then we calculate y_1, which approximates y(t_1); then we calculate y_2, which approximates y(t_2), etc. Notice that when we calculate y_2, we use y_1, not y(t_1). The numerical solution y_1 has some error, and this error will influence y_2. Therefore, the error e_2 depends on e_1. Similarly, the error at the third step, i.e., e_3, depends on the error at step 2, etc. As a result, it is rather difficult to estimate e_j.
The numerical methods given in the previous sections can be written in the following general form:

y_{j+1} = \Phi(t_j, h, y_j),  (9.21)

where \Phi is some function related to the function f which defines the differential equation. For example, Euler's method is

\Phi(t_j, h, y_j) = y_j + h f(t_j, y_j).

The midpoint method is

\Phi(t_j, h, y_j) = y_j + h f\left( t_j + \frac{h}{2}, \; y_j + \frac{h}{2} f(t_j, y_j) \right).
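To make the common structure (9.21) concrete, the one-step maps of Euler's method and the midpoint method can be written as MATLAB anonymous functions. This is only an illustrative sketch; f is assumed to be a function (or function handle) for the right-hand side of the ODE.

% One-step maps Phi(t, h, y) for Euler's method and the midpoint method,
% assuming f(t, y) evaluates the right-hand side of the ODE.
PhiEuler    = @(t, h, y) y + h*f(t, y);
PhiMidpoint = @(t, h, y) y + h*f(t + h/2, y + (h/2)*f(t, y));
% One step of either method then reads y_{j+1} = Phi(t_j, h, y_j), e.g.
% y1 = PhiMidpoint(t0, h, y0);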
If we have the exact solution, we can put the exact solution y(t) into (9.21). That is, we replace y_j and y_{j+1} by y(t_j) and y(t_{j+1}) in (9.21). When this is done, the two sides of (9.21) will not be equal, so we should consider

T_{j+1} = y(t_{j+1}) - \Phi(t_j, h, y(t_j)).  (9.22)

The above T_{j+1} is the so-called local truncation error. If we know the exact solution y(t), then we can calculate T_{j+1}. In reality, we do not know the exact solution, but we can understand how T_{j+1} depends on the step size h by studying the Taylor series of T_{j+1}. We are interested in the local truncation error because it can be estimated and it gives information on the true error. Therefore, we will try to work out a Taylor series for T_{j+1} at t_j, assuming h is small. In fact, we only need to calculate the first non-zero term of the Taylor series:
T_{j+1} = C h^{p+1} + \ldots,

where the integer p is the order of the method and C is a coefficient that depends on t_j, y(t_j), y'(t_j), f(t_j, y(t_j)), etc. But C does not depend on the step size h. The above formula for T_{j+1} gives us information on how T_{j+1} varies with the step size h. Because h is supposed to be small, we notice that a larger p implies that |T_{j+1}| will be smaller. Therefore, the method is more accurate if p is larger. We notice that |T_1| = e_1, because y_0 = y(t_0), so that y_1 = \Phi(t_0, h, y_0) = \Phi(t_0, h, y(t_0)). However, it is clear that |T_j| \neq e_j for j > 1.
When we try to work out the first non-zero term of the Taylor series of T_{j+1}, we work on the general equation (9.1). This is for the local truncation error at t_{j+1}. But the general case at t_{j+1} has no real difference from the special case at t_1. If we work out the Taylor series for T_1, we automatically know the result for T_{j+1}. The integer p (that is, the order of the method) is the same. In the coefficient C, we just need to replace t_0, y(t_0), f(t_0, y(t_0)), ... by t_j, y(t_j), f(t_j, y(t_j)), ...
Now, let us work out the local truncation error of Euler's method. The method is y_{j+1} = y_j + h f(t_j, y_j) = \Phi(t_j, h, y_j). Thus,

T_1 = y(t_1) - \Phi(t_0, h, y(t_0)) = y(t_1) - y_1.

We have a Taylor expansion for y(t_1) at t_0:

y(t_1) = y(t_0) + h y'(t_0) + \frac{h^2}{2} y''(t_0) + \ldots

Notice that y'(t_0) = f(t_0, y(t_0)). Therefore,

T_1 = \frac{h^2}{2} y''(t_0) + \ldots

The power of h is p + 1 for p = 1. Therefore, Euler's method is a first order method.
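This behaviour of T_1 can be checked numerically on a problem with a known solution. As a small sketch (our own test problem, not from the notes), take y' = y, y(0) = 1, whose exact solution is y(t) = e^t; one Euler step from t_0 = 0 gives y_1 = 1 + h, so T_1 = e^h - (1 + h), and T_1/h^2 should approach y''(0)/2 = 1/2 as h decreases.

% Check the local truncation error of Euler's method for y' = y, y(0) = 1.
% One Euler step from t0 = 0 gives y1 = 1 + h, so T1 = exp(h) - (1 + h),
% and T1/h^2 should tend to y''(0)/2 = 0.5 as h -> 0.
for h = [0.1, 0.05, 0.025, 0.0125]
   T1 = exp(h) - (1 + h);
   fprintf('h = %8.4f   T1 = %12.4e   T1/h^2 = %8.5f\n', h, T1, T1/h^2);
end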
We can show that the local truncation error of the general 2nd order Runge-Kutta methods is

T_1 = \frac{h^3}{4} \left[ \left( \frac{2}{3} - \alpha \right) y''' + \alpha\, y'' f_y \right]_{t=t_0} + \ldots

As an example, we prove the result for the midpoint method (\alpha = 1/2). The local truncation error is

T_1 = h^3 \left[ \frac{1}{24} y''' + \frac{1}{8} y'' f_y \right]_{t=t_0} + O(h^4).
Proof: Since the differential equation is y' = f(t, y), we use the chain rule and obtain

y'' = f_t + f_y y' = f_t + f f_y,
y''' = f_{tt} + f_{ty} y' + [f]' f_y + f [f_y]'
     = f_{tt} + f f_{ty} + [f_t + f_y y'] f_y + f [f_{ty} + f_{yy} y']
     = f_{tt} + 2 f f_{ty} + f^2 f_{yy} + [f_t + f f_y] f_y
     = f_{tt} + 2 f f_{ty} + f^2 f_{yy} + y'' f_y.
Now, for y_1 using the midpoint method, we have

k_1 = f(t_0, y_0) = y'(t_0),
k_2 = f\left( t_0 + \frac{h}{2}, \; y_0 + \frac{h}{2} k_1 \right) = f\left( t_0 + \frac{h}{2}, \; y_0 + \frac{h}{2} y'(t_0) \right).
Now, we need the Taylor expansion for functions of two variables. In general, we have

f(t + \delta, y + \eta) = f(t, y) + \delta f_t(t, y) + \eta f_y(t, y)
   + \frac{\delta^2}{2} f_{tt}(t, y) + \delta \eta f_{ty}(t, y) + \frac{\eta^2}{2} f_{yy}(t, y) + \ldots
Now, for k_2, apply the above Taylor formula (with \delta = h/2 and \eta = (h/2) y') and use f to denote f(t_0, y_0) = y'(t_0); we have

k_2 = f + \frac{h}{2} f_t + \frac{h}{2} y' f_y + \frac{h^2}{8} f_{tt} + \frac{h^2 y'}{4} f_{ty} + \frac{h^2 (y')^2}{8} f_{yy} + O(h^3)
    = y' + \frac{h}{2} y'' + \frac{h^2}{8} \left[ y''' - y'' f_y \right] + O(h^3).

Here y, f and their derivatives are all evaluated at t_0. Notice that y(t_0) = y_0. Therefore,

y_1 = y + h k_2 = y + h y' + \frac{h^2}{2} y'' + \frac{h^3}{8} \left[ y''' - y'' f_y \right] + O(h^4).
Using the Taylor expansion

y(t_1) = y(t_0 + h) = y + h y' + \frac{h^2}{2} y'' + \frac{h^3}{6} y''' + O(h^4)

and the definition of T_1, we obtain

T_1 = \frac{h^3}{6} y''' - \frac{h^3}{8} \left[ y''' - y'' f_y \right] + O(h^4) = h^3 \left[ \frac{1}{24} y''' + \frac{1}{8} y'' f_y \right] + O(h^4).
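The same kind of numerical check as for Euler's method can be made here. For the test problem y' = y, y(0) = 1 (our own choice), we have f_y = 1 and y'' = y''' = e^t, so the formula above predicts T_1 ≈ h^3(1/24 + 1/8) = h^3/6; halving h should therefore reduce |T_1| by roughly a factor of 2^3 = 8. A minimal sketch:

% Check the local truncation error of the midpoint method for y' = y, y(0) = 1.
% One midpoint step from t0 = 0 gives y1 = 1 + h + h^2/2, so
% T1 = exp(h) - (1 + h + h^2/2), and T1/h^3 should tend to 1/6.
for h = [0.1, 0.05, 0.025, 0.0125]
   T1 = exp(h) - (1 + h + h^2/2);
   fprintf('h = %8.4f   T1 = %12.4e   T1/h^3 = %8.5f\n', h, T1, T1/h^3);
end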
9.4 An example
We finish this chapter with an example. Let us consider the following system of ODEs:

\frac{da}{dt} = bc - \epsilon a,  (9.23)
\frac{db}{dt} = 1 - ac - \epsilon b,  (9.24)
\frac{dc}{dt} = \frac{\sigma}{\sigma_1} \frac{da}{dt} + \epsilon \sigma (a - c),  (9.25)

where \sigma, \sigma_1 and \epsilon are constants. In the following, we consider \sigma = 4, \sigma_1 = 10 and \epsilon = 1/\sqrt{100000}. We use the classical fourth order Runge-Kutta method to solve this system. The main program is
% main program for a system of ODEs
%
global sigma sigma1 epsilon
sigma = 4;
sigma1 = 10;
epsilon = 1/sqrt(1.0E5);
%
% initial values for [a, b, c]
y = [0.251473082592200 -4.845221747238916 0.068458182058373];
t0 = 0;
tfinal = 400;
steps = 20000;
h = (tfinal-t0)/steps;
t = t0;
for j = 1:steps
  y = rk4step(h,t,y);
  t = t0 + j*h;
  tt(j) = t;
  a(j) = y(1);
  b(j) = y(2);
  c(j) = y(3);
end
subplot(2,1,1), plot(tt,a), xlabel('t'), ylabel('a')
subplot(2,1,2), plot(c,b), xlabel('c'), ylabel('b')
The 4th order Runge-Kutta method is in rk4step.m:
function y = rk4step(h, t, y)
%
% One step of the 4th order classical Runge-Kutta method
%
k1 = f(t, y);
k2 = f(t+0.5*h, y+0.5*h*k1);
k3 = f(t+0.5*h, y+0.5*h*k2);
k4 = f(t+h, y + h*k3);
y = y + (h/6)*(k1 + 2*k2 + 2*k3 + k4);
The ODE system is given in f.m:
function k=f(t,y)
%
% Define the ODE system. Three global parameters:
%
global sigma sigma1 epsilon
%
k(1) = y(2)*y(3) - epsilon*y(1);
k(2) = 1 - y(1)*y(3) - epsilon*y(2);
k(3) = (sigma/sigma1)*k(1)+epsilon*sigma*(y(1)-y(3));
The program produces the following two plots.
(Two plots: a versus t in the top panel, and b versus c in the bottom panel.)