
Appendix D

Matrix calculus
From too much study, and from extreme passion, cometh madnesse.
Isaac Newton [168, 5]

D.1 Directional derivative, Taylor series

D.1.1 Gradients

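Gradient of a differentiable real function $f(x) : \mathbb{R}^K \to \mathbb{R}$ with respect to its vector argument is defined uniquely in terms of partial derivatives

\[
\nabla f(x) \triangleq
\begin{bmatrix}
\dfrac{\partial f(x)}{\partial x_1} \\[6pt]
\dfrac{\partial f(x)}{\partial x_2} \\
\vdots \\
\dfrac{\partial f(x)}{\partial x_K}
\end{bmatrix} \in \mathbb{R}^K
\tag{1860}
\]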

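while the second-order gradient of the twice differentiable real function with respect to its vector argument is traditionally called the Hessian;

\[
\nabla^2 f(x) \triangleq
\begin{bmatrix}
\dfrac{\partial^2 f(x)}{\partial x_1^2} & \dfrac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_1 \partial x_K} \\[6pt]
\dfrac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_2 \partial x_K} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 f(x)}{\partial x_K \partial x_1} & \dfrac{\partial^2 f(x)}{\partial x_K \partial x_2} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_K^2}
\end{bmatrix} \in \mathbb{S}^K
\tag{1861}
\]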

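The gradient of vector-valued function $v(x) : \mathbb{R} \to \mathbb{R}^N$ on real domain is a row vector

\[
\nabla v(x) \triangleq
\begin{bmatrix}
\dfrac{\partial v_1(x)}{\partial x} & \dfrac{\partial v_2(x)}{\partial x} & \cdots & \dfrac{\partial v_N(x)}{\partial x}
\end{bmatrix} \in \mathbb{R}^N
\tag{1862}
\]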
Dattorro, Convex Optimization Euclidean Distance Geometry 2, Moo, v2015.07.21.


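while the second-order gradient is

\[
\nabla^2 v(x) \triangleq
\begin{bmatrix}
\dfrac{\partial^2 v_1(x)}{\partial x^2} & \dfrac{\partial^2 v_2(x)}{\partial x^2} & \cdots & \dfrac{\partial^2 v_N(x)}{\partial x^2}
\end{bmatrix} \in \mathbb{R}^N
\tag{1863}
\]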

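Gradient of vector-valued function $h(x) : \mathbb{R}^K \to \mathbb{R}^N$ on vector domain is

\[
\nabla h(x) \triangleq
\begin{bmatrix}
\dfrac{\partial h_1(x)}{\partial x_1} & \dfrac{\partial h_2(x)}{\partial x_1} & \cdots & \dfrac{\partial h_N(x)}{\partial x_1} \\[6pt]
\dfrac{\partial h_1(x)}{\partial x_2} & \dfrac{\partial h_2(x)}{\partial x_2} & \cdots & \dfrac{\partial h_N(x)}{\partial x_2} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial h_1(x)}{\partial x_K} & \dfrac{\partial h_2(x)}{\partial x_K} & \cdots & \dfrac{\partial h_N(x)}{\partial x_K}
\end{bmatrix}
= \begin{bmatrix} \nabla h_1(x) & \nabla h_2(x) & \cdots & \nabla h_N(x) \end{bmatrix} \in \mathbb{R}^{K\times N}
\tag{1864}
\]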


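while the second-order gradient has a three-dimensional written representation dubbed cubix;^{D.1}

\[
\nabla^2 h(x) \triangleq
\begin{bmatrix}
\nabla\dfrac{\partial h_1(x)}{\partial x_1} & \nabla\dfrac{\partial h_2(x)}{\partial x_1} & \cdots & \nabla\dfrac{\partial h_N(x)}{\partial x_1} \\[6pt]
\nabla\dfrac{\partial h_1(x)}{\partial x_2} & \nabla\dfrac{\partial h_2(x)}{\partial x_2} & \cdots & \nabla\dfrac{\partial h_N(x)}{\partial x_2} \\
\vdots & \vdots & & \vdots \\
\nabla\dfrac{\partial h_1(x)}{\partial x_K} & \nabla\dfrac{\partial h_2(x)}{\partial x_K} & \cdots & \nabla\dfrac{\partial h_N(x)}{\partial x_K}
\end{bmatrix}
= \begin{bmatrix} \nabla^2 h_1(x) & \nabla^2 h_2(x) & \cdots & \nabla^2 h_N(x) \end{bmatrix} \in \mathbb{R}^{K\times N\times K}
\tag{1865}
\]

where the gradient of each real entry is with respect to vector $x$ as in (1860).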
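The gradient of real function $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}$ on matrix domain is

\[
\nabla g(X) \triangleq
\begin{bmatrix}
\dfrac{\partial g(X)}{\partial X_{11}} & \dfrac{\partial g(X)}{\partial X_{12}} & \cdots & \dfrac{\partial g(X)}{\partial X_{1L}} \\[6pt]
\dfrac{\partial g(X)}{\partial X_{21}} & \dfrac{\partial g(X)}{\partial X_{22}} & \cdots & \dfrac{\partial g(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial g(X)}{\partial X_{K1}} & \dfrac{\partial g(X)}{\partial X_{K2}} & \cdots & \dfrac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L}
\]
\[
\phantom{\nabla g(X)} =
\begin{bmatrix}
\nabla_{X(:,1)}\, g(X) & & & \\
 & \nabla_{X(:,2)}\, g(X) & & \\
 & & \ddots & \\
 & & & \nabla_{X(:,L)}\, g(X)
\end{bmatrix} \in \mathbb{R}^{K\times 1\times L}
\tag{1866}
\]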

where gradient $\nabla_{X(:,i)}$ is with respect to the $i$th column of $X$. The strange appearance of (1866) in $\mathbb{R}^{K\times 1\times L}$ is meant to suggest a third dimension perpendicular to the page (not a diagonal matrix).

D.1 The word matrix comes from the Latin for womb; related to the prefix matri- derived from mater meaning mother.

The second-order gradient has representation

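\[
\nabla^2 g(X) \triangleq
\begin{bmatrix}
\nabla\dfrac{\partial g(X)}{\partial X_{11}} & \nabla\dfrac{\partial g(X)}{\partial X_{12}} & \cdots & \nabla\dfrac{\partial g(X)}{\partial X_{1L}} \\[6pt]
\nabla\dfrac{\partial g(X)}{\partial X_{21}} & \nabla\dfrac{\partial g(X)}{\partial X_{22}} & \cdots & \nabla\dfrac{\partial g(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\nabla\dfrac{\partial g(X)}{\partial X_{K1}} & \nabla\dfrac{\partial g(X)}{\partial X_{K2}} & \cdots & \nabla\dfrac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times K\times L}
\]
\[
\phantom{\nabla^2 g(X)} =
\begin{bmatrix}
\nabla\nabla_{X(:,1)}\, g(X) & & & \\
 & \nabla\nabla_{X(:,2)}\, g(X) & & \\
 & & \ddots & \\
 & & & \nabla\nabla_{X(:,L)}\, g(X)
\end{bmatrix} \in \mathbb{R}^{K\times 1\times L\times K\times L}
\tag{1867}
\]

where the gradient $\nabla$ is with respect to matrix $X$.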


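Gradient of vector-valued function $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^N$ on matrix domain is a cubix

\[
\nabla g(X) \triangleq
\begin{bmatrix}
\nabla_{X(:,1)}\, g_1(X) & \nabla_{X(:,1)}\, g_2(X) & \cdots & \nabla_{X(:,1)}\, g_N(X) \\
\nabla_{X(:,2)}\, g_1(X) & \nabla_{X(:,2)}\, g_2(X) & \cdots & \nabla_{X(:,2)}\, g_N(X) \\
\vdots & \vdots & & \vdots \\
\nabla_{X(:,L)}\, g_1(X) & \nabla_{X(:,L)}\, g_2(X) & \cdots & \nabla_{X(:,L)}\, g_N(X)
\end{bmatrix}
= \begin{bmatrix} \nabla g_1(X) & \nabla g_2(X) & \cdots & \nabla g_N(X) \end{bmatrix} \in \mathbb{R}^{K\times N\times L}
\tag{1868}
\]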


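while the second-order gradient has a five-dimensional representation;

\[
\nabla^2 g(X) \triangleq
\begin{bmatrix}
\nabla\nabla_{X(:,1)}\, g_1(X) & \nabla\nabla_{X(:,1)}\, g_2(X) & \cdots & \nabla\nabla_{X(:,1)}\, g_N(X) \\
\nabla\nabla_{X(:,2)}\, g_1(X) & \nabla\nabla_{X(:,2)}\, g_2(X) & \cdots & \nabla\nabla_{X(:,2)}\, g_N(X) \\
\vdots & \vdots & & \vdots \\
\nabla\nabla_{X(:,L)}\, g_1(X) & \nabla\nabla_{X(:,L)}\, g_2(X) & \cdots & \nabla\nabla_{X(:,L)}\, g_N(X)
\end{bmatrix}
= \begin{bmatrix} \nabla^2 g_1(X) & \nabla^2 g_2(X) & \cdots & \nabla^2 g_N(X) \end{bmatrix} \in \mathbb{R}^{K\times N\times L\times K\times L}
\tag{1869}
\]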

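The gradient of matrix-valued function $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}$ on matrix domain has a four-dimensional representation called quartix (fourth-order tensor)

\[
\nabla g(X) \triangleq
\begin{bmatrix}
\nabla g_{11}(X) & \nabla g_{12}(X) & \cdots & \nabla g_{1N}(X) \\
\nabla g_{21}(X) & \nabla g_{22}(X) & \cdots & \nabla g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
\nabla g_{M1}(X) & \nabla g_{M2}(X) & \cdots & \nabla g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L}
\tag{1870}
\]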


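while the second-order gradient has a six-dimensional representation

\[
\nabla^2 g(X) \triangleq
\begin{bmatrix}
\nabla^2 g_{11}(X) & \nabla^2 g_{12}(X) & \cdots & \nabla^2 g_{1N}(X) \\
\nabla^2 g_{21}(X) & \nabla^2 g_{22}(X) & \cdots & \nabla^2 g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
\nabla^2 g_{M1}(X) & \nabla^2 g_{M2}(X) & \cdots & \nabla^2 g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L\times K\times L}
\tag{1871}
\]

and so on.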

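D.1.2 Product rules for matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable $f(X)$ and $g(X)$

\[
\nabla_X \left( f(X)^T g(X) \right) = \nabla_X(f)\, g + \nabla_X(g)\, f
\tag{1872}
\]

while [54, §8.3] [333]

\[
\nabla_X \operatorname{tr}\left( f(X)^T g(X) \right) = \nabla_X \left. \left( \operatorname{tr}\left( f(X)^T g(Z) \right) + \operatorname{tr}\left( g(X)\, f(Z)^T \right) \right) \right|_{Z \leftarrow X}
\tag{1873}
\]

These expressions implicitly apply as well to scalar-, vector-, or matrix-valued functions of scalar, vector, or matrix arguments.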
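D.1.2.0.1 Example. Cubix.
Suppose $f(X) : \mathbb{R}^{2\times 2} \to \mathbb{R}^2 = X^Ta$ and $g(X) : \mathbb{R}^{2\times 2} \to \mathbb{R}^2 = Xb$. We wish to find

\[
\nabla_X \left( f(X)^T g(X) \right) = \nabla_X\, a^TX^2 b
\tag{1874}
\]

using the product rule. Formula (1872) calls for

\[
\nabla_X\, a^TX^2 b = \nabla_X(X^Ta)\, Xb + \nabla_X(Xb)\, X^Ta
\tag{1875}
\]

Consider the first of the two terms:

\[
\nabla_X(f)\, g = \nabla_X(X^Ta)\, Xb = \begin{bmatrix} \nabla (X^Ta)_1 & \nabla (X^Ta)_2 \end{bmatrix} Xb
\tag{1876}
\]

The gradient of $X^Ta$ forms a cubix in $\mathbb{R}^{2\times 2\times 2}$; a.k.a. third-order tensor.

\[
\nabla_X(X^Ta)\, Xb =
\begin{bmatrix}
\left[ \dfrac{\partial (X^Ta)_1}{\partial X_{11}} \;\; \dfrac{\partial (X^Ta)_2}{\partial X_{11}} \right] &
\left[ \dfrac{\partial (X^Ta)_1}{\partial X_{12}} \;\; \dfrac{\partial (X^Ta)_2}{\partial X_{12}} \right] \\[10pt]
\left[ \dfrac{\partial (X^Ta)_1}{\partial X_{21}} \;\; \dfrac{\partial (X^Ta)_2}{\partial X_{21}} \right] &
\left[ \dfrac{\partial (X^Ta)_1}{\partial X_{22}} \;\; \dfrac{\partial (X^Ta)_2}{\partial X_{22}} \right]
\end{bmatrix}
\begin{bmatrix} (Xb)_1 \\ (Xb)_2 \end{bmatrix} \in \mathbb{R}^{2\times 1\times 2}
\tag{1877}
\]

Each bracketed pair lies along the second dimension of the cubix, which the original drawing indicates by dotted line segments.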


Because gradient of the product (1874) requires total change with respect to change in
each entry of matrix X , the Xb vector must make an inner product with each vector in
that second dimension of the cubix indicated by dotted line segments;

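\[
\begin{aligned}
\nabla_X(X^Ta)\, Xb
&= \begin{bmatrix} [\, a_1 \;\; 0 \,] & [\, 0 \;\; a_1 \,] \\ [\, a_2 \;\; 0 \,] & [\, 0 \;\; a_2 \,] \end{bmatrix}
\begin{bmatrix} b_1 X_{11} + b_2 X_{12} \\ b_1 X_{21} + b_2 X_{22} \end{bmatrix} \in \mathbb{R}^{2\times 1\times 2} \\[4pt]
&= \begin{bmatrix} a_1(b_1 X_{11} + b_2 X_{12}) & a_1(b_1 X_{21} + b_2 X_{22}) \\ a_2(b_1 X_{11} + b_2 X_{12}) & a_2(b_1 X_{21} + b_2 X_{22}) \end{bmatrix} \in \mathbb{R}^{2\times 2} \\[4pt]
&= ab^TX^T
\end{aligned}
\tag{1878}
\]

where the cubix appears as a complete $2\times 2\times 2$ matrix. In like manner for the second term $\nabla_X(g)\, f$

\[
\begin{aligned}
\nabla_X(Xb)\, X^Ta
&= \begin{bmatrix} [\, b_1 \;\; 0 \,] & [\, b_2 \;\; 0 \,] \\ [\, 0 \;\; b_1 \,] & [\, 0 \;\; b_2 \,] \end{bmatrix}
\begin{bmatrix} X_{11} a_1 + X_{21} a_2 \\ X_{12} a_1 + X_{22} a_2 \end{bmatrix} \in \mathbb{R}^{2\times 1\times 2} \\[4pt]
&= X^Tab^T \in \mathbb{R}^{2\times 2}
\end{aligned}
\tag{1879}
\]

The solution

\[
\nabla_X\, a^TX^2 b = ab^TX^T + X^Tab^T
\tag{1880}
\]

can be found from Table D.2.1 or verified using (1873).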
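Result (1880) can also be checked numerically. The following is a minimal sketch, not part of the original text: it compares the closed form $ab^TX^T + X^Tab^T$ against central finite differences of $a^TX^2b$ on an arbitrarily chosen $2\times 2$ example. The helper names (`matmul`, `numeric_gradient`, `closed_form_gradient`) and the test data are assumptions made for illustration.

```python
# Finite-difference check of (1880): grad_X a^T X^2 b = a b^T X^T + X^T a b^T.
# All helper names and numeric values are illustrative assumptions.

def matmul(P, Q):
    """Plain nested-list matrix product."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(row) for row in zip(*P)]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def g(X, a, b):
    """Scalar function a^T X^2 b."""
    X2 = matmul(X, X)
    return sum(a[i] * X2[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def numeric_gradient(X, a, b, h=1e-6):
    """Entry (k, l) approximates the partial derivative of g w.r.t. X_kl."""
    n = len(X)
    G = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            Xp = [row[:] for row in X]
            Xm = [row[:] for row in X]
            Xp[k][l] += h
            Xm[k][l] -= h
            G[k][l] = (g(Xp, a, b) - g(Xm, a, b)) / (2 * h)
    return G

def closed_form_gradient(X, a, b):
    """Right-hand side of (1880)."""
    abT = outer(a, b)
    XT = transpose(X)
    left = matmul(abT, XT)    # a b^T X^T
    right = matmul(XT, abT)   # X^T a b^T
    n = len(X)
    return [[left[i][j] + right[i][j] for j in range(n)] for i in range(n)]

X = [[1.0, 2.0], [3.0, 4.0]]
a = [1.0, -2.0]
b = [0.5, 3.0]
num = numeric_gradient(X, a, b)
ana = closed_form_gradient(X, a, b)
max_err = max(abs(num[i][j] - ana[i][j]) for i in range(2) for j in range(2))
```

Because $a^TX^2b$ is quadratic in the entries of $X$, central differences are exact up to rounding, so `max_err` should be very small.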


D.1.2.1 Kronecker product

A partial remedy for venturing into hyperdimensional matrix representations, such as the cubix or quartix, is to first vectorize matrices as in (37). This device gives rise to the Kronecker product of matrices $\otimes$; a.k.a. tensor product. Although it sees reversal in the literature, [344, §2.1] we adopt the definition: for $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{p\times q}$

\[
B \otimes A \triangleq
\begin{bmatrix}
B_{11}A & B_{12}A & \cdots & B_{1q}A \\
B_{21}A & B_{22}A & \cdots & B_{2q}A \\
\vdots & \vdots & & \vdots \\
B_{p1}A & B_{p2}A & \cdots & B_{pq}A
\end{bmatrix} \in \mathbb{R}^{pm\times qn}
\tag{1881}
\]

for which $A \otimes 1 = 1 \otimes A = A$ (real unity acts like Identity).


One advantage to vectorization is existence of the traditional two-dimensional matrix representation (second-order tensor) for the second-order gradient of a real function with respect to a vectorized matrix. For example, from §A.1.1 no.33 (§D.2.1) for square $A, B \in \mathbb{R}^{n\times n}$ [182, §5.2] [13, §3]

\[
\nabla^2_{\operatorname{vec} X} \operatorname{tr}(AXBX^T)
= \nabla^2_{\operatorname{vec} X} \operatorname{vec}(X)^T (B^T \otimes A) \operatorname{vec} X
= B \otimes A^T + B^T \otimes A \in \mathbb{R}^{n^2\times n^2}
\tag{1882}
\]
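The identity behind (1882) can be exercised on a small case. The sketch below is illustrative only: it implements the Kronecker product exactly as defined in (1881), assumes column-stacking for vec, and checks that $\operatorname{tr}(AXBX^T) = \operatorname{vec}(X)^T (B^T \otimes A)\operatorname{vec} X$ and that the quadratic form of the Hessian $B \otimes A^T + B^T \otimes A$ equals twice that value. Function names and test matrices are assumptions.

```python
# Check tr(A X B X^T) = vec(X)^T (B^T (x) A) vec(X) using definition (1881).
# Helper names and numeric values are illustrative assumptions.

def kron(B, A):
    """Kronecker product per (1881): block (i, j) of the result is B[i][j] * A."""
    p, q = len(B), len(B[0])
    m, n = len(A), len(A[0])
    return [[B[i][j] * A[r][c] for j in range(q) for c in range(n)]
            for i in range(p) for r in range(m)]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(row) for row in zip(*P)]

def vec(X):
    """Stack the columns of X into one list (assumed vec convention)."""
    return [X[r][c] for c in range(len(X[0])) for r in range(len(X))]

def quad_form(v, M):
    return sum(v[i] * M[i][j] * v[j]
               for i in range(len(v)) for j in range(len(v)))

A = [[1.0, 2.0], [0.0, -1.0]]
B = [[3.0, 1.0], [4.0, 2.0]]
X = [[0.5, -1.0], [2.0, 1.5]]

AXBXt = matmul(matmul(matmul(A, X), B), transpose(X))
lhs = AXBXt[0][0] + AXBXt[1][1]            # tr(A X B X^T)
v = vec(X)
rhs = quad_form(v, kron(transpose(B), A))  # vec(X)^T (B^T (x) A) vec(X)

K1 = kron(B, transpose(A))
K2 = kron(transpose(B), A)
H = [[K1[i][j] + K2[i][j] for j in range(4)] for i in range(4)]  # Hessian (1882)
quad = quad_form(v, H)                     # equals 2 tr(A X B X^T)
```

The Hessian `H` is symmetric by construction because $(B^T \otimes A)^T = B \otimes A^T$.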


The disadvantages are a large new but known set of algebraic rules (§A.1.1) and the fact that mere use of vectorization does not generally guarantee two-dimensional matrix representation of gradients.
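Another application of the Kronecker product is to reverse order of appearance in a matrix product: Suppose we wish to weight the columns of a matrix $S \in \mathbb{R}^{M\times N}$, for example, by respective entries $w_i$ from the main diagonal in

\[
W \triangleq
\begin{bmatrix} w_1 & & 0 \\ & \ddots & \\ 0^T & & w_N \end{bmatrix} \in \mathbb{S}^N
\tag{1883}
\]

A conventional means for accomplishing column weighting is to multiply $S$ by diagonal matrix $W$ on the right-hand side:

\[
S\,W = S \begin{bmatrix} w_1 & & 0 \\ & \ddots & \\ 0^T & & w_N \end{bmatrix}
= \begin{bmatrix} S(:,1)\,w_1 & \cdots & S(:,N)\,w_N \end{bmatrix} \in \mathbb{R}^{M\times N}
\tag{1884}
\]

To reverse product order such that diagonal matrix $W$ instead appears to the left of $S$: for $I \in \mathbb{S}^M$ (Law)

\[
S\,W = \left( \delta(W)^T \otimes I \right)
\begin{bmatrix}
S(:,1) & 0 & \cdots & 0 \\
0 & S(:,2) & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & S(:,N)
\end{bmatrix} \in \mathbb{R}^{M\times N}
\tag{1885}
\]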
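To instead weight the rows of $S$ via diagonal matrix $W \in \mathbb{S}^M$, for $I \in \mathbb{S}^N$

\[
W S =
\begin{bmatrix}
S(1,:) & 0 & \cdots & 0 \\
0 & S(2,:) & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & S(M,:)
\end{bmatrix}
\left( \delta(W) \otimes I \right) \in \mathbb{R}^{M\times N}
\tag{1886}
\]

For any matrices of like size, $S, Y \in \mathbb{R}^{M\times N}$

\[
S \circ Y =
\begin{bmatrix} \delta(Y(:,1)) & \cdots & \delta(Y(:,N)) \end{bmatrix}
\begin{bmatrix}
S(:,1) & 0 & \cdots & 0 \\
0 & S(:,2) & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & S(:,N)
\end{bmatrix} \in \mathbb{R}^{M\times N}
\tag{1887}
\]

which converts a Hadamard product into a standard matrix product. In the special case that $S = s$ and $Y = y$ are vectors in $\mathbb{R}^M$

\[
s \circ y = \delta(s)\, y
\tag{1888}
\]

\[
s^T \otimes y = y s^T , \qquad s \otimes y^T = s y^T
\tag{1889}
\]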
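The column-weighting identity (1885) and the first relation in (1889) are easy to confirm on small data. The following sketch is an illustration under assumptions: `kron` follows definition (1881), `block_diag_columns` builds the block-diagonal matrix of columns appearing in (1885), and the numeric values are arbitrary.

```python
# Check (1885): S W = (delta(W)^T (x) I) blockdiag(S(:,1), ..., S(:,N)),
# and (1889): s^T (x) y = y s^T. Names and data are illustrative assumptions.

def kron(B, A):
    """Kronecker product per (1881): block (i, j) is B[i][j] * A."""
    p, q = len(B), len(B[0])
    m, n = len(A), len(A[0])
    return [[B[i][j] * A[r][c] for j in range(q) for c in range(n)]
            for i in range(p) for r in range(m)]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def block_diag_columns(S):
    """MN x N matrix whose j-th diagonal block is column S(:, j)."""
    M, N = len(S), len(S[0])
    D = [[0.0] * N for _ in range(M * N)]
    for j in range(N):
        for r in range(M):
            D[j * M + r][j] = S[r][j]
    return D

S = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # M = 2, N = 3
w = [10.0, -1.0, 0.5]                    # main diagonal of W
M, N = len(S), len(S[0])

W = [[w[i] if i == j else 0.0 for j in range(N)] for i in range(N)]
SW = matmul(S, W)                                   # conventional product (1884)
left = kron([w], identity(M))                       # delta(W)^T (x) I, size M x MN
SW_kron = matmul(left, block_diag_columns(S))       # right-hand side of (1885)

s = [1.0, 2.0]
y = [3.0, -4.0]
sT_kron_y = kron([s], [[yi] for yi in y])           # s^T (x) y
y_sT = [[yi * sj for sj in s] for yi in y]          # y s^T
```

Both pairs (`SW`, `SW_kron`) and (`sT_kron_y`, `y_sT`) should agree entrywise.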

D.1.3 Chain rules for composite matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable $f(X)$ and $g(X)$ [235, §15.7]

\[
\nabla_X\, g\!\left( f(X)^T \right) = \nabla_X f^T\, \nabla_f\, g
\tag{1890}
\]

\[
\nabla_X^2\, g\!\left( f(X)^T \right) = \nabla_X \left( \nabla_X f^T\, \nabla_f\, g \right) = \nabla_X^2 f\, \nabla_f\, g + \nabla_X f^T\, \nabla_f^2\, g\, \nabla_X f
\tag{1891}
\]
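D.1.3.1 Two arguments

\[
\nabla_X\, g\!\left( f(X)^T , h(X)^T \right) = \nabla_X f^T\, \nabla_f\, g + \nabla_X h^T\, \nabla_h\, g
\tag{1892}
\]

D.1.3.1.1 Example. Chain rule for two arguments. [43, §1.1]

\[
g\!\left( f(x)^T , h(x)^T \right) = \left( f(x) + h(x) \right)^T A \left( f(x) + h(x) \right)
\tag{1893}
\]

\[
f(x) = \begin{bmatrix} x_1 \\ \varepsilon x_2 \end{bmatrix} , \qquad
h(x) = \begin{bmatrix} \varepsilon x_1 \\ x_2 \end{bmatrix}
\tag{1894}
\]

\[
\nabla_x\, g\!\left( f(x)^T , h(x)^T \right) =
\begin{bmatrix} 1 & 0 \\ 0 & \varepsilon \end{bmatrix} (A + A^T)(f + h) +
\begin{bmatrix} \varepsilon & 0 \\ 0 & 1 \end{bmatrix} (A + A^T)(f + h)
\tag{1895}
\]

\[
\nabla_x\, g\!\left( f(x)^T , h(x)^T \right) =
\begin{bmatrix} 1+\varepsilon & 0 \\ 0 & 1+\varepsilon \end{bmatrix} (A + A^T)
\left( \begin{bmatrix} x_1 \\ \varepsilon x_2 \end{bmatrix} + \begin{bmatrix} \varepsilon x_1 \\ x_2 \end{bmatrix} \right)
\tag{1896}
\]

\[
\lim_{\varepsilon \to 0} \nabla_x\, g\!\left( f(x)^T , h(x)^T \right) = (A + A^T)\, x
\tag{1897}
\]

from Table D.2.1.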

These foregoing formulae remain correct when gradient produces hyperdimensional


representation:

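D.1.4 First directional derivative

Assume that a differentiable function $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}$ has continuous first- and second-order gradients $\nabla g$ and $\nabla^2 g$ over $\operatorname{dom} g$ which is an open set. We seek simple expressions for the first and second directional derivatives in direction $Y \in \mathbb{R}^{K\times L}$: respectively, $\overset{\rightarrow Y}{dg} \in \mathbb{R}^{M\times N}$ and $\overset{\rightarrow Y}{dg^2} \in \mathbb{R}^{M\times N}$.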

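Assuming that the limit exists, we may state the partial derivative of the $mn$th entry of $g$ with respect to the $kl$th entry of $X$;

\[
\frac{\partial g_{mn}(X)}{\partial X_{kl}} = \lim_{\Delta t \to 0} \frac{g_{mn}(X + \Delta t\, e_k e_l^T) - g_{mn}(X)}{\Delta t} \in \mathbb{R}
\tag{1898}
\]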


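where $e_k$ is the $k$th standard basis vector in $\mathbb{R}^K$ while $e_l$ is the $l$th standard basis vector in $\mathbb{R}^L$. The total number of partial derivatives equals $KLMN$ while the gradient is defined in their terms; the $mn$th entry of the gradient is

\[
\nabla g_{mn}(X) =
\begin{bmatrix}
\dfrac{\partial g_{mn}(X)}{\partial X_{11}} & \dfrac{\partial g_{mn}(X)}{\partial X_{12}} & \cdots & \dfrac{\partial g_{mn}(X)}{\partial X_{1L}} \\[6pt]
\dfrac{\partial g_{mn}(X)}{\partial X_{21}} & \dfrac{\partial g_{mn}(X)}{\partial X_{22}} & \cdots & \dfrac{\partial g_{mn}(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial g_{mn}(X)}{\partial X_{K1}} & \dfrac{\partial g_{mn}(X)}{\partial X_{K2}} & \cdots & \dfrac{\partial g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L}
\tag{1899}
\]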
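while the gradient is a quartix

\[
\nabla g(X) =
\begin{bmatrix}
\nabla g_{11}(X) & \nabla g_{12}(X) & \cdots & \nabla g_{1N}(X) \\
\nabla g_{21}(X) & \nabla g_{22}(X) & \cdots & \nabla g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
\nabla g_{M1}(X) & \nabla g_{M2}(X) & \cdots & \nabla g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L}
\tag{1900}
\]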

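By simply rotating our perspective of a four-dimensional representation of gradient matrix, we find one of three useful transpositions of this quartix (connoted $T_1$):

\[
\nabla g(X)^{T_1} =
\begin{bmatrix}
\dfrac{\partial g(X)}{\partial X_{11}} & \dfrac{\partial g(X)}{\partial X_{12}} & \cdots & \dfrac{\partial g(X)}{\partial X_{1L}} \\[6pt]
\dfrac{\partial g(X)}{\partial X_{21}} & \dfrac{\partial g(X)}{\partial X_{22}} & \cdots & \dfrac{\partial g(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial g(X)}{\partial X_{K1}} & \dfrac{\partial g(X)}{\partial X_{K2}} & \cdots & \dfrac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times M\times N}
\tag{1901}
\]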

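When the limit for $\Delta t \in \mathbb{R}$ exists, it is easy to show by substitution of variables in (1898)

\[
\frac{\partial g_{mn}(X)}{\partial X_{kl}}\, Y_{kl} = \lim_{\Delta t \to 0} \frac{g_{mn}(X + \Delta t\, Y_{kl}\, e_k e_l^T) - g_{mn}(X)}{\Delta t} \in \mathbb{R}
\tag{1902}
\]

which may be interpreted as the change in $g_{mn}$ at $X$ when the change in $X_{kl}$ is equal to $Y_{kl}$, the $kl$th entry of any $Y \in \mathbb{R}^{K\times L}$. Because the total change in $g_{mn}(X)$ due to $Y$ is the sum of change with respect to each and every $X_{kl}$, the $mn$th entry of the directional derivative is the corresponding total differential [235, §15.8]

\[
dg_{mn}(X)\big|_{dX \to Y} = \sum_{k,l} \frac{\partial g_{mn}(X)}{\partial X_{kl}}\, Y_{kl} = \operatorname{tr}\!\left( \nabla g_{mn}(X)^T\, Y \right)
\tag{1903}
\]
\[
= \sum_{k,l} \lim_{\Delta t \to 0} \frac{g_{mn}(X + \Delta t\, Y_{kl}\, e_k e_l^T) - g_{mn}(X)}{\Delta t}
\tag{1904}
\]
\[
= \lim_{\Delta t \to 0} \frac{g_{mn}(X + \Delta t\, Y) - g_{mn}(X)}{\Delta t}
\tag{1905}
\]
\[
= \left. \frac{d}{dt} \right|_{t=0} g_{mn}(X + t\, Y)
\tag{1906}
\]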


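where $t \in \mathbb{R}$. Assuming finite $Y$, equation (1905) is called the Gâteaux differential [42, App.A.5] [215, §D.2.1] [377, §5.28] whose existence is implied by existence of the Fréchet differential (the sum in (1903)). [266, §7.2] Each may be understood as the change in $g_{mn}$ at $X$ when the change in $X$ is equal in magnitude and direction to $Y$.^{D.2} Hence the directional derivative,

\[
\begin{aligned}
\overset{\rightarrow Y}{dg}(X) &\triangleq
\left. \begin{bmatrix}
dg_{11}(X) & dg_{12}(X) & \cdots & dg_{1N}(X) \\
dg_{21}(X) & dg_{22}(X) & \cdots & dg_{2N}(X) \\
\vdots & \vdots & & \vdots \\
dg_{M1}(X) & dg_{M2}(X) & \cdots & dg_{MN}(X)
\end{bmatrix} \right|_{dX \to Y} \in \mathbb{R}^{M\times N} \\[6pt]
&= \begin{bmatrix}
\operatorname{tr}(\nabla g_{11}(X)^T Y) & \operatorname{tr}(\nabla g_{12}(X)^T Y) & \cdots & \operatorname{tr}(\nabla g_{1N}(X)^T Y) \\
\operatorname{tr}(\nabla g_{21}(X)^T Y) & \operatorname{tr}(\nabla g_{22}(X)^T Y) & \cdots & \operatorname{tr}(\nabla g_{2N}(X)^T Y) \\
\vdots & \vdots & & \vdots \\
\operatorname{tr}(\nabla g_{M1}(X)^T Y) & \operatorname{tr}(\nabla g_{M2}(X)^T Y) & \cdots & \operatorname{tr}(\nabla g_{MN}(X)^T Y)
\end{bmatrix} \\[6pt]
&= \begin{bmatrix}
\sum_{k,l} \frac{\partial g_{11}(X)}{\partial X_{kl}} Y_{kl} & \sum_{k,l} \frac{\partial g_{12}(X)}{\partial X_{kl}} Y_{kl} & \cdots & \sum_{k,l} \frac{\partial g_{1N}(X)}{\partial X_{kl}} Y_{kl} \\
\sum_{k,l} \frac{\partial g_{21}(X)}{\partial X_{kl}} Y_{kl} & \sum_{k,l} \frac{\partial g_{22}(X)}{\partial X_{kl}} Y_{kl} & \cdots & \sum_{k,l} \frac{\partial g_{2N}(X)}{\partial X_{kl}} Y_{kl} \\
\vdots & \vdots & & \vdots \\
\sum_{k,l} \frac{\partial g_{M1}(X)}{\partial X_{kl}} Y_{kl} & \sum_{k,l} \frac{\partial g_{M2}(X)}{\partial X_{kl}} Y_{kl} & \cdots & \sum_{k,l} \frac{\partial g_{MN}(X)}{\partial X_{kl}} Y_{kl}
\end{bmatrix}
\end{aligned}
\tag{1907}
\]

from which it follows

\[
\overset{\rightarrow Y}{dg}(X) = \sum_{k,l} \frac{\partial g(X)}{\partial X_{kl}}\, Y_{kl}
\tag{1908}
\]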

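Yet for all $X \in \operatorname{dom} g$, any $Y \in \mathbb{R}^{K\times L}$, and some open interval of $t \in \mathbb{R}$

\[
g(X + t\, Y) = g(X) + t\, \overset{\rightarrow Y}{dg}(X) + o(t^2)
\tag{1909}
\]

which is the first-order Taylor series expansion about $X$. [235, §18.4] [166, §2.3.4] Differentiation with respect to $t$ and subsequent $t$-zeroing isolates the second term of expansion. Thus differentiating and zeroing $g(X + t\, Y)$ in $t$ is an operation equivalent to individually differentiating and zeroing every entry $g_{mn}(X + t\, Y)$ as in (1906). So the directional derivative of $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}$ in any direction $Y \in \mathbb{R}^{K\times L}$ evaluated at $X \in \operatorname{dom} g$ becomes

\[
\overset{\rightarrow Y}{dg}(X) = \left. \frac{d}{dt} \right|_{t=0} g(X + t\, Y) \in \mathbb{R}^{M\times N}
\tag{1910}
\]

[294, §2.1, §5.4.5] [35, §6.3.1] which is simplest. In case of a real function $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}$

\[
\overset{\rightarrow Y}{dg}(X) = \operatorname{tr}\!\left( \nabla g(X)^T\, Y \right)
\tag{1932}
\]

D.2 Although $Y$ is a matrix, we may regard it as a vector in $\mathbb{R}^{KL}$.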



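[Figure 174 (image not reproduced): bowl surface $f(x)$ with plane slice $\partial\mathcal{H}$, tangent line $T$ at point $(\alpha , f(\alpha))$, slice curve $f(\alpha + t\, y)$, and vector $\upsilon \triangleq \begin{bmatrix} \nabla_x f(\alpha) \\ \frac{1}{2}\, \overset{\rightarrow \nabla_x f(\alpha)}{df}(\alpha) \end{bmatrix}$.]

Figure 174: Strictly convex quadratic bowl in $\mathbb{R}^2 \times \mathbb{R}$; $f(x) = x^Tx : \mathbb{R}^2 \to \mathbb{R}$ versus $x$ on some open disc in $\mathbb{R}^2$. Plane slice $\partial\mathcal{H}$ is perpendicular to function domain. Slice intersection with domain connotes bidirectional vector $y$. Slope of tangent line $T$ at point $(\alpha , f(\alpha))$ is value of directional derivative $\nabla_x f(\alpha)^T y$ (1935) at $\alpha$ in slice direction $y$. Negative gradient $-\nabla_x f(x) \in \mathbb{R}^2$ is direction of steepest descent. [414] [235, §15.6] [166] When vector $\upsilon \in \mathbb{R}^3$ entry $\upsilon_3$ is half directional derivative in gradient direction at $\alpha$ and when $\begin{bmatrix} \upsilon_1 \\ \upsilon_2 \end{bmatrix} = \nabla_x f(\alpha)$, then $-\upsilon$ points directly toward bowl bottom.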
In case $g(X) : \mathbb{R}^K \to \mathbb{R}$

\[
\overset{\rightarrow Y}{dg}(X) = \nabla g(X)^T\, Y
\tag{1935}
\]

Unlike gradient, directional derivative does not expand dimension; directional derivative (1910) retains the dimensions of $g$. The derivative with respect to $t$ makes the directional derivative resemble ordinary calculus (§D.2); e.g., when $g(X)$ is linear, $\overset{\rightarrow Y}{dg}(X) = g(Y)$. [266, §7.2]

D.1.4.1 Interpretation of directional derivative

In the case of any differentiable real function $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}$, the directional derivative of $g(X)$ at $X$ in any direction $Y$ yields the slope of $g$ along the line $\{ X + t\, Y \mid t \in \mathbb{R} \}$ through its domain evaluated at $t = 0$. For higher-dimensional functions, by (1907), this slope interpretation can be applied to each entry of the directional derivative.

Figure 174, for example, shows a plane slice of a real convex bowl-shaped function $f(x)$ along a line $\{ \alpha + t\, y \mid t \in \mathbb{R} \}$ through its domain. The slice reveals a one-dimensional real function of $t$; $f(\alpha + t\, y)$. The directional derivative at $x = \alpha$ in direction $y$ is the slope of $f(\alpha + t\, y)$ with respect to $t$ at $t = 0$. In the case of a real function having vector argument $h(X) : \mathbb{R}^K \to \mathbb{R}$, its directional derivative in the normalized direction of its gradient is the gradient magnitude. (1935) For a real function of real variable, the directional derivative evaluated at any point in the function domain is just the slope of that function there scaled by the real direction. (confer §3.6)

Directional derivative generalizes our one-dimensional notion of derivative to a multidimensional domain. When direction $Y$ coincides with a member of the standard Cartesian basis $e_k e_l^T$ (60), then a single partial derivative $\partial g(X)/\partial X_{kl}$ is obtained from directional derivative (1908); such is each entry of gradient $\nabla g(X)$ in equalities (1932) and (1935), for example.
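D.1.4.1.1 Theorem. Directional derivative optimality condition. [266, §7.4]
Suppose $f(X) : \mathbb{R}^{K\times L} \to \mathbb{R}$ is minimized on convex set $\mathcal{C} \subseteq \mathbb{R}^{K\times L}$ by $X^\star$, and the directional derivative of $f$ exists there. Then for all $X \in \mathcal{C}$

\[
\overset{\rightarrow X - X^\star}{df}(X) \geq 0
\tag{1911}
\]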

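D.1.4.1.2 Example. Simple bowl.
Bowl function (Figure 174)

\[
f(x) : \mathbb{R}^K \to \mathbb{R} \triangleq (x - a)^T (x - a) - b
\tag{1912}
\]

has function offset $-b \in \mathbb{R}$, axis of revolution at $x = a$, and positive definite Hessian (1861) everywhere in its domain (an open hyperdisc in $\mathbb{R}^K$); id est, strictly convex quadratic $f(x)$ has unique global minimum equal to $-b$ at $x = a$. A vector $-\upsilon$ based anywhere in $\operatorname{dom} f \times \mathbb{R}$ pointing toward the unique bowl-bottom is specified:

\[
\upsilon \propto \begin{bmatrix} x - a \\ f(x) + b \end{bmatrix} \in \mathbb{R}^K \times \mathbb{R}
\tag{1913}
\]

Such a vector is

\[
\upsilon = \begin{bmatrix} \nabla_x f(x) \\[4pt] \frac{1}{2}\, \overset{\rightarrow \nabla_x f(x)}{df}(x) \end{bmatrix}
\tag{1914}
\]

since the gradient is

\[
\nabla_x f(x) = 2(x - a)
\tag{1915}
\]

and the directional derivative in direction of the gradient is (1935)

\[
\overset{\rightarrow \nabla_x f(x)}{df}(x) = \nabla_x f(x)^T\, \nabla_x f(x) = 4(x - a)^T (x - a) = 4\left( f(x) + b \right)
\tag{1916}
\]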
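The bowl relations (1913) through (1916) can be confirmed numerically. This is a minimal sketch under assumptions: the point `x`, center `a`, and offset `b` are arbitrary illustrative values, and the directional derivative is estimated by a central difference along the gradient direction rather than taken from the closed form.

```python
# Numerical check of the simple-bowl example, (1912)-(1916).
# The data x, a, b and all helper names are illustrative assumptions.

def f(x, a, b):
    """Bowl function (1912): (x - a)^T (x - a) - b."""
    return sum((xi - ai) ** 2 for xi, ai in zip(x, a)) - b

def grad_f(x, a):
    """Gradient (1915): 2 (x - a)."""
    return [2.0 * (xi - ai) for xi, ai in zip(x, a)]

def directional_derivative(x, y, a, b, h=1e-6):
    """Central-difference estimate of d/dt f(x + t y) at t = 0."""
    xp = [xi + h * yi for xi, yi in zip(x, y)]
    xm = [xi - h * yi for xi, yi in zip(x, y)]
    return (f(xp, a, b) - f(xm, a, b)) / (2 * h)

x = [3.0, -1.0, 0.5]
a = [1.0, 1.0, 1.0]
b = 2.0

grad = grad_f(x, a)
dd = directional_derivative(x, grad, a, b)   # left side of (1916)
closed = 4.0 * (f(x, a, b) + b)              # right side of (1916)

upsilon = grad + [0.5 * dd]                                   # vector (1914)
target = [xi - ai for xi, ai in zip(x, a)] + [f(x, a, b) + b]  # vector in (1913)
ratios = [u / t for u, t in zip(upsilon, target)]              # all equal if proportional
```

For this data every entry of `ratios` is (numerically) 2, which is the proportionality constant between (1914) and the vector in (1913).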

D.1.5 Second directional derivative

By similar argument, it so happens: the second directional derivative is equally simple. Given $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}$ on open domain,

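\[
\nabla \frac{\partial g_{mn}(X)}{\partial X_{kl}} = \frac{\partial \nabla g_{mn}(X)}{\partial X_{kl}} =
\begin{bmatrix}
\dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{11}} & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{12}} & \cdots & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{1L}} \\[6pt]
\dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{21}} & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{22}} & \cdots & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{K1}} & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{K2}} & \cdots & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L}
\tag{1917}
\]

\[
\nabla^2 g_{mn}(X) =
\begin{bmatrix}
\nabla\dfrac{\partial g_{mn}(X)}{\partial X_{11}} & \nabla\dfrac{\partial g_{mn}(X)}{\partial X_{12}} & \cdots & \nabla\dfrac{\partial g_{mn}(X)}{\partial X_{1L}} \\[6pt]
\nabla\dfrac{\partial g_{mn}(X)}{\partial X_{21}} & \nabla\dfrac{\partial g_{mn}(X)}{\partial X_{22}} & \cdots & \nabla\dfrac{\partial g_{mn}(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\nabla\dfrac{\partial g_{mn}(X)}{\partial X_{K1}} & \nabla\dfrac{\partial g_{mn}(X)}{\partial X_{K2}} & \cdots & \nabla\dfrac{\partial g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times K\times L}
\]
\[
\phantom{\nabla^2 g_{mn}(X)} =
\begin{bmatrix}
\dfrac{\partial \nabla g_{mn}(X)}{\partial X_{11}} & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{12}} & \cdots & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{1L}} \\[6pt]
\dfrac{\partial \nabla g_{mn}(X)}{\partial X_{21}} & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{22}} & \cdots & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial \nabla g_{mn}(X)}{\partial X_{K1}} & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{K2}} & \cdots & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix}
\tag{1918}
\]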

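Rotating our perspective, we get several views of the second-order gradient:

\[
\nabla^2 g(X) =
\begin{bmatrix}
\nabla^2 g_{11}(X) & \nabla^2 g_{12}(X) & \cdots & \nabla^2 g_{1N}(X) \\
\nabla^2 g_{21}(X) & \nabla^2 g_{22}(X) & \cdots & \nabla^2 g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
\nabla^2 g_{M1}(X) & \nabla^2 g_{M2}(X) & \cdots & \nabla^2 g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L\times K\times L}
\tag{1919}
\]

\[
\nabla^2 g(X)^{T_1} =
\begin{bmatrix}
\nabla\dfrac{\partial g(X)}{\partial X_{11}} & \nabla\dfrac{\partial g(X)}{\partial X_{12}} & \cdots & \nabla\dfrac{\partial g(X)}{\partial X_{1L}} \\[6pt]
\nabla\dfrac{\partial g(X)}{\partial X_{21}} & \nabla\dfrac{\partial g(X)}{\partial X_{22}} & \cdots & \nabla\dfrac{\partial g(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\nabla\dfrac{\partial g(X)}{\partial X_{K1}} & \nabla\dfrac{\partial g(X)}{\partial X_{K2}} & \cdots & \nabla\dfrac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times M\times N\times K\times L}
\tag{1920}
\]

\[
\nabla^2 g(X)^{T_2} =
\begin{bmatrix}
\dfrac{\partial \nabla g(X)}{\partial X_{11}} & \dfrac{\partial \nabla g(X)}{\partial X_{12}} & \cdots & \dfrac{\partial \nabla g(X)}{\partial X_{1L}} \\[6pt]
\dfrac{\partial \nabla g(X)}{\partial X_{21}} & \dfrac{\partial \nabla g(X)}{\partial X_{22}} & \cdots & \dfrac{\partial \nabla g(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial \nabla g(X)}{\partial X_{K1}} & \dfrac{\partial \nabla g(X)}{\partial X_{K2}} & \cdots & \dfrac{\partial \nabla g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times K\times L\times M\times N}
\tag{1921}
\]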


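Assuming the limits exist, we may state the partial derivative of the $mn$th entry of $g$ with respect to the $kl$th and $ij$th entries of $X$;

\[
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{ij}} =
\lim_{\Delta\tau , \Delta t \to 0}
\frac{g_{mn}(X + \Delta t\, e_k e_l^T + \Delta\tau\, e_i e_j^T) - g_{mn}(X + \Delta t\, e_k e_l^T) - \left( g_{mn}(X + \Delta\tau\, e_i e_j^T) - g_{mn}(X) \right)}{\Delta\tau\, \Delta t}
\tag{1922}
\]

Differentiating (1902) and then scaling by $Y_{ij}$

\[
\begin{aligned}
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{ij}}\, Y_{kl} Y_{ij}
&= \lim_{\Delta t \to 0} \frac{\partial g_{mn}(X + \Delta t\, Y_{kl}\, e_k e_l^T) - \partial g_{mn}(X)}{\partial X_{ij}\, \Delta t}\, Y_{ij} \\[6pt]
&= \lim_{\Delta\tau , \Delta t \to 0}
\frac{g_{mn}(X + \Delta t\, Y_{kl}\, e_k e_l^T + \Delta\tau\, Y_{ij}\, e_i e_j^T) - g_{mn}(X + \Delta t\, Y_{kl}\, e_k e_l^T) - \left( g_{mn}(X + \Delta\tau\, Y_{ij}\, e_i e_j^T) - g_{mn}(X) \right)}{\Delta\tau\, \Delta t}
\end{aligned}
\tag{1923}
\]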

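which can be proved by substitution of variables in (1922). The $mn$th second-order total differential due to any $Y \in \mathbb{R}^{K\times L}$ is

\[
d^2 g_{mn}(X)\big|_{dX \to Y} = \sum_{i,j} \sum_{k,l} \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{ij}}\, Y_{kl} Y_{ij}
= \operatorname{tr}\!\left( \nabla_X \operatorname{tr}\!\left( \nabla g_{mn}(X)^T\, Y \right)^T Y \right)
\tag{1924}
\]
\[
= \sum_{i,j} \lim_{\Delta t \to 0} \frac{\partial g_{mn}(X + \Delta t\, Y) - \partial g_{mn}(X)}{\partial X_{ij}\, \Delta t}\, Y_{ij}
\tag{1925}
\]
\[
= \lim_{\Delta t \to 0} \frac{g_{mn}(X + 2\Delta t\, Y) - 2 g_{mn}(X + \Delta t\, Y) + g_{mn}(X)}{\Delta t^2}
\tag{1926}
\]
\[
= \left. \frac{d^2}{dt^2} \right|_{t=0} g_{mn}(X + t\, Y)
\tag{1927}
\]

Hence the second directional derivative,

\[
\begin{aligned}
\overset{\rightarrow Y}{dg^2}(X) &\triangleq
\left. \begin{bmatrix}
d^2 g_{11}(X) & d^2 g_{12}(X) & \cdots & d^2 g_{1N}(X) \\
d^2 g_{21}(X) & d^2 g_{22}(X) & \cdots & d^2 g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
d^2 g_{M1}(X) & d^2 g_{M2}(X) & \cdots & d^2 g_{MN}(X)
\end{bmatrix} \right|_{dX \to Y} \in \mathbb{R}^{M\times N} \\[6pt]
&= \begin{bmatrix}
\operatorname{tr}\!\left( \nabla \operatorname{tr}(\nabla g_{11}(X)^T Y)^T Y \right) & \cdots & \operatorname{tr}\!\left( \nabla \operatorname{tr}(\nabla g_{1N}(X)^T Y)^T Y \right) \\
\operatorname{tr}\!\left( \nabla \operatorname{tr}(\nabla g_{21}(X)^T Y)^T Y \right) & \cdots & \operatorname{tr}\!\left( \nabla \operatorname{tr}(\nabla g_{2N}(X)^T Y)^T Y \right) \\
\vdots & & \vdots \\
\operatorname{tr}\!\left( \nabla \operatorname{tr}(\nabla g_{M1}(X)^T Y)^T Y \right) & \cdots & \operatorname{tr}\!\left( \nabla \operatorname{tr}(\nabla g_{MN}(X)^T Y)^T Y \right)
\end{bmatrix} \\[6pt]
&= \begin{bmatrix}
\sum_{i,j} \sum_{k,l} \frac{\partial^2 g_{11}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij} & \cdots & \sum_{i,j} \sum_{k,l} \frac{\partial^2 g_{1N}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij} \\
\sum_{i,j} \sum_{k,l} \frac{\partial^2 g_{21}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij} & \cdots & \sum_{i,j} \sum_{k,l} \frac{\partial^2 g_{2N}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij} \\
\vdots & & \vdots \\
\sum_{i,j} \sum_{k,l} \frac{\partial^2 g_{M1}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij} & \cdots & \sum_{i,j} \sum_{k,l} \frac{\partial^2 g_{MN}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij}
\end{bmatrix}
\end{aligned}
\tag{1928}
\]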


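from which it follows

\[
\overset{\rightarrow Y}{dg^2}(X) = \sum_{i,j} \sum_{k,l} \frac{\partial^2 g(X)}{\partial X_{kl}\, \partial X_{ij}}\, Y_{kl} Y_{ij}
= \sum_{i,j} \frac{\partial}{\partial X_{ij}}\, \overset{\rightarrow Y}{dg}(X)\, Y_{ij}
\tag{1929}
\]

Yet for all $X \in \operatorname{dom} g$, any $Y \in \mathbb{R}^{K\times L}$, and some open interval of $t \in \mathbb{R}$

\[
g(X + t\, Y) = g(X) + t\, \overset{\rightarrow Y}{dg}(X) + \frac{1}{2!}\, t^2\, \overset{\rightarrow Y}{dg^2}(X) + o(t^3)
\tag{1930}
\]

which is the second-order Taylor series expansion about $X$. [235, §18.4] [166, §2.3.4] Differentiating twice with respect to $t$ and subsequent $t$-zeroing isolates the third term of the expansion. Thus differentiating and zeroing $g(X + t\, Y)$ in $t$ is an operation equivalent to individually differentiating and zeroing every entry $g_{mn}(X + t\, Y)$ as in (1927). So the second directional derivative of $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}$ becomes [294, §2.1, §5.4.5] [35, §6.3.1]

\[
\overset{\rightarrow Y}{dg^2}(X) = \left. \frac{d^2}{dt^2} \right|_{t=0} g(X + t\, Y) \in \mathbb{R}^{M\times N}
\tag{1931}
\]

which is again simplest. (confer (1910)) Directional derivative retains the dimensions of $g$.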

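D.1.6 directional derivative expressions

In the case of a real function $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}$, all its directional derivatives are in $\mathbb{R}$:

\[
\overset{\rightarrow Y}{dg}(X) = \operatorname{tr}\!\left( \nabla g(X)^T\, Y \right)
\tag{1932}
\]

\[
\overset{\rightarrow Y}{dg^2}(X) = \operatorname{tr}\!\left( \nabla_X \operatorname{tr}\!\left( \nabla g(X)^T\, Y \right)^T Y \right)
= \operatorname{tr}\!\left( \nabla_X\, \overset{\rightarrow Y}{dg}(X)^T\, Y \right)
\tag{1933}
\]

\[
\overset{\rightarrow Y}{dg^3}(X) = \operatorname{tr}\!\left( \nabla_X \operatorname{tr}\!\left( \nabla_X \operatorname{tr}\!\left( \nabla g(X)^T\, Y \right)^T Y \right)^T Y \right)
= \operatorname{tr}\!\left( \nabla_X\, \overset{\rightarrow Y}{dg^2}(X)^T\, Y \right)
\tag{1934}
\]

In the case $g(X) : \mathbb{R}^K \to \mathbb{R}$ has vector argument, they further simplify:

\[
\overset{\rightarrow Y}{dg}(X) = \nabla g(X)^T\, Y
\tag{1935}
\]

\[
\overset{\rightarrow Y}{dg^2}(X) = Y^T\, \nabla^2 g(X)\, Y
\tag{1936}
\]

\[
\overset{\rightarrow Y}{dg^3}(X) = \nabla_X \left( Y^T\, \nabla^2 g(X)\, Y \right)^T Y
\tag{1937}
\]

and so on.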

D.1.7 Taylor series

Series expansions of the differentiable matrix-valued function $g(X)$, of matrix argument, were given earlier in (1909) and (1930). Assuming $g(X)$ has continuous first-, second-, and third-order gradients over the open set $\operatorname{dom} g$, then for $X \in \operatorname{dom} g$ and any $Y \in \mathbb{R}^{K\times L}$ the complete Taylor series is expressed on some open interval of $\mu \in \mathbb{R}$

\[
g(X + \mu Y) = g(X) + \mu\, \overset{\rightarrow Y}{dg}(X) + \frac{1}{2!}\, \mu^2\, \overset{\rightarrow Y}{dg^2}(X) + \frac{1}{3!}\, \mu^3\, \overset{\rightarrow Y}{dg^3}(X) + o(\mu^4)
\tag{1938}
\]

or on some open interval of $\| Y \|_2$

\[
g(Y) = g(X) + \overset{\rightarrow Y - X}{dg}(X) + \frac{1}{2!}\, \overset{\rightarrow Y - X}{dg^2}(X) + \frac{1}{3!}\, \overset{\rightarrow Y - X}{dg^3}(X) + o(\| Y \|^4)
\tag{1939}
\]

which are third-order expansions about $X$. The mean value theorem from calculus is what insures finite order of the series. [235] [43, §1.1] [42, App.A.5] [215, §0.4] These somewhat unbelievable formulae imply that a function can be determined over the whole of its domain by knowing its value and all its directional derivatives at a single point $X$.
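D.1.7.0.1 Example. Inverse-matrix function.
Say $g(Y) = Y^{-1}$. From the table of §D.2.1 (algebraic, continued),

\[
\overset{\rightarrow Y}{dg}(X) = \left. \frac{d}{dt} \right|_{t=0} g(X + t\, Y) = -X^{-1} Y X^{-1}
\tag{1940}
\]

\[
\overset{\rightarrow Y}{dg^2}(X) = \left. \frac{d^2}{dt^2} \right|_{t=0} g(X + t\, Y) = 2 X^{-1} Y X^{-1} Y X^{-1}
\tag{1941}
\]

\[
\overset{\rightarrow Y}{dg^3}(X) = \left. \frac{d^3}{dt^3} \right|_{t=0} g(X + t\, Y) = -6 X^{-1} Y X^{-1} Y X^{-1} Y X^{-1}
\tag{1942}
\]

Let's find the Taylor series expansion of $g$ about $X = I$: Since $g(I) = I$, for $\| Y \|_2 < 1$ ($\mu = 1$ in (1938))

\[
g(I + Y) = (I + Y)^{-1} = I - Y + Y^2 - Y^3 + \ldots
\tag{1943}
\]

If $Y$ is small, $(I + Y)^{-1} \approx I - Y$.^{D.3} Now we find Taylor series expansion about $X$: substituting (1940) and (1941) into (1938) with $\mu = 1$, the factor $\frac{1}{2!}$ cancels the coefficient 2 of the second directional derivative, so

\[
g(X + Y) = (X + Y)^{-1} = X^{-1} - X^{-1} Y X^{-1} + X^{-1} Y X^{-1} Y X^{-1} - \ldots
\tag{1944}
\]

which reduces to (1943) when $X = I$. If $Y$ is small, $(X + Y)^{-1} \approx X^{-1} - X^{-1} Y X^{-1}$.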
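The expansion of $(X + Y)^{-1}$ about $X$ can be sanity-checked on a $2\times 2$ case. The sketch below is an illustration, not from the source: by (1938) with (1940) and (1941), the first-order truncation is $X^{-1} - X^{-1}YX^{-1}$ and the second-order term is $\frac{1}{2!}\cdot 2\, X^{-1}YX^{-1}YX^{-1} = X^{-1}YX^{-1}YX^{-1}$. The matrices `X`, `Y` and the helper names are assumptions.

```python
# Truncated Taylor expansions of (X + Y)^{-1} about X for a small 2x2 perturbation.
# All names and numeric values are illustrative assumptions.

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def inv2(P):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = P
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def add(P, Q, sign=1.0):
    return [[p + sign * q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

X = [[2.0, 1.0], [1.0, 3.0]]
Y = [[0.01, -0.02], [0.03, 0.01]]

Xi = inv2(X)
exact = inv2(add(X, Y))

first = matmul(matmul(Xi, Y), Xi)        # X^-1 Y X^-1
second = matmul(matmul(first, Y), Xi)    # X^-1 Y X^-1 Y X^-1

order1 = add(Xi, first, -1.0)            # X^-1 - X^-1 Y X^-1
order2 = add(order1, second)             # ... + X^-1 Y X^-1 Y X^-1

err1 = max_abs_diff(exact, order1)
err2 = max_abs_diff(exact, order2)
```

For this perturbation `err2` should be markedly smaller than `err1`, consistent with each added term gaining one order in $Y$.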

D.1.7.0.2 Exercise. log det. (confer [63, p.644])
Find the first three terms of a Taylor series expansion for $\log\det Y$. Specify an open interval over which the expansion holds in vicinity of $X$.

D.3 Had we instead set $g(Y) = (I + Y)^{-1}$, then the equivalent expansion would have been about $X = 0$.

D.1.8 Correspondence of gradient to derivative

From the foregoing expressions for directional derivative, we derive a relationship between
gradient with respect to matrix X and derivative with respect to real variable t :
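D.1.8.1 first-order

Removing evaluation at $t = 0$ from (1910),^{D.4} we find an expression for the directional derivative of $g(X)$ in direction $Y$ evaluated anywhere along a line $\{ X + t\, Y \mid t \in \mathbb{R} \}$ intersecting $\operatorname{dom} g$

\[
\overset{\rightarrow Y}{dg}(X + t\, Y) = \frac{d}{dt}\, g(X + t\, Y)
\tag{1945}
\]

In the general case $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}$, from (1903) and (1906) we find

\[
\operatorname{tr}\!\left( \nabla_X\, g_{mn}(X + t\, Y)^T\, Y \right) = \frac{d}{dt}\, g_{mn}(X + t\, Y)
\tag{1946}
\]

which is valid at $t = 0$, of course, when $X \in \operatorname{dom} g$. In the important case of a real function $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}$, from (1932) we have simply

\[
\operatorname{tr}\!\left( \nabla_X\, g(X + t\, Y)^T\, Y \right) = \frac{d}{dt}\, g(X + t\, Y)
\tag{1947}
\]

When, additionally, $g(X) : \mathbb{R}^K \to \mathbb{R}$ has vector argument,

\[
\nabla_X\, g(X + t\, Y)^T\, Y = \frac{d}{dt}\, g(X + t\, Y)
\tag{1948}
\]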

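D.1.8.1.1 Example. Gradient.
$g(X) = w^TX^TXw$, $X \in \mathbb{R}^{K\times L}$, $w \in \mathbb{R}^L$. Using the tables in §D.2,

\[
\operatorname{tr}\!\left( \nabla_X\, g(X + t\, Y)^T\, Y \right) = \operatorname{tr}\!\left( 2ww^T (X^T + t\, Y^T)\, Y \right)
\tag{1949}
\]
\[
= 2w^T (X^TY + t\, Y^TY)\, w
\tag{1950}
\]

Applying equivalence (1947),

\[
\frac{d}{dt}\, g(X + t\, Y) = \frac{d}{dt}\, w^T (X + t\, Y)^T (X + t\, Y)\, w
\tag{1951}
\]
\[
= w^T \left( X^TY + Y^TX + 2t\, Y^TY \right) w
\tag{1952}
\]
\[
= 2w^T (X^TY + t\, Y^TY)\, w
\tag{1953}
\]

which is the same as (1950). Hence, the equivalence is demonstrated.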


D.4 Justified by replacing $X$ with $X + t\, Y$ in (1903)-(1905); beginning,

\[
dg_{mn}(X + t\, Y)\big|_{dX \to Y} = \sum_{k,l} \frac{\partial g_{mn}(X + t\, Y)}{\partial X_{kl}}\, Y_{kl}
\]


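It is easy to extract $\nabla g(X)$ from (1953) knowing only (1947):

\[
\begin{aligned}
\operatorname{tr}\!\left( \nabla_X\, g(X + t\, Y)^T\, Y \right) &= 2w^T (X^TY + t\, Y^TY)\, w \\
&= 2 \operatorname{tr}\!\left( ww^T (X^T + t\, Y^T)\, Y \right) \\
\operatorname{tr}\!\left( \nabla_X\, g(X)^T\, Y \right) &= 2 \operatorname{tr}\!\left( ww^TX^T\, Y \right) \\
\Leftrightarrow \qquad \nabla_X\, g(X) &= 2Xww^T
\end{aligned}
\tag{1954}
\]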
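Both the extracted gradient $2Xww^T$ and equivalence (1947) can be checked by finite differences. The following is a sketch under assumptions: the matrices `X`, `Y`, vector `w`, evaluation point `t0`, and step `h` are arbitrary illustrative choices, and the helper names are hypothetical.

```python
# Check grad_X (w^T X^T X w) = 2 X w w^T and equivalence (1947) at t = t0.
# Data and helper names are illustrative assumptions.

def g(X, w):
    """g(X) = w^T X^T X w = ||X w||^2."""
    Xw = [sum(xkl * wl for xkl, wl in zip(row, w)) for row in X]
    return sum(v * v for v in Xw)

def closed_form_gradient(X, w):
    """2 X w w^T; entry (k, l) is 2 (X w)_k w_l."""
    Xw = [sum(xkl * wl for xkl, wl in zip(row, w)) for row in X]
    return [[2.0 * xk * wl for wl in w] for xk in Xw]

def shifted(X, Y, t):
    """Return X + t Y."""
    return [[x + t * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def numeric_gradient(X, w, h=1e-6):
    K, L = len(X), len(X[0])
    G = [[0.0] * L for _ in range(K)]
    for k in range(K):
        for l in range(L):
            Xp = [row[:] for row in X]
            Xm = [row[:] for row in X]
            Xp[k][l] += h
            Xm[k][l] -= h
            G[k][l] = (g(Xp, w) - g(Xm, w)) / (2 * h)
    return G

X = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]    # K = 3, L = 2
Y = [[0.3, 1.0], [-1.0, 0.2], [0.7, -0.4]]
w = [2.0, -1.0]
t0, h = 0.25, 1e-6

num = numeric_gradient(X, w)
ana = closed_form_gradient(X, w)
grad_err = max(abs(num[k][l] - ana[k][l]) for k in range(3) for l in range(2))

# Left side of (1947): tr(grad g(X + t0 Y)^T Y).
G = closed_form_gradient(shifted(X, Y, t0), w)
trace_side = sum(G[k][l] * Y[k][l] for k in range(3) for l in range(2))
# Right side of (1947): d/dt g(X + t Y) at t = t0, by central difference.
derivative_side = (g(shifted(X, Y, t0 + h), w) - g(shifted(X, Y, t0 - h), w)) / (2 * h)
```

Since $g$ is quadratic in $X$ (and hence in $t$), the central differences are exact up to rounding, so both comparisons should agree closely.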
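D.1.8.2 second-order

Likewise removing the evaluation at $t = 0$ from (1931),

\[
\overset{\rightarrow Y}{dg^2}(X + t\, Y) = \frac{d^2}{dt^2}\, g(X + t\, Y)
\tag{1955}
\]

we can find a similar relationship between second-order gradient and second derivative: In the general case $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N}$ from (1924) and (1927),

\[
\operatorname{tr}\!\left( \nabla_X \operatorname{tr}\!\left( \nabla_X\, g_{mn}(X + t\, Y)^T\, Y \right)^T Y \right) = \frac{d^2}{dt^2}\, g_{mn}(X + t\, Y)
\tag{1956}
\]

In the case of a real function $g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}$ we have, of course,

\[
\operatorname{tr}\!\left( \nabla_X \operatorname{tr}\!\left( \nabla_X\, g(X + t\, Y)^T\, Y \right)^T Y \right) = \frac{d^2}{dt^2}\, g(X + t\, Y)
\tag{1957}
\]

From (1936), the simpler case, where real function $g(X) : \mathbb{R}^K \to \mathbb{R}$ has vector argument,

\[
Y^T\, \nabla_X^2\, g(X + t\, Y)\, Y = \frac{d^2}{dt^2}\, g(X + t\, Y)
\tag{1958}
\]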

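D.1.8.2.1 Example. Second-order gradient.
We want to find $\nabla^2 g(X) \in \mathbb{R}^{K\times K\times K\times K}$ given real function $g(X) = \log\det X$ having domain $\operatorname{int} \mathbb{S}_+^K$. From the tables in §D.2,

\[
h(X) \triangleq \nabla g(X) = X^{-1} \in \operatorname{int} \mathbb{S}_+^K
\tag{1959}
\]

so $\nabla^2 g(X) = \nabla h(X)$. By (1946) and (1909), for $Y \in \mathbb{S}^K$

\[
\operatorname{tr}\!\left( \nabla h_{mn}(X)^T\, Y \right) = \left. \frac{d}{dt} \right|_{t=0} h_{mn}(X + t\, Y)
\tag{1960}
\]
\[
= \left( \left. \frac{d}{dt} \right|_{t=0} h(X + t\, Y) \right)_{mn}
\tag{1961}
\]
\[
= \left( \left. \frac{d}{dt} \right|_{t=0} (X + t\, Y)^{-1} \right)_{mn}
\tag{1962}
\]
\[
= -\left( X^{-1}\, Y\, X^{-1} \right)_{mn}
\tag{1963}
\]

Setting $Y$ to a member of $\{ e_k e_l^T \in \mathbb{R}^{K\times K} \mid k, l = 1 \ldots K \}$, and employing a property (39) of the trace function we find

\[
\nabla^2 g(X)_{mnkl} = \operatorname{tr}\!\left( \nabla h_{mn}(X)^T\, e_k e_l^T \right) = \nabla h_{mn}(X)_{kl} = -\left( X^{-1} e_k e_l^T X^{-1} \right)_{mn}
\tag{1964}
\]

\[
\nabla^2 g(X)_{kl} = \nabla h(X)_{kl} = -X^{-1} e_k e_l^T X^{-1} \in \mathbb{R}^{K\times K}
\tag{1965}
\]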

From all these first- and second-order expressions, we may generate new ones
by evaluating both sides at arbitrary t (in some open interval) but only after the
differentiation.

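D.2 Tables of gradients and derivatives

Results may be numerically proven by Romberg extrapolation. [115] When proving results for symmetric matrices algebraically, it is critical to take gradients ignoring symmetry and to then substitute symmetric entries afterward. [182] [67]

$a, b \in \mathbb{R}^n$, $x, y \in \mathbb{R}^k$, $A, B \in \mathbb{R}^{m\times n}$, $X, Y \in \mathbb{R}^{K\times L}$, $t, \mu \in \mathbb{R}$, and $i, j, k, \ell, K, L, m, n, M, N$ are integers, unless otherwise noted.

$x^\mu$ means $\delta(\delta(x)^\mu)$ for $\mu \in \mathbb{R}$; id est, entrywise vector exponentiation. $\delta$ is the main-diagonal linear operator (1504). $x^0 \triangleq 1$, $X^0 \triangleq I$ if square.

$\dfrac{d}{dx} \triangleq \begin{bmatrix} \frac{d}{dx_1} \\ \vdots \\ \frac{d}{dx_k} \end{bmatrix}$, $\overset{\rightarrow y}{dg}(x)$, $\overset{\rightarrow y}{dg^2}(x)$ (directional derivatives §D.1), $\log x$, $e^x$, $|x|$, $\operatorname{sgn} x$, $x/y$ (Hadamard quotient), $\sqrt[\circ]{x}$ (entrywise square root), etcetera, are maps $f : \mathbb{R}^k \to \mathbb{R}^k$ that maintain dimension; e.g., (§A.1.1)

\[
\frac{d}{dx}\, x^{-1} \triangleq \nabla_x\, 1^T \delta(x)^{-1} 1
\tag{1966}
\]

For $A$ a scalar or square matrix, we have the Taylor series [80, §3.6]

\[
e^A \triangleq \sum_{k=0}^{\infty} \frac{1}{k!}\, A^k
\tag{1967}
\]

Further, [348, §5.4]

\[
e^A \succ 0 \qquad \forall\, A \in \mathbb{S}^m
\tag{1968}
\]

For all square $A$ and integer $k$

\[
\det{}^k A = \det A^k
\tag{1969}
\]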
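As one way to carry out the numerical verification mentioned at the start of this section, the sketch below applies a Richardson (Romberg-style) extrapolation table to central differences and uses it to check the table entry $\frac{d}{dt}\log\det(X + t\, Y) = \operatorname{tr}\!\left( (X + t\, Y)^{-1} Y \right)$ from §D.2.4 on a $2\times 2$ case. The source only names Romberg extrapolation; the specific scheme, function names, step sizes, and matrices here are assumptions for illustration.

```python
# Romberg-style (Richardson) extrapolation of a central difference, used to
# check d/dt log det(X + tY) = tr((X + tY)^{-1} Y) for 2x2 matrices.
# The scheme, names, and data are illustrative assumptions.
import math

def shifted(X, Y, t):
    """Return X + t Y."""
    return [[x + t * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def det2(P):
    return P[0][0] * P[1][1] - P[0][1] * P[1][0]

def inv2(P):
    d = det2(P)
    return [[P[1][1] / d, -P[0][1] / d], [-P[1][0] / d, P[0][0] / d]]

def richardson_derivative(f, t, h=0.1, levels=4):
    """Extrapolate central differences with steps h, h/2, h/4, ...

    Column j cancels the error term of order h^(2j), as in Romberg's table.
    """
    R = []
    for i in range(levels):
        step = h / 2 ** i
        row = [(f(t + step) - f(t - step)) / (2 * step)]
        for j in range(1, i + 1):
            row.append(row[j - 1] + (row[j - 1] - R[i - 1][j - 1]) / (4 ** j - 1))
        R.append(row)
    return R[-1][-1]

X = [[2.0, 1.0], [1.0, 3.0]]
Y = [[1.0, 0.5], [-0.5, 2.0]]
t0 = 0.1

def log_det(t):
    return math.log(det2(shifted(X, Y, t)))   # requires det(X + tY) > 0

numeric = richardson_derivative(log_det, t0)

Zi = inv2(shifted(X, Y, t0))
table_value = sum(Zi[i][k] * Y[k][i] for i in range(2) for k in range(2))  # tr(Z^-1 Y)
```

The determinant stays positive on the whole sampling interval $[t_0 - h ,\, t_0 + h]$ for this data, so the logarithm is well defined at every evaluation.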

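D.2.1 algebraic

\[
\begin{array}{ll}
\nabla_x\, x = \nabla_x\, x^T = I \in \mathbb{R}^{k\times k} & \nabla_X X = \nabla_X X^T \triangleq I \in \mathbb{R}^{K\times L\times K\times L} \quad \text{(Identity)} \\[6pt]
\nabla_x (Ax - b) = A^T & \\
\nabla_x \left( x^TA - b^T \right) = A & \\
\nabla_x (Ax - b)^T (Ax - b) = 2A^T (Ax - b) & \\
\nabla_x^2 (Ax - b)^T (Ax - b) = 2A^TA & \\
\nabla_x \| Ax - b \| = A^T (Ax - b) / \| Ax - b \| & \\
\nabla_x\, z^T |Ax - b| = A^T \delta(z) \operatorname{sgn}(Ax - b) , \quad z_i \neq 0 \Rightarrow (Ax - b)_i \neq 0 & \\
\nabla_x\, 1^T f(|Ax - b|) = A^T \delta\!\left( \left. \frac{df(y)}{dy} \right|_{y = |Ax - b|} \right) \operatorname{sgn}(Ax - b) & \\[6pt]
\nabla_x \left( x^TAx + 2x^TBy + y^TCy \right) = \left( A + A^T \right) x + 2By & \\
\nabla_x (x + y)^T A (x + y) = \left( A + A^T \right)(x + y) & \\
\nabla_x^2 \left( x^TAx + 2x^TBy + y^TCy \right) = A + A^T & \\[6pt]
 & \nabla_X\, a^TXb = \nabla_X\, b^TX^Ta = ab^T \\
 & \nabla_X\, a^TX^2b = X^Tab^T + ab^TX^T \\
 & \nabla_X\, a^TX^{-1}b = -X^{-T}ab^TX^{-T} \\
 & \nabla_X (X^{-1})_{kl} = \dfrac{\partial X^{-1}}{\partial X_{kl}} = -X^{-1} e_k e_l^T X^{-1} , \quad \text{confer (1901) (1965)} \\[6pt]
\nabla_x\, a^Tx^Txb = 2xa^Tb & \nabla_X\, a^TX^TXb = X(ab^T + ba^T) \\
\nabla_x\, a^Txx^Tb = (ab^T + ba^T)x & \nabla_X\, a^TXX^Tb = (ab^T + ba^T)X \\
\nabla_x\, a^Tx^Txa = 2xa^Ta & \nabla_X\, a^TX^TXa = 2Xaa^T \\
\nabla_x\, a^Txx^Ta = 2aa^Tx & \nabla_X\, a^TXX^Ta = 2aa^TX \\
\nabla_x\, a^Tyx^Tb = ba^Ty & \nabla_X\, a^TYX^Tb = ba^TY \\
\nabla_x\, a^Ty^Txb = yb^Ta & \nabla_X\, a^TY^TXb = Yab^T \\
\nabla_x\, a^Txy^Tb = ab^Ty & \nabla_X\, a^TXY^Tb = ab^TY \\
\nabla_x\, a^Tx^Tyb = ya^Tb & \nabla_X\, a^TX^TYb = Yba^T
\end{array}
\]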

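algebraic continued

\[
\begin{array}{l}
\frac{d}{dt} (X + t\, Y) = Y \\[6pt]
\frac{d}{dt}\, B^T (X + t\, Y)^{-1} A = -B^T (X + t\, Y)^{-1}\, Y\, (X + t\, Y)^{-1} A \\
\frac{d}{dt}\, B^T (X + t\, Y)^{-T} A = -B^T (X + t\, Y)^{-T}\, Y^T (X + t\, Y)^{-T} A \\
\frac{d}{dt}\, B^T (X + t\, Y)^{\mu} A = \ldots , \quad -1 \leq \mu \leq 1 , \quad X, Y \in \mathbb{S}_+^M \\[6pt]
\frac{d^2}{dt^2}\, B^T (X + t\, Y)^{-1} A = 2B^T (X + t\, Y)^{-1}\, Y\, (X + t\, Y)^{-1}\, Y\, (X + t\, Y)^{-1} A \\
\frac{d^3}{dt^3}\, B^T (X + t\, Y)^{-1} A = -6B^T (X + t\, Y)^{-1}\, Y\, (X + t\, Y)^{-1}\, Y\, (X + t\, Y)^{-1}\, Y\, (X + t\, Y)^{-1} A \\[6pt]
\frac{d}{dt} \left( (X + t\, Y)^T A (X + t\, Y) \right) = Y^TAX + X^TAY + 2t\, Y^TAY \\
\frac{d^2}{dt^2} \left( (X + t\, Y)^T A (X + t\, Y) \right) = 2\, Y^TAY \\
\frac{d}{dt} \left( (X + t\, Y)^T A (X + t\, Y) \right)^{-1} = -\left( (X + t\, Y)^T A (X + t\, Y) \right)^{-1} \left( Y^TAX + X^TAY + 2t\, Y^TAY \right) \left( (X + t\, Y)^T A (X + t\, Y) \right)^{-1} \\[6pt]
\frac{d}{dt} \left( (X + t\, Y) A (X + t\, Y) \right) = YAX + XAY + 2t\, YAY \\
\frac{d^2}{dt^2} \left( (X + t\, Y) A (X + t\, Y) \right) = 2\, YAY
\end{array}
\]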

D.2.1.0.1 Exercise. Expand these tables.
Provide unfinished table entries indicated by $\ldots$ throughout §D.2.

D.2.1.0.2 Exercise. log. (§D.1.7)
Find the first four terms of the Taylor series expansion for $\log x$ about $x = 1$. Prove that $\log x \leq x - 1$; alternatively, plot the supporting hyperplane to the hypograph of $\log x$ at $\begin{bmatrix} x \\ \log x \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$.

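D.2.2 trace Kronecker

\[
\begin{array}{l}
\nabla_{\operatorname{vec} X} \operatorname{tr}(AXBX^T) = \nabla_{\operatorname{vec} X} \operatorname{vec}(X)^T (B^T \otimes A) \operatorname{vec} X = (B \otimes A^T + B^T \otimes A) \operatorname{vec} X \\[6pt]
\nabla^2_{\operatorname{vec} X} \operatorname{tr}(AXBX^T) = \nabla^2_{\operatorname{vec} X} \operatorname{vec}(X)^T (B^T \otimes A) \operatorname{vec} X = B \otimes A^T + B^T \otimes A
\end{array}
\]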

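D.2.3 trace

\[
\begin{array}{ll}
\nabla_x\, \mu x = \mu I & \nabla_X \operatorname{tr} \mu X = \nabla_X\, \mu \operatorname{tr} X = \mu I \\[6pt]
\nabla_x\, 1^T \delta(x)^{-1} 1 = \frac{d}{dx}\, x^{-1} = -x^{-2} & \nabla_X \operatorname{tr} X^{-1} = -X^{-2T} \\
\nabla_x\, 1^T \delta(x)^{-1} y = -\delta(x)^{-2} y & \nabla_X \operatorname{tr}(X^{-1}Y) = \nabla_X \operatorname{tr}(YX^{-1}) = -X^{-T} Y^T X^{-T} \\[6pt]
\frac{d}{dx}\, x^{\mu} = \mu x^{\mu - 1} & \nabla_X \operatorname{tr} X^{\mu} = \mu X^{\mu - 1} , \quad X \in \mathbb{S}^M \\
 & \nabla_X \operatorname{tr} X^{j} = jX^{(j-1)T} \\[6pt]
\nabla_x (b - a^Tx)^{-1} = (b - a^Tx)^{-2} a & \nabla_X \operatorname{tr}\!\left( (B - AX)^{-1} \right) = \left( (B - AX)^{-2} A \right)^T \\
\nabla_x (b - a^Tx)^{\mu} = -\mu (b - a^Tx)^{\mu - 1} a & \\[6pt]
\nabla_x\, x^Ty = \nabla_x\, y^Tx = y & \nabla_X \operatorname{tr}(X^TY) = \nabla_X \operatorname{tr}(YX^T) = \nabla_X \operatorname{tr}(Y^TX) = \nabla_X \operatorname{tr}(XY^T) = Y \\[6pt]
 & \nabla_X \operatorname{tr}(AXBX^T) = \nabla_X \operatorname{tr}(XBX^TA) = A^TXB^T + AXB \\
 & \nabla_X \operatorname{tr}(AXBX) = \nabla_X \operatorname{tr}(XBXA) = A^TX^TB^T + B^TX^TA^T \\
 & \nabla_X \operatorname{tr}(AXAXAX) = \nabla_X \operatorname{tr}(XAXAXA) = 3(AXAXA)^T \\
 & \nabla_X \operatorname{tr}(YX^k) = \nabla_X \operatorname{tr}(X^kY) = \sum\limits_{i=0}^{k-1} \left( X^i\, Y X^{k-1-i} \right)^T \\[6pt]
 & \nabla_X \operatorname{tr}(Y^TXX^TY) = \nabla_X \operatorname{tr}(X^TYY^TX) = 2\, YY^TX \\
 & \nabla_X \operatorname{tr}(Y^TX^TXY) = \nabla_X \operatorname{tr}(XYY^TX^T) = 2XYY^T \\[6pt]
 & \nabla_X \operatorname{tr}\!\left( (X + Y)^T (X + Y) \right) = 2(X + Y) = \nabla_X \| X + Y \|_F^2 \\
 & \nabla_X \operatorname{tr}((X + Y)(X + Y)) = 2(X + Y)^T \\[6pt]
 & \nabla_X \operatorname{tr}(A^TXB) = \nabla_X \operatorname{tr}(X^TAB^T) = AB^T \\
 & \nabla_X \operatorname{tr}(A^TX^{-1}B) = \nabla_X \operatorname{tr}(X^{-T}AB^T) = -X^{-T}AB^TX^{-T} \\[6pt]
 & \nabla_X\, a^TXb = \nabla_X \operatorname{tr}(ba^TX) = \nabla_X \operatorname{tr}(Xba^T) = ab^T \\
 & \nabla_X\, b^TX^Ta = \nabla_X \operatorname{tr}(X^Tab^T) = \nabla_X \operatorname{tr}(ab^TX^T) = ab^T \\
 & \nabla_X\, a^TX^{-1}b = \nabla_X \operatorname{tr}(X^{-T}ab^T) = -X^{-T}ab^TX^{-T} \\
 & \nabla_X\, a^TX^{\mu}b = \ldots
\end{array}
\]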

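trace continued

\[
\begin{array}{l}
\frac{d}{dt} \operatorname{tr} g(X + t\, Y) = \operatorname{tr} \frac{d}{dt}\, g(X + t\, Y) \qquad \text{[219, p.491]} \\[6pt]
\frac{d}{dt} \operatorname{tr}(X + t\, Y) = \operatorname{tr} Y \\
\frac{d}{dt} \operatorname{tr}^{j}(X + t\, Y) = j \operatorname{tr}^{j-1}(X + t\, Y) \operatorname{tr} Y \\
\frac{d}{dt} \operatorname{tr}(X + t\, Y)^{j} = j \operatorname{tr}\!\left( (X + t\, Y)^{j-1}\, Y \right) \qquad (\forall\, j) \\[6pt]
\frac{d}{dt} \operatorname{tr}((X + t\, Y)\, Y) = \operatorname{tr} Y^2 \\
\frac{d}{dt} \operatorname{tr}\!\left( (X + t\, Y)^k\, Y \right) = \frac{d}{dt} \operatorname{tr}(Y (X + t\, Y)^k) = k \operatorname{tr}\!\left( (X + t\, Y)^{k-1}\, Y^2 \right) , \quad k \in \{ 0, 1, 2 \} \\
\frac{d}{dt} \operatorname{tr}\!\left( (X + t\, Y)^k\, Y \right) = \frac{d}{dt} \operatorname{tr}(Y (X + t\, Y)^k) = \operatorname{tr} \sum\limits_{i=0}^{k-1} (X + t\, Y)^i\, Y (X + t\, Y)^{k-1-i}\, Y \\[6pt]
\frac{d}{dt} \operatorname{tr}\!\left( (X + t\, Y)^{-1}\, Y \right) = -\operatorname{tr}\!\left( (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y \right) \\
\frac{d}{dt} \operatorname{tr}\!\left( B^T (X + t\, Y)^{-1} A \right) = -\operatorname{tr}\!\left( B^T (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1} A \right) \\
\frac{d}{dt} \operatorname{tr}\!\left( B^T (X + t\, Y)^{-T} A \right) = -\operatorname{tr}\!\left( B^T (X + t\, Y)^{-T}\, Y^T (X + t\, Y)^{-T} A \right) \\
\frac{d}{dt} \operatorname{tr}\!\left( B^T (X + t\, Y)^{-k} A \right) = \ldots , \quad k > 0 \\
\frac{d}{dt} \operatorname{tr}\!\left( B^T (X + t\, Y)^{\mu} A \right) = \ldots , \quad -1 \leq \mu \leq 1 , \quad X, Y \in \mathbb{S}_+^M \\[6pt]
\frac{d^2}{dt^2} \operatorname{tr}\!\left( B^T (X + t\, Y)^{-1} A \right) = 2 \operatorname{tr}\!\left( B^T (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1} A \right) \\[6pt]
\frac{d}{dt} \operatorname{tr}\!\left( (X + t\, Y)^T A (X + t\, Y) \right) = \operatorname{tr}\!\left( Y^TAX + X^TAY + 2t\, Y^TAY \right) \\
\frac{d^2}{dt^2} \operatorname{tr}\!\left( (X + t\, Y)^T A (X + t\, Y) \right) = 2 \operatorname{tr}\!\left( Y^TAY \right) \\
\frac{d}{dt} \operatorname{tr}\!\left( \left( (X + t\, Y)^T A (X + t\, Y) \right)^{-1} \right) = -\operatorname{tr}\!\left( \left( (X + t\, Y)^T A (X + t\, Y) \right)^{-1} \left( Y^TAX + X^TAY + 2t\, Y^TAY \right) \left( (X + t\, Y)^T A (X + t\, Y) \right)^{-1} \right) \\[6pt]
\frac{d}{dt} \operatorname{tr}((X + t\, Y) A (X + t\, Y)) = \operatorname{tr}(YAX + XAY + 2t\, YAY) \\
\frac{d^2}{dt^2} \operatorname{tr}((X + t\, Y) A (X + t\, Y)) = 2 \operatorname{tr}(YAY)
\end{array}
\]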

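D.2.4 logarithmic determinant

$x \succ 0$, $\det X > 0$ on some neighborhood of $X$, and $\det(X + t\, Y) > 0$ on some open interval of $t$; otherwise, $\log(\,)$ would be discontinuous. [86, p.75]

\[
\begin{array}{ll}
\frac{d}{dx} \log x = x^{-1} & \nabla_X \log\det X = X^{-T} \\
 & \nabla_X^2 \log\det(X)_{kl} = \dfrac{\partial X^{-T}}{\partial X_{kl}} = -\left( X^{-1} e_k e_l^T X^{-1} \right)^T , \quad \text{confer (1918) (1965)} \\[6pt]
\frac{d}{dx} \log x^{-1} = -x^{-1} & \nabla_X \log\det X^{-1} = -X^{-T} \\
\frac{d}{dx} \log x^{\mu} = \mu x^{-1} & \nabla_X \log\det{}^{\mu} X = \mu X^{-T} \\
 & \nabla_X \log\det X^{\mu} = \mu X^{-T} \\
 & \nabla_X \log\det X^{k} = \nabla_X \log\det{}^{k} X = kX^{-T} \\
 & \nabla_X \log\det{}^{\mu}(X + t\, Y) = \mu (X + t\, Y)^{-T} \\[6pt]
\nabla_x \log(a^Tx + b) = a\, \dfrac{1}{a^Tx + b} & \nabla_X \log\det(AX + B) = A^T (AX + B)^{-T} \\
 & \nabla_X \log\det(I \pm A^TXA) = \pm A (I \pm A^TXA)^{-T} A^T \\
 & \nabla_X \log\det(X + t\, Y)^{k} = \nabla_X \log\det{}^{k}(X + t\, Y) = k (X + t\, Y)^{-T}
\end{array}
\]

\[
\begin{array}{l}
\frac{d}{dt} \log\det(X + t\, Y) = \operatorname{tr}\!\left( (X + t\, Y)^{-1}\, Y \right) \\
\frac{d^2}{dt^2} \log\det(X + t\, Y) = -\operatorname{tr}\!\left( (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y \right) \\
\frac{d}{dt} \log\det(X + t\, Y)^{-1} = -\operatorname{tr}\!\left( (X + t\, Y)^{-1}\, Y \right) \\
\frac{d^2}{dt^2} \log\det(X + t\, Y)^{-1} = \operatorname{tr}\!\left( (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y \right) \\[6pt]
\frac{d}{dt} \log\det\!\left( \delta(A(x + t\, y) + a)^2 + \mu I \right) = \operatorname{tr}\!\left( \left( \delta(A(x + t\, y) + a)^2 + \mu I \right)^{-1} 2\, \delta(A(x + t\, y) + a)\, \delta(Ay) \right)
\end{array}
\]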

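D.2.5 determinant

\[
\begin{array}{l}
\nabla_X \det X = \nabla_X \det X^T = \det(X)\, X^{-T} \\
\nabla_X \det X^{-1} = -\det(X^{-1})\, X^{-T} = -\det(X)^{-1} X^{-T} \\
\nabla_X \det{}^{\mu} X = \mu \det{}^{\mu}(X)\, X^{-T} \\
\nabla_X \det X^{\mu} = \mu \det(X^{\mu})\, X^{-T} \\
\nabla_X \det X^{k} = k \det{}^{k-1}(X) \left( \operatorname{tr}(X)\, I - X^T \right) , \quad X \in \mathbb{R}^{2\times 2} \\
\nabla_X \det X^{k} = \nabla_X \det{}^{k} X = k \det(X^{k})\, X^{-T} = k \det{}^{k}(X)\, X^{-T} \\
\nabla_X \det{}^{\mu}(X + t\, Y) = \mu \det{}^{\mu}(X + t\, Y)\, (X + t\, Y)^{-T} \\
\nabla_X \det(X + t\, Y)^{k} = \nabla_X \det{}^{k}(X + t\, Y) = k \det{}^{k}(X + t\, Y)\, (X + t\, Y)^{-T} \\[6pt]
\frac{d}{dt} \det(X + t\, Y) = \det(X + t\, Y) \operatorname{tr}\!\left( (X + t\, Y)^{-1}\, Y \right) \\
\frac{d^2}{dt^2} \det(X + t\, Y) = \det(X + t\, Y) \left( \operatorname{tr}^2\!\left( (X + t\, Y)^{-1}\, Y \right) - \operatorname{tr}\!\left( (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y \right) \right) \\
\frac{d}{dt} \det(X + t\, Y)^{-1} = -\det(X + t\, Y)^{-1} \operatorname{tr}\!\left( (X + t\, Y)^{-1}\, Y \right) \\
\frac{d^2}{dt^2} \det(X + t\, Y)^{-1} = \det(X + t\, Y)^{-1} \left( \operatorname{tr}^2\!\left( (X + t\, Y)^{-1}\, Y \right) + \operatorname{tr}\!\left( (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y \right) \right) \\
\frac{d}{dt} \det{}^{\mu}(X + t\, Y) = \mu \det{}^{\mu}(X + t\, Y) \operatorname{tr}\!\left( (X + t\, Y)^{-1}\, Y \right)
\end{array}
\]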

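D.2.6 logarithmic

Matrix logarithm.

\[
\begin{array}{l}
\frac{d}{dt} \log(X + t\, Y)^{\mu} = \mu\, Y (X + t\, Y)^{-1} = \mu (X + t\, Y)^{-1}\, Y , \qquad XY = YX \\[6pt]
\frac{d}{dt} \log(I - t\, Y)^{\mu} = -\mu\, Y (I - t\, Y)^{-1} = -\mu (I - t\, Y)^{-1}\, Y \qquad \text{[219, p.493]}
\end{array}
\]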

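D.2.7 exponential

Matrix exponential. [80, §3.6, §4.5] [348, §5.4]

\[
\begin{array}{l}
\nabla_X\, e^{\operatorname{tr}(Y^TX)} = \nabla_X \det e^{Y^TX} = e^{\operatorname{tr}(Y^TX)}\, Y \qquad (\forall\, X, Y) \\
\nabla_X \operatorname{tr} e^{YX} = e^{Y^TX^T}\, Y^T = Y^T e^{X^TY^T} \\[6pt]
\nabla_x\, 1^T e^{Ax} = A^T e^{Ax} \\
\nabla_x\, 1^T e^{|Ax|} = A^T \delta(\operatorname{sgn}(Ax))\, e^{|Ax|} , \qquad (Ax)_i \neq 0 \\[6pt]
\nabla_x \log(1^T e^{x}) = \dfrac{1}{1^T e^{x}}\, e^{x} \\
\nabla_x^2 \log(1^T e^{x}) = \dfrac{1}{1^T e^{x}} \left( \delta(e^{x}) - \dfrac{1}{1^T e^{x}}\, e^{x} e^{x T} \right) \\[6pt]
\nabla_x \prod\limits_{i=1}^{k} x_i^{\frac{1}{k}} = \dfrac{1}{k} \left( \prod\limits_{i=1}^{k} x_i^{\frac{1}{k}} \right) 1/x \\
\nabla_x^2 \prod\limits_{i=1}^{k} x_i^{\frac{1}{k}} = -\dfrac{1}{k} \left( \prod\limits_{i=1}^{k} x_i^{\frac{1}{k}} \right) \left( \delta(x)^{-2} - \dfrac{1}{k}\, (1/x)(1/x)^T \right) \\[6pt]
\frac{d}{dt}\, e^{tY} = e^{tY}\, Y = Y e^{tY} \\
\frac{d}{dt}\, e^{X + t\, Y} = e^{X + t\, Y}\, Y = Y e^{X + t\, Y} , \qquad XY = YX \\
\frac{d^2}{dt^2}\, e^{X + t\, Y} = e^{X + t\, Y}\, Y^2 = Y e^{X + t\, Y}\, Y = Y^2 e^{X + t\, Y} , \qquad XY = YX \\[6pt]
\frac{d^{\,j}}{dt^{\,j}}\, e^{\operatorname{tr}(X + t\, Y)} = e^{\operatorname{tr}(X + t\, Y)} \operatorname{tr}^{j}(Y)
\end{array}
\]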
