
Linear Algebra & Analysis Review

As Covered In Class
UW EE/AA/ME 578 Convex Optimization

January 8, 2020

1 Four Fundamental Subspaces


Let A ∈ Rm×n . The range or column space of A, denoted by R(A) ⊆ Rm , is the set of all linear
combinations (span) of the columns of A,

R(A) = {Ax | x ∈ Rn }. (1)


dim R(A) = r = rank A ≤ min{m, n}. A matrix is full rank if r = min{m, n}.
The nullspace or kernel of A, denoted by N (A) ⊆ Rn , is the set of vectors mapped into zero by
A,
N (A) = {x | Ax = 0}. (2)
dim N (A) = n − r.
The row space of A, denoted by R(AT ) ⊆ Rn , is the set of all linear combinations (span) of its
rows,
R(AT ) = {AT x | x ∈ Rm }. (3)
dim R(AT ) = r.
The left nullspace of A, denoted by N (AT) ⊆ Rm, is the set of all vectors mapped to zero by AT,
N (AT ) = {x | AT x = 0}. (4)
dim N (AT ) = m − r.

2 Eigenvalues and Their Associated Properties


1. Trace
The trace of a matrix A ∈ Rn×n is defined as
tr(A) = Σ_{i=1}^n aii    (5)
      = Σ_{i=1}^n λi    (6)

where aii and λi , i = 1, . . . , n, are diagonal elements and eigenvalues of A, respectively. We
can derive (6) from (5) in at least two ways.

(a) Method 1: We write out the characteristic polynomial of A.

pA(t) = det(tI − A) = (t − λ1)(t − λ2) · · · (t − λn)

Expanding det(tI − A) and comparing the coefficients of tn−1 on both sides shows that the sum of the diagonal entries of A equals the sum of its eigenvalues.
Note that the first definition of pA (t) gives us the fact that A ∈ Rn×n has exactly n
eigenvalues, counting multiplicities.
(b) Method 2:
First recall the cyclic permutation property of the trace:

tr(ABC) = tr(BCA) = tr(CAB)

This follows directly from the definition of the trace.


(The cyclic permutation property comes up frequently in convexity proofs, usually via the trick
xT Ax = tr(xT Ax) = tr(Ax xT).)
We also need Schur's Unitary Triangularization Theorem, which states that any matrix A ∈ Rn×n is unitarily equivalent to an upper triangular matrix T whose diagonal entries are the eigenvalues of A:

A = U∗ T U

where U is unitary and T is upper triangular.


Using these, we have:

tr(A) = tr(U∗ T U) = tr(U U∗ T) = tr(T) = Σ_{i=1}^n λi.

This proves (6).
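As a quick numerical sanity check, the following minimal NumPy sketch (on arbitrary random matrices) verifies the cyclic permutation property and the identity (6):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B, C = (rng.standard_normal((4, 4)) for _ in range(3))

    # Cyclic permutation property: tr(ABC) = tr(BCA) = tr(CAB).
    print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))  # True
    print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # True

    # The trace equals the sum of the eigenvalues (possibly complex for a
    # nonsymmetric A; their sum is real up to roundoff).
    print(np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real))  # True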

Another key property of trace is that it is a linear mapping. As we will see, this is often used
in determining convexity of more complicated functions constructed out of the trace.

tr(A + B) = tr(A) + tr(B) (7)


tr(cA) = c tr(A) (8)

The last thing we want to note about the trace of a matrix is that it is similarity invariant.
A similarity transformation of a matrix is its representation in another basis. B ∈ Rn×n is
said to be similar to A ∈ Rn×n if there exists a nonsingular S ∈ Rn×n such that

B = S −1 AS

Similar matrices have the same characteristic polynomial, as seen below:

pB(t) = det(tI − B)
      = det(tI − S−1 AS)
      = det(S−1 (tI − A)S)
      = det(S−1) det(tI − A) det(S)
      = det(tI − A) = pA(t)

This means similar matrices in fact have identical eigenvalues, including multiplicities, and
therefore also equal trace, determinant and rank.

tr(QAQ−1 ) = tr(Q−1 QA) = tr(A) (9)

2. Congruence
Two matrices A, B ∈ Rn×n are said to be congruent if there exists a nonsingular matrix Q ∈ Rn×n such that A = QBQT. If A and B are congruent and B is symmetric, then the numbers of positive, negative and zero eigenvalues of A and B are the same (Sylvester's law of inertia).
Therefore, if B ≻ 0 then A ≻ 0.
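A small NumPy sketch (the symmetric B and the generic, hence almost surely nonsingular, Q below are arbitrary random choices) checking that congruence preserves the numbers of positive, negative and zero eigenvalues:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5
    B = rng.standard_normal((n, n))
    B = (B + B.T) / 2                    # random symmetric matrix
    Q = rng.standard_normal((n, n))      # generic Q is nonsingular w.p. 1
    A = Q @ B @ Q.T                      # A is congruent to B

    sa = np.sign(np.round(np.linalg.eigvalsh(A), 12))
    sb = np.sign(np.round(np.linalg.eigvalsh(B), 12))
    print(np.array_equal(np.sort(sa), np.sort(sb)))  # True: same inertia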

3. Diagonalizability
A matrix A is diagonalizable if it is similar to a diagonal matrix. A special kind of similarity we will often use in our problems is the orthogonal diagonalization of a symmetric matrix. That is, for A ∈ Sn, the following holds:

A = QΛQT = Σ_{i=1}^n λi qi qiT    (10)

where Q ∈ Rn×n is orthonormal, i.e. QT Q = I, and Λ = diag(λ1, . . . , λn). Taking the transpose of both sides shows that any matrix admitting such a decomposition must be symmetric, A ∈ Sn. However, unitary diagonalizability holds for a much larger class of matrices, the normal matrices, which satisfy AAT = AT A (for these the diagonalizing matrix is unitary and the eigenvalues may be complex).
We can use this property to derive the fact that the eigenvalues of a symmetric matrix are
all real. This is because A ∼ Λ, therefore Λ = ΛT , which means all the eigenvalues must be
real. Therefore we can order the eigenvalues of a symmetric matrix; usually we denote λi (A)
for the i-th largest eigenvalue of A.
In fact the largest and smallest eigenvalues of a symmetric matrix have a very useful variational characterization given by the Rayleigh-Ritz quotient, named after Lord Rayleigh and Walther Ritz (Rayleigh won the Nobel Prize in Physics for the discovery of argon; clearly, he was an extremely versatile scientist!). The variational characterization for A ∈ Sn is:

λmax(A) = max_{‖x‖2=1} xT Ax    (11)
λmin(A) = min_{‖x‖2=1} xT Ax

We can see (11) easily. First note that f(x) = xT Ax is a continuous function and {x | ‖x‖2 = 1} is a compact set, so by the Weierstrass Theorem the maximum is attained. Now we diagonalize A using (10) and set y = QT x:

xT Ax = xT QΛQT x = yT Λy = Σ_{i=1}^n λi yi²    (12)

Since Q is orthogonal, ‖y‖2 = ‖x‖2 = 1. Under this condition, (12) is maximized by choosing yi = 1 for the index i corresponding to λmax and yi = 0 for all other indices.
In fact, similar variational characterizations exist for all eigenvalues of a symmetric matrix (look up the Courant-Fischer Theorem).
We can use (10) to define the k-th (k ≥ 1) power of A ∈ Sn as Ak = QΛk QT. Fractional powers are defined in the same way; the case k = 1/2 gives the square root A1/2 = QΛ1/2 QT, which is only defined for positive semidefinite matrices.
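A minimal NumPy sketch (on an arbitrary random symmetric test matrix) of the variational characterization (11)-(12) and of the square root defined via (10):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 6
    A = rng.standard_normal((n, n))
    A = (A + A.T) / 2                          # symmetric test matrix

    lam, Q = np.linalg.eigh(A)                 # eigenvalues in ascending order

    # Rayleigh-Ritz: x^T A x lies between the extreme eigenvalues on the sphere.
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    print(lam[0] - 1e-10 <= x @ A @ x <= lam[-1] + 1e-10)   # True

    # Square root of a positive definite matrix via the decomposition (10).
    P = A @ A + np.eye(n)                      # a positive definite matrix
    w, V = np.linalg.eigh(P)
    P_sqrt = V @ np.diag(np.sqrt(w)) @ V.T
    print(np.allclose(P_sqrt @ P_sqrt, P))     # True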

4. Positive Definiteness
Suppose A ∈ Sn is a real symmetric matrix with eigenvalue decomposition A = QΛQT.

(a) A is said to be positive semidefinite (PSD), denoted by A ⪰ 0, if xT Ax ≥ 0 for all x ∈ Rn. A real symmetric matrix A is said to be positive definite (PD), A ≻ 0, if xT Ax > 0 for all x ≠ 0, x ∈ Rn. The set of all positive semidefinite matrices Sn+ is a proper cone (see the definition on pp. 43, Figure 2.12 on pp. 35 and Example 2.24 on pp. 52).
(b) A is said to be negative semidefinite, denoted by A ⪯ 0, if −A ⪰ 0, and is said to be negative definite, denoted by A ≺ 0, if −A ≻ 0.
(c) A ⪰ (≻) 0 is equivalent to
• All eigenvalues of A are nonnegative (positive), i.e. λi ≥ (>) 0, i = 1, . . . , n.
• All the (leading) principal minors of A are nonnegative (positive).
• There exists a (nonsingular) square matrix B ∈ Rn×n such that A = BT B.
Example 1. • Let B ∈ Rm×n. BT B ⪰ 0 since xT BT B x = ‖Bx‖2² ≥ 0 for all x ∈ Rn. BBT ⪰ 0 since xT BBT x = ‖BT x‖2² ≥ 0 for all x ∈ Rm. BT B ≻ 0 if B has full column rank (rank B = n), and BBT ≻ 0 if B has full row rank (rank B = m).
• Given observation data X = [x1, . . . , xn] ∈ Rm×n with Σ_i xi = 0, the covariance matrix of X is C = XXT = Σ_{i=1}^n xi xiT ∈ Rm×m and the Gram matrix of X is G = XT X ∈ Rn×n, Gij = xiT xj. The covariance matrix and the Gram matrix are both positive semidefinite. See more on the Gram matrix on pp. 405-408.
(d) If A ⪰ (≻) 0, then
• tr(A) ≥ (>) 0 and det(A) ≥ (>) 0.
• the k-th root (k ≥ 1) of A is defined as A1/k = QΛ1/k QT. In particular, the square root of A is A1/2 = QΛ1/2 QT.
• If B ⪰ (≻) 0, then the inner product tr(AB) ≥ (>) 0 and the Hadamard product A ◦ B = (Aij Bij) ⪰ (≻) 0. However, the matrix product AB ⪰ 0 only when AB = BA.
(e) If A ≻ 0, then
• A−1 ≻ 0.
• A can be factored as

  A = LLT    (13)

  where L is lower triangular and nonsingular with positive diagonal elements. This is called the Cholesky factorization of A (a numerical sketch appears at the end of this list). See solving positive definite sets of equations using the Cholesky factorization on pp. 669.
(f) Let A, B, C, D ∈ Sn. The matrix inequalities (a partial order on Sn) are defined as A ⪰ B if A − B ⪰ 0 and A ≻ B if A − B ≻ 0. Many standard properties hold:
• Addition: if A ⪰ (≻) B and C ⪰ D, then A + C ⪰ (≻) B + D. In particular, if A ⪰ (≻) 0 and B ⪰ 0, then A + B ⪰ (≻) 0.
• Nonnegative (positive) scaling: if A ⪰ (≻) B and α ≥ (>) 0, then αA ⪰ (≻) αB. In particular, if A ⪰ (≻) 0 and α ≥ (>) 0, then αA ⪰ (≻) 0.
• Transitivity: if A ⪰ (≻) B and B ⪰ (≻) C, then A ⪰ (≻) C.
• Reflexivity: A ⪰ A.
• Antisymmetry: if A ⪰ B and B ⪰ A, then A = B.
• If A ≻ B, then A + C ≻ B + D for C, D small enough.
• If A ⪰ 0, B ⪰ 0 and A ⪰ B, then λi(A) ≥ λi(B), i = 1, . . . , n, tr A ≥ tr B and det(A) ≥ det(B).
See more on generalized inequalities and their properties on pp. 43-44.
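The numerical sketch referred to in item (e) above — a minimal NumPy example on an arbitrary positive definite test matrix — checks the eigenvalue characterization of positive definiteness and the Cholesky factorization (13):

    import numpy as np

    rng = np.random.default_rng(3)
    B = rng.standard_normal((5, 5))
    A = B.T @ B + np.eye(5)            # A = B^T B + I is positive definite

    # Characterization via eigenvalues: all eigenvalues are positive.
    print(np.all(np.linalg.eigvalsh(A) > 0))      # True

    # Cholesky factorization A = L L^T with L lower triangular.
    L = np.linalg.cholesky(A)
    print(np.allclose(L @ L.T, A))                # True
    print(np.all(np.diag(L) > 0))                 # True: positive diagonal

    # x^T A x > 0 for a few random nonzero x.
    for _ in range(3):
        x = rng.standard_normal(5)
        print(x @ A @ x > 0)                      # True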

3 Inner product and norms


1. The standard inner product on Rn is defined as
⟨x, y⟩ = xT y = Σ_{i=1}^n xi yi    (14)

for x, y ∈ Rn .
The matrix inner product on Rm×n is given by
⟨X, Y⟩ = tr(XT Y) = Σ_{i=1}^m Σ_{j=1}^n Xij Yij    (15)

for X, Y ∈ Rm×n . Note that the inner product of two matrices is the inner product of the
associated vectors in Rmn , obtained by stacking the elements of the matrices.
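A short NumPy sketch (arbitrary random matrices) of this observation: tr(XT Y) equals the dot product of the stacked (vectorized) matrices:

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.standard_normal((3, 4))
    Y = rng.standard_normal((3, 4))

    inner_trace = np.trace(X.T @ Y)            # <X, Y> = tr(X^T Y)
    inner_vec = X.ravel() @ Y.ravel()          # inner product of stacked elements
    print(np.isclose(inner_trace, inner_vec))  # True
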
2. A real-valued function ‖x‖ defined for all x ∈ Rn is said to be a norm if
• ‖x‖ ≥ 0; ‖x‖ = 0 if and only if x = 0 (Positivity)
• ‖αx‖ = |α| ‖x‖ for α ∈ R (Homogeneity)
• ‖x + y‖ ≤ ‖x‖ + ‖y‖ (Triangle inequality)
A matrix norm on Rm×n can be defined similarly. A vector norm measures the length of a vector. The reason we group norms and inner products together is that some norms are derived from inner products: every inner product induces a norm via ‖x‖ = ⟨x, x⟩1/2. For instance, the ℓ2-norm is associated with the standard inner product.
3. The vector ℓ1-norm on Rn is defined as

   ‖x‖1 = Σ_{i=1}^n |xi|,    (16)

   that is, the sum of the absolute values of the entries.


The matrix nuclear norm or trace norm on Rm×n is given by

‖X‖∗ = Σ_{i=1}^r σi(X) = tr((XT X)1/2),    (17)

where r is the rank of X. In this note, σi(X) denotes the i-th largest singular value of the rectangular matrix X, which is also equal to the square root of the i-th largest eigenvalue of the square matrix XT X. The nuclear norm is just the ℓ1-norm of the vector of singular values.
The matrix max-column-sum norm on Rm×n is

‖X‖1 = max_{j=1,...,n} Σ_{i=1}^m |Xij| = sup{‖Xu‖1 | ‖u‖1 ≤ 1}.    (18)

Example 2 (Sparse and low-rank structure). See, e.g., the support vector classifier, pp. 425.
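A minimal NumPy sketch of (17), computing the nuclear norm as the ℓ1-norm of the singular values (the random test matrix is arbitrary):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.standard_normal((6, 4))

    s = np.linalg.svd(X, compute_uv=False)    # singular values of X
    nuclear = s.sum()                         # ||X||_* = sum of singular values

    # tr((X^T X)^{1/2}) computed via the eigenvalues of X^T X.
    w = np.linalg.eigvalsh(X.T @ X)
    print(np.isclose(nuclear, np.sqrt(np.clip(w, 0, None)).sum()))  # True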

4. The vector ℓ2-norm or Euclidean norm on Rn is defined as

   ‖x‖2 = (xT x)1/2 = (Σ_{i=1}^n xi²)1/2.    (19)

The Frobenius norm on Rm×n is given by

‖X‖F = (Σ_{i=1}^r σi(X)²)1/2 = (tr(XT X))1/2 = (Σ_{i=1}^m Σ_{j=1}^n Xij²)1/2.    (20)

The Frobenius norm is the ℓ2-norm of the vector of singular values. Also, it is the Euclidean norm (or ℓ2-norm) of the vector obtained by stacking the elements of the matrix.
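A short NumPy check of the three expressions in (20) on an arbitrary random matrix:

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.standard_normal((5, 3))

    s = np.linalg.svd(X, compute_uv=False)
    f1 = np.sqrt((s ** 2).sum())               # l2-norm of the singular values
    f2 = np.sqrt(np.trace(X.T @ X))            # (tr(X^T X))^{1/2}
    f3 = np.linalg.norm(X.ravel())             # l2-norm of the stacked entries
    print(np.allclose([f1, f2, f3], np.linalg.norm(X, 'fro')))   # True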

5. The vector ℓ∞-norm or Chebyshev norm on Rn is defined as

   ‖x‖∞ = max{|x1|, . . . , |xn|}.    (21)

The spectral norm on Rm×n is given by the maximum singular value,

‖X‖2 = σ1(X).    (22)

The spectral norm is the ℓ∞-norm of the vector of singular values. The spectral norm gives the largest amplification factor or maximum gain of X.
The max-row-sum norm on Rm×n is

‖X‖∞ = max_{i=1,...,m} Σ_{j=1}^n |Xij| = sup{‖Xu‖∞ | ‖u‖∞ ≤ 1}.    (23)

6. The vector ℓ1-, ℓ2-, and ℓ∞-norms defined above are three special cases of a family of norms, the ℓp-norm on Rn with p ≥ 1,

   ‖x‖p = (Σ_{i=1}^n |xi|^p)^(1/p).    (24)

   The matrix max-column-sum, spectral, and max-row-sum norms above are three special cases of a family of operator norms on Rm×n defined as

   ‖X‖a,b = sup{‖Xu‖a | ‖u‖b ≤ 1}.    (25)

   The operator norms with a = b = 1, 2, ∞ reduce to the max-column-sum, spectral, and max-row-sum norms, respectively (the Frobenius norm is not an operator norm).
Example 3 (Ball and ellipsoid). The set {x | xT P x ≤ 1} with P ≻ 0 is an ellipsoid; for P = I it is the unit Euclidean ball.

7. The dual norm, denoted ‖ · ‖∗, associated with a norm ‖ · ‖ on Rn is defined as

   ‖z‖∗ = sup{zT x | ‖x‖ ≤ 1}.    (26)

   The dual norm ‖ · ‖∗ associated with a matrix norm ‖ · ‖ on Rm×n is

   ‖Z‖∗ = sup{tr(ZT X) | ‖X‖ ≤ 1}.    (27)

The dual norm of a dual norm is the original norm, ‖x‖∗∗ = ‖x‖ for all x. For vector norms, the ℓ1-norm is the dual norm of the ℓ∞-norm; the ℓ2-norm is the dual norm of itself. For matrix norms, the nuclear norm is the dual norm of the spectral norm; the Frobenius norm is the dual norm of itself. Generally, the dual norm of the ℓp-norm is the ℓq-norm, where p, q satisfy 1/p + 1/q = 1. The inequality

zT x ≤ ‖x‖ ‖z‖∗    (28)

holds for all x, z by the definition of the dual norm; for ℓp-norms it specializes to Hölder's inequality

zT x ≤ ‖z‖p ‖x‖q    (29)

where 1/p + 1/q = 1. The special case p = q = 2 is often referred to as the Cauchy-Schwarz inequality,

zT x ≤ ‖x‖2 ‖z‖2,    (30)
tr(ZT X) ≤ ‖X‖F ‖Z‖F.    (31)
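A minimal NumPy sketch checking the duality pairs and inequalities (28)-(31) on arbitrary random data:

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.standard_normal(8)
    z = rng.standard_normal(8)

    # Hölder with p = 1, q = inf, and Cauchy-Schwarz (p = q = 2).
    print(z @ x <= np.linalg.norm(z, 1) * np.linalg.norm(x, np.inf))   # True
    print(z @ x <= np.linalg.norm(z, 2) * np.linalg.norm(x, 2))        # True

    # Nuclear and spectral norms are dual: tr(Z^T X) <= ||X||_2 ||Z||_*.
    X = rng.standard_normal((5, 4))
    Z = rng.standard_normal((5, 4))
    sX = np.linalg.svd(X, compute_uv=False)
    sZ = np.linalg.svd(Z, compute_uv=False)
    print(np.trace(Z.T @ X) <= sX[0] * sZ.sum() + 1e-12)               # True
    # Frobenius norm is dual to itself.
    print(np.trace(Z.T @ X) <=
          np.linalg.norm(X, 'fro') * np.linalg.norm(Z, 'fro') + 1e-12) # True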

8. Inequalities related to norms. For all x ∈ Rn, the following inequalities hold:

   ‖x‖∞ ≤ ‖x‖2 ≤ ‖x‖1 ≤ √n ‖x‖2 ≤ n ‖x‖∞.    (32)

   For all X ∈ Rm×n with rank(X) = r, the following inequalities hold:

   ‖X‖2 ≤ ‖X‖F ≤ ‖X‖∗ ≤ √r ‖X‖F ≤ r ‖X‖2.    (33)

Each chain of inequalities corresponds to a reverse nesting of the unit norm balls: the larger the norm, the smaller its unit ball.
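A quick NumPy check of the chains (32) and (33) on arbitrary random data:

    import numpy as np

    rng = np.random.default_rng(8)
    n = 7
    x = rng.standard_normal(n)
    chain_x = [np.linalg.norm(x, np.inf), np.linalg.norm(x, 2), np.linalg.norm(x, 1),
               np.sqrt(n) * np.linalg.norm(x, 2), n * np.linalg.norm(x, np.inf)]
    print(all(a <= b + 1e-12 for a, b in zip(chain_x, chain_x[1:])))   # True: (32)

    X = rng.standard_normal((6, 5))
    r = np.linalg.matrix_rank(X)
    s = np.linalg.svd(X, compute_uv=False)
    chain_X = [s[0], np.linalg.norm(X, 'fro'), s.sum(),
               np.sqrt(r) * np.linalg.norm(X, 'fro'), r * s[0]]
    print(all(a <= b + 1e-12 for a, b in zip(chain_X, chain_X[1:])))   # True: (33)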

4 Matrix decomposition
4.1 Eigenvalue decomposition
Suppose a square matrix A ∈ Rn×n .

1. A nonzero vector q ∈ Rn is called an eigenvector of A if there exists a scalar λ such that

   Aq = λq,    (34)

   which holds for some nonzero q if and only if

   det(λI − A) = 0.    (35)

   The scalar λ is an eigenvalue of A. This can also be written as

AQ = QΛ (36)

where Q = [q1 , . . . , qn ] and Λ = diag(λ1 , . . . , λn ). The eigenvalues are roots of the charac-
teristic polynomial det(sI − A). Any matrix A ∈ Rn×n has n eigenvalues. The eigenvectors
associated with a single eigenvalue λ together with the zero vector form a linear vector sub-
space called an eigenspace.

2. The algebraic multiplicity ζ of an eigenvalue λ is the multiplicity of the corresponding root of the characteristic polynomial. The geometric multiplicity η of an eigenvalue λ is the dimension of the associated eigenspace, i.e. dim N(λI − A). For every eigenvalue, η ≤ ζ. The maximal number of linearly independent eigenvectors of A is Σ_{i=1}^{Nλ} ηi, where Nλ is the number of distinct eigenvalues of A. If ζ = η for all eigenvalues of A, i.e. A has a set of n linearly independent eigenvectors, then A is said to be diagonalizable or nondefective. If all eigenvalues of A are distinct, A is diagonalizable.

3. A is said to be diagonalizable if A can be factored as

A = QΛQ−1 (37)

where Q is invertible. This is called the eigenvalue decomposition or spectral decomposition of A.

4. Suppose A ∈ Sn is a symmetric matrix. All eigenvalues of a symmetric matrix are real, and A can be factored as

   A = QΛQT = Σ_{i=1}^n λi qi qiT    (38)

   where Q ∈ Rn×n is orthonormal, i.e. QT Q = I, and Λ = diag(λ1, . . . , λn). Usually the (real) eigenvalues are ordered, i.e., λi(A) is the i-th largest eigenvalue of A.

5. The k-th(k ≥ 1) power of A ∈ Sn is defined as Ak = QΛk QT .

4.2 Singular value decomposition and pseudo-inverse


Suppose A ∈ Rm×n with rankA = r.
1. A can be factored as

   A = UΣVT = Σ_{i=1}^r σi ui viT    (39)

   where U ∈ Rm×r is an orthonormal matrix of left singular vectors, UT U = I, V ∈ Rn×r is an orthonormal matrix of right singular vectors, VT V = I, and Σ = diag(σ1, . . . , σr) is a diagonal matrix of ordered singular values σ1 ≥ · · · ≥ σr > 0. This is called the singular value decomposition of A.

2. The singular value decomposition of A is related to the eigenvalue decompositions of the symmetric matrices AT A and AAT:

   AT A = VΣUT UΣVT = VΣ²VT = [V Ṽ] [Σ², 0; 0, 0] [V Ṽ]T    (40)
   AAT = UΣVT VΣUT = UΣ²UT = [U Ũ] [Σ², 0; 0, 0] [U Ũ]T    (41)

   where Ṽ, Ũ are any matrices for which [V Ṽ] and [U Ũ] are orthonormal (block matrices are written row-wise, with rows separated by semicolons). The right-hand expressions are eigenvalue decompositions of AT A and AAT.

   • The singular values σi are the square roots of the eigenvalues of AT A and AAT, i.e. σi(A) = √λi(AT A) = √λi(AAT) (with λi(AT A) = λi(AAT) = 0 for i > r).
   • The left singular vectors U = [u1, . . . , ur] are eigenvectors of AAT and also an orthonormal basis for R(A).
   • The right singular vectors V = [v1, . . . , vr] are eigenvectors of AT A and also an orthonormal basis for R(AT).

3. Let A = U1 Σ1 V1T be the singular value decomposition of A. The full singular value decomposition of A is

   A = U1 Σ1 V1T = [U1 U2] [Σ1, 0r×(n−r); 0(m−r)×r, 0(m−r)×(n−r)] [V1 V2]T    (42)

   where U2 ∈ Rm×(m−r), V2 ∈ Rn×(n−r) are any matrices for which [U1 U2] ∈ Rm×m and [V1 V2] ∈ Rn×n are orthonormal.

• U1 is an orthonormal basis of R(A).
• V1 is an orthonormal basis of R(AT ).
• U2 is an orthonormal basis of N (AT ).
• V2 is an orthonormal basis of N (A).

Therefore, R(A) is the orthogonal complement of N(AT), R(AT) is the orthogonal complement of N(A), and

R(A) ⊕ N(AT) = Rm,    R(AT) ⊕ N(A) = Rn,    (43)

where ⊕ refers to the orthogonal direct sum.

4. Let A have the full singular value decomposition defined above. The pseudo-inverse or Moore-Penrose inverse of A, denoted by A† ∈ Rn×m, is defined as

A† = V1 Σ−1 U1T . (44)

If rankA = n (tall, i.e. m > n, and full rank), then A† = (AT A)−1 AT ; if rankA = m (fat,
i.e. m < n, and full rank), then A† = AT (AAT )−1 . If A is square and nonsingular, A† = A−1 .

• AA† = U1 U1T ∈ Rm×m gives the projection onto R(A).
• A†A = V1 V1T ∈ Rn×n gives the projection onto R(AT).
• I − AA† = U2 U2T ∈ Rm×m gives the projection onto N(AT).
• I − A†A = V2 V2T ∈ Rn×n gives the projection onto N(A).
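A minimal NumPy sketch of the pseudo-inverse (44) and of the projections above, assuming an arbitrary tall random matrix (which has full column rank with probability one):

    import numpy as np

    rng = np.random.default_rng(9)
    A = rng.standard_normal((6, 3))                    # tall, full column rank

    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # compact SVD
    A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T             # A† = V Σ^{-1} U^T
    print(np.allclose(A_pinv, np.linalg.pinv(A)))      # True

    # For a tall full-rank A, A† = (A^T A)^{-1} A^T.
    print(np.allclose(A_pinv, np.linalg.solve(A.T @ A, A.T)))   # True

    P_range = A @ A_pinv       # projection onto R(A)
    P_rowsp = A_pinv @ A       # projection onto R(A^T) (= I here, since rank = n)
    print(np.allclose(P_range @ P_range, P_range))     # idempotent
    print(np.allclose(P_range @ A, A))                 # fixes the columns of A
    print(np.allclose(P_rowsp, np.eye(3)))             # True for full column rank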

5 Quadratic forms and matrix gain


Suppose A ∈ Rn×n is a square matrix.

1. A function f : Rn → R is called a quadratic form if it is of the form

   f(x) = xT Ax = Σ_{i=1}^n Σ_{j=1}^n Aij xi xj.    (45)

In a quadratic form we may assume A = AT since

xT Ax = xT ((A + AT )/2)x (46)

where (A + AT)/2 is called the symmetric part of A. The antisymmetric part of A is (A − AT)/2. Every matrix can be written as the sum of its symmetric and antisymmetric parts,

A = (A + AT)/2 + (A − AT)/2.    (47)

2. For a symmetric matrix A ∈ Sn, the largest and smallest eigenvalues satisfy

   λmax(A) = λ1(A) = sup_{x≠0} xT Ax / xT x,    λmin(A) = λn(A) = inf_{x≠0} xT Ax / xT x.    (48)

   Thus for any x,

   λmin(A) xT x ≤ xT Ax ≤ λmax(A) xT x.    (49)

   Analogously, the largest and smallest singular values of B ∈ Rm×n satisfy

   σmax(B) = sup_{x,y≠0} xT By / (‖x‖2 ‖y‖2) = sup_{y≠0} ‖By‖2 / ‖y‖2 = sup_{y≠0} (yT BT By)1/2 / ‖y‖2 = (λmax(BT B))1/2,    (50)

   which is also the spectral norm of B, and

   σmin(B) = inf_{y≠0} sup_{x≠0} xT By / (‖x‖2 ‖y‖2) = inf_{y≠0} ‖By‖2 / ‖y‖2 = inf_{y≠0} (yT BT By)1/2 / ‖y‖2 = (λmin(BT B))1/2.    (51)
To generalize we have the following definition.

3. The matrix gain or amplification factor of B in the direction x is defined as

   ‖Bx‖ / ‖x‖.    (52)

   The maximum (minimum) gain direction of B is the right singular vector associated with the largest (smallest) singular value, and the corresponding gain is σmax(B) (σmin(B)).
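A short NumPy sketch of the gain ‖Bx‖2/‖x‖2 and its extreme directions via the SVD (the matrix B below is an arbitrary random choice):

    import numpy as np

    rng = np.random.default_rng(10)
    B = rng.standard_normal((5, 3))

    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    gain = lambda x: np.linalg.norm(B @ x) / np.linalg.norm(x)

    # The maximum and minimum gains are the largest and smallest singular
    # values, attained at the corresponding right singular vectors.
    print(np.isclose(gain(Vt[0]), s[0]))      # True: maximum gain direction
    print(np.isclose(gain(Vt[-1]), s[-1]))    # True: minimum gain direction

    # Any other direction has gain between the two.
    x = rng.standard_normal(3)
    print(s[-1] - 1e-12 <= gain(x) <= s[0] + 1e-12)   # True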

6 Schur complement
1. Let X ∈ Rn×n be the block matrix

   X = [A, B; C, D],    (53)
where A ∈ Rk×k, B ∈ Rk×(n−k), C ∈ R(n−k)×k, D ∈ R(n−k)×(n−k). Assume A is nonsingular. To solve the linear equation

[A, B; C, D] [x; y] = [u; v],    (54)
we eliminate x from the top block equation

x = A−1 (u − By). (55)

Then substitute it into the bottom block equation and, if D − CA−1B is nonsingular, obtain

y = (D − CA−1 B)−1 (v − CA−1 u) = −(D − CA−1 B)−1 CA−1 u + (D − CA−1 B)−1 v. (56)

Substituting this into the first block equation yields

x = (A−1 + A−1 B(D − CA−1 B)−1 CA−1 )u − A−1 B(D − CA−1 B)−1 v. (57)

The Schur complement of A in X is defined as
S = D − CA−1 B ∈ R(n−k)×(n−k) , (58)
and x and y can be written in terms of S:
x = (A−1 + A−1 BS −1 CA−1 )u − A−1 BS −1 v, (59)
−1 −1 −1
y = −S CA u+S v. (60)

These two equations yield a formula for the inverse of a block matrix,

[A, B; C, D]^(−1) = [A−1 + A−1BS−1CA−1, −A−1BS−1; −S−1CA−1, S−1],    (61)

and

[A, B; C, D]^(−1) = [I, −A−1B; 0, I] [A−1, 0; 0, S−1] [I, 0; −CA−1, I].    (62)

It follows immediately that

[A, B; C, D] = [I, 0; CA−1, I] [A, 0; 0, S] [I, A−1B; 0, I].    (63)
Similarly, if D is nonsingular, the Schur complement of D in X is defined as

Ŝ = A − BD−1C ∈ Rk×k.    (64)

Then we have

[A, B; C, D]^(−1) = [I, 0; −D−1C, I] [Ŝ−1, 0; 0, D−1] [I, −BD−1; 0, I],    (65)

[A, B; C, D] = [I, BD−1; 0, I] [Ŝ, 0; 0, D] [I, 0; D−1C, I].    (66)
2. Let X ∈ Sn and let A be nonsingular, so that

   X = [A, B; BT, D] = [I, 0; BTA−1, I] [A, 0; 0, S] [I, A−1B; 0, I],    (67)

   where S = D − BTA−1B ∈ R(n−k)×(n−k) is the Schur complement of A in X, A ∈ Sk, B ∈ Rk×(n−k), D ∈ Sn−k. Note that X and

   Y = [A, 0; 0, S]    (68)

   are congruent matrices; therefore X ⪰ 0 if and only if Y ⪰ 0. The block diagonal matrix Y is positive semidefinite if and only if each diagonal block is positive semidefinite. The characterization of positive definite or semidefinite block matrices X is then as follows:

   • X ≻ 0 if and only if A ≻ 0 and S ≻ 0.
   • If A ≻ 0, then X ⪰ 0 if and only if S ⪰ 0.
3. For the interpretation of the Schur complement in terms of minimizing a quadratic form, see pp. 650 or pp. 88.
4. The Schur complement can be generalized to the case when A is singular. See more on pp.
651.
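A minimal NumPy sketch (on an arbitrary random positive definite block matrix) of the factorization (67) and of the semidefiniteness characterization via the Schur complement:

    import numpy as np

    rng = np.random.default_rng(11)
    k, n = 3, 7
    M = rng.standard_normal((n, n))
    X = M.T @ M + np.eye(n)                     # positive definite test matrix
    A, B, D = X[:k, :k], X[:k, k:], X[k:, k:]   # X = [[A, B], [B^T, D]]

    S = D - B.T @ np.linalg.solve(A, B)         # Schur complement of A in X

    # Factorization (67): X = L diag(A, S) L^T with unit lower triangular L.
    L = np.block([[np.eye(k), np.zeros((k, n - k))],
                  [B.T @ np.linalg.inv(A), np.eye(n - k)]])
    mid = np.block([[A, np.zeros((k, n - k))],
                    [np.zeros((n - k, k)), S]])
    print(np.allclose(L @ mid @ L.T, X))        # True

    # X ≻ 0 iff A ≻ 0 and S ≻ 0 (checked via eigenvalues).
    print(np.all(np.linalg.eigvalsh(A) > 0) and np.all(np.linalg.eigvalsh(S) > 0))  # True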

7 Multivariate calculus
1. Suppose that a function f is differentiable in its domain and x ∈ int domf (the interior of
the domain of f ).
The gradient of a real-valued function f : Rn → R at x is the vector ∇f (x) ∈ Rn with
elements
∇f(x)i = ∂f/∂xi,    i = 1, . . . , n.    (69)

The Jacobian of a vector-valued function f : Rn → Rm at x is the matrix Df(x) ∈ Rm×n with elements

Df(x)ij = ∂fi/∂xj.    (70)
When f is real-valued, the gradient is the transpose of the Jacobian, ∇f (x) = Df (x)T .
The first-order approximation of f at a point x is

f̂(z) = f(x) + ∇f(x)T(z − x).    (71)

2. Suppose that a real-valued function f : Rn → R is twice differentiable in its domain and x ∈ int dom f. The Hessian matrix or second-order derivative of f at x is the matrix ∇2f(x) ∈ Rn×n with elements

   ∇2f(x)ij = ∂2f(x)/(∂xi ∂xj).    (72)

   The second-order approximation of f at a point x is

   f̂(z) = f(x) + ∇f(x)T(z − x) + (1/2)(z − x)T∇2f(x)(z − x).    (73)
Example 4. • f (x) = bT x = xT b where b and x ∈ Rn ,

∇f (x) = b (74)

f (x) = Ax where A ∈ Rm×n ,


Df (x) = A (75)

• f (x) = xT x,
∇f (x) = 2x (76)
f (x) = xT Ax where A ∈ Rn×n ,

∇f (x) = (A + AT )x (77)

∇2 f (x) = A + AT (78)
If A is symmetric, i.e.A ∈ Sn , then ∇xT Ax = 2Ax, ∇2 xT Ax = 2A.

• f(X) = log det X where X ∈ Sn++ (pp. 641),

  ∇f(X) = X−1.    (79)

  The Hessian, viewed as a bilinear form acting on directions U, V ∈ Sn, is

  ∇2f(X)[U, V] = −tr(X−1 U X−1 V).    (80)
• More examples in the first- and second-order conditions of convex functions on pp. 69-
74.
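A finite-difference sanity check (a NumPy sketch on arbitrary random data) of the gradients in Example 4, e.g. ∇(xT Ax) = (A + AT)x:

    import numpy as np

    rng = np.random.default_rng(12)
    n = 5
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    x = rng.standard_normal(n)

    def num_grad(f, x, h=1e-6):
        """Central-difference approximation of the gradient of f at x."""
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = h
            g[i] = (f(x + e) - f(x - e)) / (2 * h)
        return g

    print(np.allclose(num_grad(lambda v: b @ v, x), b))                  # (74)
    print(np.allclose(num_grad(lambda v: v @ v, x), 2 * x))              # (76)
    print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x))  # (77)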

3. Suppose f : Rn → Rm and g : Rm → R are differentiable. The composition h : Rn → R is


defined as h(x) = g(f(x)). The gradient of h is

∇h(x) = Df(x)T ∇g(f(x)).    (81)

This follows from the general chain rule

Dh(x) = Dg(f (x))Df (x). (82)

As an example, suppose f : Rn → R and g : R → R. Then

∇h(x) = g′(f(x)) ∇f(x)    (83)

and

∇2h(x) = g′(f(x)) ∇2f(x) + g″(f(x)) ∇f(x)∇f(x)T.    (84)
Example 5. • Composition with an affine function: f(x) = Ax + b, where A ∈ Rm×n, b ∈ Rm:

∇h(x) = AT ∇g(Ax + b)    (85)
∇2h(x) = AT ∇2g(Ax + b) A.    (86)
• Log-sum-exp of an affine function: h : Rn → R,

  h(x) = log Σ_{i=1}^m exp(aiT x + bi),    (87)

  where a1, . . . , am ∈ Rn and b1, . . . , bm ∈ R, is the composition of f : Rn → Rm and g : Rm → R, with

  f(x) = Ax + b,    (88)

  where A ∈ Rm×n has rows a1T, . . . , amT and b = [b1, . . . , bm]T, and

  g(y) = log Σ_{i=1}^m exp yi,    (89)

  with

  ∇g(y) = (1 / Σ_{i=1}^m exp yi) [exp y1, . . . , exp ym]T,    (90)
  ∇2g(y) = diag(∇g(y)) − ∇g(y)∇g(y)T.    (91)

  Therefore,

  ∇h(x) = AT ∇g(Ax + b),    (92)
  ∇2h(x) = AT ∇2g(Ax + b) A,    (93)

  which can be written as

  ∇h(x) = AT z / (1T z),    (94)
  ∇2h(x) = AT (diag(z)/(1T z) − zzT/(1T z)²) A,    (95)

  where zi = exp(aiT x + bi), i = 1, . . . , m. See more on the log-sum-exp function on pp. 72.
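A NumPy sketch (A, b and x are arbitrary random data) checking the closed-form gradient (94) and Hessian (95) against finite differences:

    import numpy as np

    rng = np.random.default_rng(13)
    m, n = 6, 4
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)
    x = rng.standard_normal(n)

    def h(x):
        return np.log(np.sum(np.exp(A @ x + b)))

    def grad_h(x):
        z = np.exp(A @ x + b)
        return A.T @ z / z.sum()                  # formula (94)

    z = np.exp(A @ x + b)
    hess = A.T @ (np.diag(z) / z.sum() - np.outer(z, z) / z.sum() ** 2) @ A   # (95)

    eps = 1e-6
    # Gradient check by central differences of h.
    grad_fd = np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                        for e in np.eye(n)])
    print(np.allclose(grad_h(x), grad_fd, atol=1e-6))   # True

    # Hessian check via a Hessian-vector product against differences of the gradient.
    v = rng.standard_normal(n)
    hv_fd = (grad_h(x + eps * v) - grad_h(x - eps * v)) / (2 * eps)
    print(np.allclose(hess @ v, hv_fd, atol=1e-5))      # True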

8 Basic analysis
1. x ∈ C ⊆ Rn is an interior point of C if there exists an ε > 0 for which {y | ‖y − x‖ ≤ ε} ⊆ C, i.e. there exists a ball centered at x that lies entirely in C. The interior of C, int C, is the set of all interior points of C. The complement of a set C is defined as Cc = Rn \ C = {x ∈ Rn | x ∉ C}.

2. A set C is open if for any x ∈ C there exists ε > 0 for which {y | ‖y − x‖ ≤ ε} ⊆ C, i.e. int C = C. A set C is closed if its complement is open. Any union of open sets is open. The intersection of finitely many open sets is open. Any intersection of closed sets is closed. The union of finitely many closed sets is closed. An alternative definition of a closed set is that it contains the limits of all convergent sequences in C.

3. A set C is bounded if there exists M > 0 such that ‖x‖ ≤ M for all x ∈ C. A set C is compact if it is closed and bounded. Every continuous function on a compact set attains its extreme values on that set.

4. The closure of a set C is cl C = Rn \ int(Rn \ C), i.e. the complement of the interior of the complement of C. If C is closed, cl C = C.

5. The boundary of a set C is bd C = cl C \ int C. C is closed if it contains its boundary, bd C ⊆ C. C is open if it contains no boundary points, C ∩ bd C = ∅.

6. The relative interior of a set C, relint C, is its interior relative to aff C,

   relint C = {x ∈ C | B(x, r) ∩ aff C ⊆ C for some r > 0},    (96)

   where B(x, r) = {y | ‖y − x‖ ≤ r}. The relative boundary of a set C is cl C \ relint C.

9 Acknowledgements
This note was revised by Yongjin Lee and Reza Eghbali from the version of Jan 8, 2013 by De
Dennis Meng. We would like to acknowledge the work by previous TAs, Brian Hutchinson, Karthik
Mohan and De Dennis Meng.

References
[1] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge University Press, 2004.

[2] R. Horn and C. Johnson, Matrix analysis, Cambridge University Press, 1990.

[3] Stanford EE263 “Linear Dynamical Systems” course materials, http://www.stanford.edu/class/ee263/.

[4] J. Gallier, The schur complement and symmetric positive semidefinite and definite matrices,
http://www.cis.upenn.edu/~jean/schur-comp.pdf.

[5] J. Burke, Review notes for UW Math408, http://www.math.washington.edu/~burke/crs/408/notes/review/.

[6] H. L. Royden and P. M. Fitzpatrick, Real analysis, Prentice Hall, 2010.

[7] J. Wilde, I. Tecu, and T. Suzuki, Linear Algebra II: Quadratic Forms and Definiteness,
http://www.econ.brown.edu/students/Takeshi_Suzuki/Math_Camp_2011/LA2-2011.pdf.
