
An Elementary Introduction to Kalman Filtering

Yan Pei, University of Texas at Austin, ypei@cs.utexas.edu
Swarnendu Biswas, University of Texas at Austin, sbiswas@ices.utexas.edu
Donald S. Fussell, University of Texas at Austin, fussell@cs.utexas.edu
Keshav Pingali, University of Texas at Austin, pingali@cs.utexas.edu
arXiv:1710.04055v2 [cs.SY] 10 Nov 2017

ABSTRACT

Kalman filtering is a classic state estimation technique used widely in engineering applications such as statistical signal processing and control of vehicles. It is now being used to solve problems in computer systems, such as controlling the voltage and frequency of processors to minimize energy while meeting throughput requirements.

Although there are many presentations of Kalman filtering in the literature, they are usually focused on particular problem domains such as linear systems with Gaussian noise or robot navigation, which makes it difficult to understand the general principles behind Kalman filtering. In this paper, we first present the general statistical ideas behind Kalman filtering at a level accessible to anyone with a basic knowledge of probability theory and calculus, and then show how these abstract concepts can be applied to state estimation problems in linear systems. This separation of abstract concepts from applications should make it easier to apply Kalman filtering to other problems in computer systems.

KEYWORDS

Kalman filtering, data fusion, uncertainty, noise, state estimation, covariance, BLUE estimators, linear systems

1 INTRODUCTION

Kalman filtering is a state estimation technique invented in 1960 by Rudolf E. Kálmán [14]. It is used in many areas including spacecraft navigation, motion planning in robotics, signal processing, and wireless sensor networks [11, 17, 21–23] because of its small computational and memory requirements, and its ability to extract useful information from noisy data. Recent work shows how Kalman filtering can be used in controllers for computer systems [4, 12, 13, 19].

Although many presentations of Kalman filtering exist in the literature [1–3, 5–10, 16, 18, 23], they are usually focused on particular applications like robot motion or state estimation in linear systems with Gaussian noise. This can make it difficult to see how to apply Kalman filtering to other problems. The goal of this paper is to present the abstract statistical ideas behind Kalman filtering independently of particular applications, and then show how these ideas can be applied to solve particular problems such as state estimation in linear systems.

Abstractly, Kalman filtering can be viewed as an algorithm for combining imprecise estimates of some unknown value to obtain a more precise estimate of that value. We use informal methods similar to Kalman filtering in everyday life. When we want to buy a house, for example, we may ask a couple of real estate agents to give us independent estimates of what the house is worth. For now, we use the word "independent" informally to mean that the two agents are not allowed to consult with each other. If the two estimates are different, as is likely, how do we combine them into a single value to make an offer on the house? This is an example of a data fusion problem.

One solution to our real-estate problem is to take the average of the two estimates; if these estimates are x1 and x2, they are combined using the formula 0.5∗x1 + 0.5∗x2. This gives equal weight to the estimates. Suppose however we have additional information about the two agents; perhaps the first one is a novice while the second one has a lot of experience in real estate. In that case, we may have more confidence in the second estimate, so we may give it more weight by using a formula such as 0.25∗x1 + 0.75∗x2. In general, we can consider a convex combination of the two estimates, which is a formula of the form (1−α)∗x1 + α∗x2, where 0 ≤ α ≤ 1; intuitively, the more confidence we have in the second estimate, the closer α should be to 1. In the extreme case, when α=1, we discard the first estimate and use only the second estimate.

The expression (1−α)∗x1 + α∗x2 is an example of a linear estimator [15]. The statistics behind Kalman filtering tell us how to pick the optimal value of α: the weight given to an estimate should be proportional to the confidence we have in that estimate, which is intuitively reasonable.

To quantify these ideas, we need to formalize the concepts of estimates and confidence in estimates. Section 2 describes the model used in Kalman filtering. Estimates are modeled as random samples from certain distributions, and confidence in estimates is quantified in terms of the variances and covariances of these distributions.
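The convex combination above can be written as a one-line function (an illustrative sketch with made-up house prices, not taken from the paper):

```python
# Convex combination of two scalar estimates: y = (1 - alpha)*x1 + alpha*x2,
# with 0 <= alpha <= 1. alpha is the weight given to the second estimate.
def convex_combination(x1: float, x2: float, alpha: float) -> float:
    assert 0.0 <= alpha <= 1.0, "alpha must lie in [0, 1]"
    return (1.0 - alpha) * x1 + alpha * x2

# Equal weight: the simple average of the two agents' estimates.
print(convex_combination(500_000, 1_000_000, 0.5))   # 750000.0
# More weight on the experienced agent's estimate.
print(convex_combination(500_000, 1_000_000, 0.75))  # 875000.0
```

The rest of the paper is about choosing alpha optimally rather than by guesswork.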
Sections 3-5 develop the two key statistical ideas behind Kalman filtering.

(1) How should uncertain estimates be fused optimally? Section 3 shows how to fuse scalar estimates such as house prices optimally. It is also shown that the problem of fusing more than two estimates can be reduced to the problem of fusing two estimates at a time, without any loss in the quality of the final estimate. Section 4 extends these results to estimates that are vectors, such as state vectors representing the estimated position and velocity of a robot or spacecraft. The math is more complicated than in the scalar case but the basic ideas remain the same, except that instead of working with variances of scalar estimates, we must work with covariance matrices of vector estimates.

(2) In some applications, estimates are vectors but only a part of the vector may be directly observable. For example, the state of a spacecraft may be represented by its position and velocity, but only its position may be directly observable. In such cases, how do we obtain a complete estimate from a partial estimate? Section 5 introduces the Best Linear Unbiased Estimator (BLUE), which is used in Kalman filtering for this purpose. It can be seen as a generalization of the ordinary least squares (OLS) method to problems in which data comes from distributions rather than being a set of discrete points.

Section 6 shows how these results can be used to solve state estimation problems for linear systems, which is the usual context for presenting Kalman filters. First, we consider the problem of state estimation when the entire state is observable, which can be solved using the data fusion results from Sections 3 and 4. Then we consider the more complex problem of state estimation when the state is only partially observable, which requires in addition the BLUE estimator from Section 5. Section 6.3 illustrates Kalman filtering with a concrete example.

2 FORMALIZATION OF ESTIMATES

This section makes precise the notions of estimates and confidence in estimates, which were introduced informally in Section 1.

2.1 Scalar estimates

One way to model the behavior of an agent producing scalar estimates such as house prices is through a probability distribution function (usually shortened to distribution) like the ones shown in Figure 1, in which the x-axis represents values that can be assigned to the house and the y-axis represents the probability density. Each agent has its own distribution, and obtaining an estimate from an agent corresponds to drawing a random sample xi from the distribution for agent i.

[Figure 1: Distributions. Two density curves p1(x) and p2(x) over house values; the x-axis spans values around $500K to $1 million.]

Most presentations of Kalman filters assume distributions are Gaussian, but we assume that we know only the mean µi and the variance σi² of each distribution pi. We write xi : pi ∼ (µi, σi²) to denote that xi is a random sample drawn from distribution pi, which has a mean of µi and a variance of σi². The reciprocal of the variance of a distribution is sometimes called the precision of that distribution.

The informal notion of "confidence in the estimates made by an agent" is quantified by the variance of the distribution from which the estimates are drawn. An experienced agent making high-confidence estimates is modeled by a distribution with a smaller variance than one used to model an inexperienced agent; notice that in Figure 1, the inexperienced agent is "all over the map."

This approach to modeling confidence in estimates may seem nonintuitive since there is no mention of how close the estimates made by an agent are to the actual value of the house. In particular, an agent can make estimates that are very far off from the actual value of the house, but as long as his estimates fall within a narrow range of values, we would still say that we have high confidence in his estimates. In statistics, this is explained by making a distinction between accuracy and precision. Accuracy is a measure of how close an estimate of a quantity is to the true value of that quantity (the true value is sometimes called the ground truth). Precision, on the other hand, is a measure of how close the estimates are to each other, and is defined without reference to ground truth. A metaphor that is often used to explain this difference is shooting at a bullseye. In this case, ground truth is provided by the center of the bullseye. A precise shooter is one whose shots are clustered closely together even if they may be far from the bullseye. In contrast, the shots of an accurate but not precise shooter would be scattered widely in a region surrounding the bullseye. For the problems considered in this paper, there may be no ground truth, and confidence in estimates is related to precision, not accuracy.

The informal notion of getting independent estimates from the two agents is modeled by requiring that estimates x1 and
x2 be uncorrelated; that is, E[(x1 − µ1)(x2 − µ2)] = 0. This is not the same thing as requiring them to be independent random variables, as explained in Appendix 8.1. Lemma 2.1 shows how the mean and variance of a linear combination of pairwise uncorrelated random variables can be computed from the means and variances of the random variables.

Lemma 2.1. Let x1:p1∼(µ1, σ1²), ..., xn:pn∼(µn, σn²) be a set of pairwise uncorrelated random variables. Let y = Σi=1..n αi xi be a random variable that is a linear combination of the xi's. The mean and variance of y are the following:

    µy = Σi=1..n αi µi    (1)
    σy² = Σi=1..n αi² σi²    (2)

Proof. Equation 1 follows from the fact that expectation is a linear operator:

    µy = E[y] = E[Σi αi xi] = Σi αi E[xi] = Σi αi µi.

Equation 2 follows from linearity of the expectation operator and the fact that the estimates are pairwise uncorrelated:

    σy² = E[(y − µy)²] = E[Σi Σj αi αj (xi − µi)(xj − µj)] = Σi Σj αi αj E[(xi − µi)(xj − µj)]

Since the variables x1, ..., xn are pairwise uncorrelated, E[(xi − µi)(xj − µj)] = 0 if i ≠ j, from which the result follows. □

2.2 Vector estimates

In some applications, estimates are vectors. For example, the state of a robot moving along a single dimension might be represented by a vector containing its position and velocity. Similarly, the vital signs of a person might be represented by a vector containing his temperature, pulse rate and blood pressure. In this paper, we denote a vector by a boldfaced lowercase letter, and a matrix by an uppercase letter. The covariance matrix of a random variable x:p(x) with mean µx is the matrix E[(x − µx)(x − µx)^T].

Estimates: An estimate xi is a random sample drawn from a distribution with mean µi and covariance matrix Σi, written as xi : pi ∼ (µi, Σi). The inverse of the covariance matrix, Σi⁻¹, is called the precision or information matrix. Note that if the dimension of xi is one, the covariance matrix reduces to variance.

Uncorrelated estimates: Estimates xi and xj are uncorrelated if E[(xi − µi)(xj − µj)^T] = 0. This is equivalent to saying that every component of xi is uncorrelated with every component of xj.

Lemma 2.2 generalizes Lemma 2.1.

Lemma 2.2. Let x1:p1∼(µ1, Σ1), ..., xn:pn∼(µn, Σn) be a set of pairwise uncorrelated random vectors of length m. Let y = Σi=1..n Ai xi. Then, the mean and covariance matrix of y are the following:

    µy = Σi=1..n Ai µi    (3)
    Σy = Σi=1..n Ai Σi Ai^T    (4)

Proof. The proof is similar to the proof of Lemma 2.1. Equation 3 follows from the linearity of the expectation operator. Equation 4 can be proved as follows:

    Σy = E[(y − µy)(y − µy)^T] = E[Σi Σj Ai (xi − µi)(xj − µj)^T Aj^T] = Σi Σj Ai E[(xi − µi)(xj − µj)^T] Aj^T

The variables x1, ..., xn are pairwise uncorrelated, therefore E[(xi − µi)(xj − µj)^T] = 0 if i ≠ j, from which the result follows. □

3 FUSING SCALAR ESTIMATES

Section 3.1 discusses the problem of fusing two scalar estimates. Section 3.2 generalizes this to the problem of fusing n>2 scalar estimates. Section 3.3 shows that fusing n>2 estimates can be done iteratively by fusing two estimates at a time without any loss of quality in the final estimate.

3.1 Fusing two scalar estimates

We now consider the problem of choosing the optimal value of the parameter α in the formula yα(x1, x2) = (1−α)∗x1 + α∗x2 for fusing uncorrelated estimates x1 and x2. How should optimality be defined? One reasonable definition is that the optimal value of α minimizes the variance of yα(x1, x2); since confidence in an estimate is inversely proportional to the variance of the distribution from which the estimates are drawn, this definition of optimality will produce the highest-confidence fused estimates. The variance of yα(x1, x2) is called the mean square error (MSE) of that estimator, and it obviously depends on α; the minimum value of this variance as α is varied is called the minimum mean square error (MMSE) below.

Theorem 3.1. Let x1:p1(x)∼(µ1, σ1²) and x2:p2(x)∼(µ2, σ2²) be uncorrelated estimates, and suppose they are fused using the formula yα(x1, x2) = (1−α)∗x1 + α∗x2. The value of MSE(yα) is minimized for α = σ1² / (σ1² + σ2²).

Proof. From Lemma 2.1,

    σy²(α) = (1−α)²∗σ1² + α²∗σ2².    (5)
Differentiating σy²(α) with respect to α and setting the derivative to zero proves the result. The second derivative, 2(σ1² + σ2²), is positive, showing that σy²(α) reaches a minimum at this point. In the literature, this optimal value of α is called the Kalman gain K. □

Substituting K into the linear fusion model, we get the optimal linear estimator ŷ(x1, x2):

    ŷ(x1, x2) = (σ2² / (σ1² + σ2²))∗x1 + (σ1² / (σ1² + σ2²))∗x2

As a step towards fusion of n>2 estimates, it is useful to rewrite this as follows:

    ŷ(x1, x2) = ((1/σ1²) / (1/σ1² + 1/σ2²))∗x1 + ((1/σ2²) / (1/σ1² + 1/σ2²))∗x2    (6)

Substituting K into Equation 5 gives the following expression for the variance of ŷ:

    σŷ² = 1 / (1/σ1² + 1/σ2²)    (7)

The expressions for ŷ and σŷ² are complicated because they contain the reciprocals of variances. If we let ν1 and ν2 denote the precisions of the two distributions, the expressions for ŷ and νŷ can be written more simply as follows:

    ŷ(x1, x2) = (ν1 / (ν1 + ν2))∗x1 + (ν2 / (ν1 + ν2))∗x2    (8)
    νŷ = ν1 + ν2    (9)

These results say that the weight we should give to an estimate is proportional to the confidence we have in that estimate, which is intuitively reasonable. Note that if µ1 = µ2, the expectation E[yα] is µ1 (= µ2) regardless of the value of α. In this case, yα is said to be an unbiased estimator, and the optimal value of α minimizes the variance of the unbiased estimator.

3.2 Fusing multiple scalar estimates

The approach in Section 3.1 can be generalized to optimally fuse multiple pairwise uncorrelated estimates x1, x2, ..., xn. Let yα(x1, .., xn) denote the linear estimator given parameters α1, .., αn, which we denote by α.

Theorem 3.2. Let pairwise uncorrelated estimates xi (1≤i≤n) drawn from distributions pi(x)∼(µi, σi²) be fused using the linear model yα(x1, .., xn) = Σi=1..n αi xi, where Σi=1..n αi = 1. The value of MSE(yα) is minimized for

    αi = (1/σi²) / (Σj=1..n 1/σj²).

Proof. From Lemma 2.1, σy²(α) = Σi=1..n αi² σi². To find the values of αi that minimize the variance σy² under the constraint that the αi's sum to 1, we use the method of Lagrange multipliers. Define

    f(α1, ..., αn) = Σi=1..n αi² σi² + λ (Σi=1..n αi − 1)

where λ is the Lagrange multiplier. Taking the partial derivatives of f with respect to each αi and setting these derivatives to zero, we find α1σ1² = α2σ2² = ... = αnσn² = −λ/2. From this, and the fact that the sum of the αi's is 1, the result follows. □

The variance is given by the following expression:

    σŷ² = 1 / (Σi=1..n 1/σi²).    (10)

As in Section 3.1, these expressions are more intuitive if the variance is replaced with precision.

    ŷ(x1, .., xn) = Σi=1..n (νi / (ν1 + ... + νn)) ∗ xi    (11)
    νŷ = Σi=1..n νi    (12)

Equations 11 and 12 generalize Equations 8 and 9.

3.3 Incremental fusing is optimal

In many applications, the estimates x1, x2, ..., xn become available successively over a period of time. While it is possible to store all the estimates and use Equations 11 and 12 to fuse all the estimates from scratch whenever a new estimate becomes available, it is possible to save both time and storage if one can do this fusion incrementally. In this section, we show that just as a sequence of numbers can be added by keeping a running sum and adding the numbers to this running sum one at a time, a sequence of n>2 estimates can be fused by keeping a "running estimate" and fusing estimates from the sequence one at a time into this running estimate without any loss in the quality of the final estimate. In short, we want to show that ŷ(x1, x2, ..., xn) = ŷ(ŷ(...ŷ(x1, x2), x3), ..., xn). Note that this is not the same thing as showing that ŷ, interpreted as a binary function, is associative.

Figure 2 shows the process of incrementally fusing estimates. Imagine that time progresses from left to right in this picture. Estimate x1 is available initially, and the other estimates xi become available in succession; the precision of each estimate is shown in parentheses next to each estimate. When the estimate x2 becomes available, it is fused with x1 using Equation 8. In Figure 2, the labels on the edges connecting x1 and x2 to ŷ(x1, x2) are the weights given to these estimates in Equation 8.
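The precision-weighted fusion of Equations 8-9 and 11-12 takes only a few lines of code. The sketch below uses made-up numbers, and the helper names are ours, not the paper's; it also checks empirically that incremental fusion (Section 3.3) matches fusing all estimates from scratch:

```python
# Precision-weighted fusion of uncorrelated scalar estimates.
# fuse2 implements Equations 8-9; fuse_all implements Equations 11-12.
def fuse2(x1, v1, x2, v2):
    """Fuse two estimates given as (value, precision) pairs."""
    return (v1 * x1 + v2 * x2) / (v1 + v2), v1 + v2

def fuse_all(xs, vs):
    """Fuse n estimates in one shot (Equations 11-12)."""
    total = sum(vs)
    return sum(v * x for x, v in zip(xs, vs)) / total, total

# Incremental fusion: fold fuse2 over the estimates one at a time...
xs = [500_000.0, 1_000_000.0, 800_000.0]          # three agents' estimates
vs = [1 / 100.0**2, 1 / 50.0**2, 1 / 80.0**2]     # their precisions (1/variance)
y, v = xs[0], vs[0]
for x, p in zip(xs[1:], vs[1:]):
    y, v = fuse2(y, v, x, p)

# ...and the running estimate agrees with batch fusion, as Section 3.3 claims.
print(abs(y - fuse_all(xs, vs)[0]) < 1e-6)  # True
```

Note how the most precise estimate (the one with variance 50²) dominates the fused value, exactly as the theory predicts.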
[Figure 2: Dataflow graph for incremental fusion. Time flows from left to right: x1 (precision ν1) is fused with x2 (precision ν2) to produce ŷ(x1, x2) with precision ν1+ν2, which is then fused with x3, and so on up to xn; the label on each edge is the weight given to that input at that fusion step.]

When estimate x3 becomes available, it is fused with ŷ(x1, x2) using Equation 8; as before, the labels on the edges correspond to the weights given to ŷ(x1, x2) and x3 when they are fused to produce ŷ(ŷ(x1, x2), x3).

The contribution of xi to the final value ŷ(ŷ(...ŷ(x1, x2), x3), ..., xn) is given by the product of the weights on the path from xi to the final value in Figure 2. As shown below, this product has the same value as the weight of xi in Equation 11, showing that incremental fusion is optimal.

    (νi / (ν1 + ... + νi)) ∗ ((ν1 + ... + νi) / (ν1 + ... + νi+1)) ∗ ... ∗ ((ν1 + ... + νn−1) / (ν1 + ... + νn)) = νi / (ν1 + ... + νn)

3.4 Summary

The main result in this section can be summarized informally as follows. When using a linear model to fuse uncertain scalar estimates, the weight given to each estimate should be inversely proportional to the variance of that estimate. Furthermore, when fusing n>2 estimates, estimates can be fused incrementally without any loss in the quality of the final result. More formally, the results in this section for fusing scalar estimates are often expressed in terms of the Kalman gain K, as shown below; these equations can be applied recursively to fuse multiple estimates.

    x1 : p1∼(µ1, σ1²),  x2 : p2∼(µ2, σ2²)

    K = σ1² / (σ1² + σ2²) = ν2 / (ν1 + ν2)    (13)
    ŷ(x1, x2) = x1 + K∗(x2 − x1)    (14)
    µŷ = µ1 + K∗(µ2 − µ1)    (15)
    σŷ² = σ1² − K∗σ1²,  or  νŷ = ν1 + ν2    (16)

4 FUSING VECTOR ESTIMATES

This section addresses the problem of fusing estimates when the estimates are vectors. Although the math is more complicated, the conclusion is that the results in Section 3 for fusing scalar estimates can be extended to the vector case simply by replacing variances with covariance matrices, as shown in this section.

4.1 Fusing multiple vector estimates

For vectors, the linear data fusion model is

    yA(x1, x2, .., xn) = Σi=1..n Ai xi  where  Σi=1..n Ai = I.    (17)

Here A stands for the matrix parameters (A1, ..., An). All the vectors are assumed to be of the same length. If the means µi of the random variables xi are identical, this is an unbiased estimator.

Optimality: The MSE in this case is the expected value of the two-norm of (yA − µyA), which is E[(yA − µyA)^T (yA − µyA)]. Note that if the vectors have length 1, this reduces to variance. The parameters A1, ..., An in the linear data fusion model are chosen to minimize this MSE.

Theorem 4.1 generalizes Theorem 3.2 to the vector case. The proof of this theorem uses matrix derivatives and is given in Appendix 8.3 since it is not needed for understanding the rest of this paper. What is important is to compare Theorems 4.1 and 3.2 and notice that the expressions are similar, the main difference being that the role of variance in the scalar case is played by the covariance matrix in the vector case.

Theorem 4.1. Let pairwise uncorrelated estimates xi (1≤i≤n) drawn from distributions pi(x)∼(µi, Σi) be fused using the linear model yA(x1, .., xn) = Σi=1..n Ai xi, where Σi=1..n Ai = I. The MSE(yA) is minimized for

    Ai = (Σj=1..n Σj⁻¹)⁻¹ Σi⁻¹.    (18)

The covariance matrix of the optimal estimator ŷ can be determined by substituting the optimal Ai values into the expression for Σy in Lemma 2.2.

    Σŷ = (Σj=1..n Σj⁻¹)⁻¹    (19)

In the vector case, precision is the inverse of a covariance matrix, and is denoted by N. Equations 20-21 use precision to express the optimal estimator and its covariance, and generalize Equations 11-12 to the vector case.

    ŷ(x1, ..., xn) = Σi=1..n (Σj=1..n Nj)⁻¹ Ni xi    (20)
    Nŷ = Σj=1..n Nj    (21)

As in the scalar case, fusion of n>2 vector estimates can be done incrementally without loss of precision. The proof is similar to the one in Section 3.3, and is omitted.

There are several equivalent expressions for the Kalman gain for the fusion of two estimates.
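Equations 18-21 can be sketched directly with matrix arithmetic. The estimates and covariances below are made-up illustrative numbers, and the function name is ours; the final check confirms that, for two estimates, the precision-weighted form agrees with the Kalman-gain form K = Σ1(Σ1 + Σ2)⁻¹ of Equation 22:

```python
import numpy as np

# Optimal fusion of n uncorrelated vector estimates (Equations 18-21).
def fuse_vectors(xs, Sigmas):
    Ns = [np.linalg.inv(S) for S in Sigmas]               # precision matrices
    Sigma_fused = np.linalg.inv(sum(Ns))                  # Equation 19
    y = Sigma_fused @ sum(N @ x for N, x in zip(Ns, xs))  # Equation 20
    return y, Sigma_fused

x1, S1 = np.array([1.0, 0.0]), np.diag([4.0, 1.0])
x2, S2 = np.array([2.0, 1.0]), np.diag([1.0, 1.0])
y, Sy = fuse_vectors([x1, x2], [S1, S2])
print(y)   # each component is weighted by the per-component precisions

# For two estimates this matches the Kalman-gain form y = x1 + K(x2 - x1).
K = S1 @ np.linalg.inv(S1 + S2)
print(np.allclose(y, x1 + K @ (x2 - x1)))  # True
```

In the first component, where S1 has the larger variance, the fused value is pulled strongly toward x2, mirroring the scalar intuition.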
The following one, which is easily derived from Equation 18, is the vector analog of Equation 13:

    K = Σ1 (Σ1 + Σ2)⁻¹    (22)

The covariance matrix of ŷ can be written as follows.

    Σŷ = Σ1 (Σ1 + Σ2)⁻¹ Σ2    (23)
       = K Σ2 = Σ1 − K Σ1    (24)

4.2 Summary

The results in this section can be summarized in terms of the Kalman gain K as follows:

    x1 : p1∼(µ1, Σ1),  x2 : p2∼(µ2, Σ2)

    K = Σ1 (Σ1 + Σ2)⁻¹ = (N1 + N2)⁻¹ N2    (25)
    ŷ(x1, x2) = x1 + K (x2 − x1)    (26)
    µŷ = µ1 + K (µ2 − µ1)    (27)
    Σŷ = Σ1 − K Σ1,  or  Nŷ = N1 + N2    (28)

5 BEST LINEAR UNBIASED ESTIMATOR (BLUE)

In some applications, estimates are vectors but only part of the vector may be given to us directly, and it is necessary to estimate the hidden portion. This section introduces a statistical method called the Best Linear Unbiased Estimator (BLUE).

[Figure 3: BLUE line corresponding to Equation 29. A grey confidence ellipse of the joint distribution of x and y, centered at (µx, µy), with the line (ŷ − µy) = Σyx Σxx⁻¹ (x − µx) running through it; for a given value x1, the point ŷ1 on the line is the estimate of y.]

Consider the general problem of determining a value for vector y given a value for a vector x. If there is a functional relationship between x and y (say y = F(x) and F is given), it is easy to compute y given a value for x. In our context however, x and y are random variables, so such a precise functional relationship will not hold. The best we can do is to estimate the likely value of y, given a value of x and the information we have about how x and y are correlated.

Figure 3 shows an example in which x and y are scalar-valued random variables. The grey ellipse in this figure, called a confidence ellipse, is a projection of the joint distribution of x and y onto the (x, y) plane that shows where some large proportion of the (x, y) values are likely to be. For a given value x1, there are in general many points (x1, y) that lie within this ellipse, but these y values are clustered around the line shown in the figure, so the value ŷ1 is a reasonable estimate for the value of y corresponding to x1. This line, called the best linear unbiased estimator (BLUE), is the analog of ordinary least squares (OLS) for distributions. Given a set of discrete points (xi, yi) where each xi and yi is a scalar, OLS determines the "best" linear relationship between these points, where best is defined as minimizing the square error between the predicted and actual values of yi. This relation can then be used to predict a value for y given a value for x. The BLUE estimator presented below generalizes this to the case when x and y are vectors, and are random variables obtained from distributions instead of a set of discrete points.

5.1 Computing BLUE

Let x:px∼(µx, Σxx) and y:py∼(µy, Σyy) be random variables. Consider a linear estimator ŷA,b(x) = Ax + b. How should we choose A and b? As in the OLS approach, we can pick values that minimize the MSE between the random variable y and the estimate ŷ.

    MSEA,b(ŷ) = E[(y − ŷ)^T (y − ŷ)]
              = E[(y − (Ax + b))^T (y − (Ax + b))]
              = E[y^T y − 2 y^T (Ax + b) + (Ax + b)^T (Ax + b)]

Setting the partial derivatives of MSEA,b(ŷ) with respect to b and A to zero, we find that b = µy − Aµx and A = Σyx Σxx⁻¹, where Σyx is the covariance between y and x. Therefore, the best linear estimator is

    ŷ = µy + Σyx Σxx⁻¹ (x − µx)    (29)

This is an unbiased estimator because the mean of ŷ is equal to µy. Note that if Σyx = 0 (that is, x and y are uncorrelated), the best estimate of y is just µy, so knowing the value of x does not give us any additional information about y, as one would expect. In Figure 3, this corresponds to the case when the BLUE line is parallel to the x-axis. At the other extreme, suppose that y and x are functionally related, so y = Cx. In that case, it is easy to see that Σyx = CΣxx, so ŷ = Cx as expected. In Figure 3, this corresponds to the case when the confidence ellipse shrinks down to the BLUE line.

The covariance matrix of this estimator is the following:

    Σŷ = Σyy − Σyx Σxx⁻¹ Σxy    (30)
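Equations 29 and 30 can be sketched in a few lines. The means and covariances below are made-up illustrative numbers (the 1x1 matrices keep the example scalar, matching Figure 3):

```python
import numpy as np

# Best Linear Unbiased Estimator (Equation 29): estimate y from an
# observed x using only means and covariances.
mu_x, mu_y = np.array([2.0]), np.array([3.0])
S_xx = np.array([[4.0]])   # variance of x
S_yx = np.array([[2.0]])   # covariance between y and x
S_yy = np.array([[5.0]])   # variance of y

def blue(x):
    return mu_y + S_yx @ np.linalg.inv(S_xx) @ (x - mu_x)  # Equation 29

print(blue(np.array([4.0])))  # pulled above mu_y, since x > mu_x and S_yx > 0

# Residual uncertainty in y after observing x (Equation 30).
S_y_given_x = S_yy - S_yx @ np.linalg.inv(S_xx) @ S_yx.T
print(S_y_given_x)            # smaller than S_yy: observing x reduces uncertainty
```

With S_yx = 0 the estimator returns mu_y unchanged and the covariance reduction vanishes, matching the uncorrelated case discussed above.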
Intuitively, knowing the value of x permits us to reduce the uncertainty in the value of y by an additive term that depends on how strongly y and x are correlated.

It is easy to show that Equation 29 is a generalization of ordinary least squares in the sense that if we compute the means and variances of a set of discrete data (xi, yi) and substitute them into Equation 29, we get the same line that is obtained by using OLS.

6 KALMAN FILTERS FOR LINEAR SYSTEMS

In this section, we apply the algorithms developed in Sections 3-5 to the particular problem of estimating the state of linear systems, which is the classical application of Kalman filtering.

Figure 4(a) shows how the evolution over time of the state of such a system can be computed if the initial state x0 and the model of the system dynamics are known precisely. Time advances in discrete steps. The state of the system at any time step is a function of the state of the system at the previous time step and the control inputs applied to the system during that interval. This is usually expressed by an equation of the form xt+1 = ft(xt, ut), where ut is the control input. The function ft is nonlinear in the general case, and can be different for different steps. If the system is linear, the relation for state evolution over time can be written as xt+1 = Ft xt + Bt ut, where Ft and Bt are (time-dependent) matrices that can be determined from the physics of the system. Therefore, if the initial state x0 is known exactly and the system dynamics are modeled perfectly by the Ft and Bt matrices, the evolution of the state with time can be computed precisely.

In general however, we may not know the initial state exactly, and the system dynamics and control inputs may not be known precisely. These inaccuracies may cause the computed and actual states to diverge unacceptably over time. To avoid this, we can make measurements of the state after each time step. If these measurements were exact and the entire state could be observed after each time step, there would of course be no need to model the system dynamics. However, in general, (i) the measurements themselves are imprecise, and (ii) some components of the state may not be directly observable by measurements.

6.1 Fusing complete observations of the state

If the entire state can be observed through measurements, we have two imprecise estimates for the state after each time step, one from the model of the system dynamics and one from the measurement. If these estimates are uncorrelated and their covariance matrices are known, we can use Equations 25-28 to fuse these estimates and compute the covariance matrix of this fused estimate. The fused estimate of the state is fed into the model of the system to compute the estimate of the state and its covariance at the next time step, and the entire process is repeated.

Figure 4(b) shows the dataflow diagram of this computation. For each state xt+1 in the precise computation of Figure 4(a), there are three random variables in Figure 4(b): the estimate from the model of the dynamical system, denoted by xt+1|t, the estimate from the measurement, denoted by zt+1, and the fused estimate, denoted by xt+1|t+1. Intuitively, the notation xt+1|t stands for the estimate of the state at time (t+1) given the information available at time t, and it is often referred to as the a priori estimate. Similarly, xt+1|t+1 is the corresponding estimate given the information available at time (t+1), which includes information from the measurement, and is often referred to as the a posteriori estimate. To set up this computation, we introduce the following notation.

• The initial state is denoted by x0 and its covariance by Σ0|0.
• Uncertainty in the system model and control inputs is represented by making xt+1|t a random variable and introducing a zero-mean noise term wt into the state evolution equation, which becomes

    xt+1|t = Ft xt|t + Bt ut + wt    (31)

  The covariance matrix of wt is denoted by Qt, and wt is assumed to be uncorrelated with xt|t.
• The imprecise measurement at time t+1 is modeled by a random variable zt+1 = xt+1 + vt+1, where vt+1 is a noise term. vt+1 has a covariance matrix Rt+1 and is uncorrelated with xt+1|t.

Examining Figure 4(c), we see that if we can compute Σt+1|t, the covariance matrix of xt+1|t, from Σt|t, we have everything we need to implement the vector data fusion technique described in Section 4. This can be done by applying Equation 4, which tells us that Σt+1|t = Ft Σt|t Ft^T + Qt. This equation propagates uncertainty from the input of ft to its output.

Figure 4(c) puts all the pieces together. Although the computations appear to be complicated, the key thing to note is that they are a direct application of the vector data fusion technique of Section 4 to the special case of linear systems.

6.2 Fusing partial observations of the state

If some components of the state cannot be measured directly, the prediction phase remains unchanged from Section 6.1, but the fusion phase is different and can be understood intuitively in terms of the following steps.
[Figure 4 consists of four dataflow diagrams:

(a) Discrete-time dynamical system: the state evolves as x_0 → f_0 → x_1 → ... → f_t → x_{t+1}.

(b) Dynamical system with uncertainty: each a posteriori estimate x_{t|t} (covariance Σ_{t|t}) is pushed through f_t to produce the a priori estimate x_{t+1|t} (covariance Σ_{t+1|t}), which is then fused with the measurement z_{t+1} (covariance R_{t+1}) to produce x_{t+1|t+1} (covariance Σ_{t+1|t+1}).

(c) Implementation of the dataflow diagram (b):
    Predictor:  x_{t+1|t} = F_t x_{t|t} + B_t u_t
                Σ_{t+1|t} = F_t Σ_{t|t} F_t^T + Q_t
    Fusion:     K_{t+1} = Σ_{t+1|t} (Σ_{t+1|t} + R_{t+1})^{-1}
                x_{t+1|t+1} = x_{t+1|t} + K_{t+1} (z_{t+1} - x_{t+1|t})
                Σ_{t+1|t+1} = (I - K_{t+1}) Σ_{t+1|t}

(d) Implementation of the dataflow diagram (b) for systems with partial observability:
    Predictor:  x_{t+1|t} = F_t x_{t|t} + B_t u_t
                Σ_{t+1|t} = F_t Σ_{t|t} F_t^T + Q_t
    Fusion:     K_{t+1} = Σ_{t+1|t} H^T (H Σ_{t+1|t} H^T + R_{t+1})^{-1}
                x_{t+1|t+1} = x_{t+1|t} + K_{t+1} (z_{t+1} - H x_{t+1|t})
                Σ_{t+1|t+1} = (I - K_{t+1} H) Σ_{t+1|t}]
Figure 4: State estimation using Kalman filtering
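The recurrences in Figure 4(c) can be sketched in code. The following is a minimal sketch for a one-dimensional state, where F_t, B_t, Q_t, R_{t+1}, the covariances, and the Kalman gain all reduce to scalars; the function name and the numeric values are ours, chosen for illustration.

```python
# One predict-and-fuse step of the recurrence in Figure 4(c), specialized
# to a one-dimensional state so that every matrix becomes a scalar.
# Variable names follow the figure.

def kalman_step(x, sigma2, z, r2, F=1.0, B=0.0, u=0.0, Q=0.0):
    """One time step: predict with the model, then fuse the measurement.

    x, sigma2 : a posteriori estimate and variance from the previous step
    z, r2     : measurement and its variance
    Returns the new a posteriori estimate and variance.
    """
    # Predictor: push the estimate and its uncertainty through the model.
    x_pred = F * x + B * u              # x_{t+1|t} = F_t x_{t|t} + B_t u_t
    s_pred = F * sigma2 * F + Q         # Σ_{t+1|t} = F_t Σ_{t|t} F_t^T + Q_t

    # Fusion: the Kalman gain weighs prediction against measurement.
    K = s_pred / (s_pred + r2)          # K_{t+1} = Σ_{t+1|t}(Σ_{t+1|t} + R_{t+1})^{-1}
    x_post = x_pred + K * (z - x_pred)  # x_{t+1|t+1}
    s_post = (1 - K) * s_pred           # Σ_{t+1|t+1} = (I - K_{t+1}) Σ_{t+1|t}
    return x_post, s_post

x, s = kalman_step(x=10.0, sigma2=4.0, z=12.0, r2=4.0)
# With equal variances, the fused estimate is the average of prediction
# and measurement (11.0) and the fused variance is halved (2.0).
```

Note how the gain interpolates between the two estimates: when the measurement variance r2 is large, K is small and the prediction dominates; when the model is uncertain, K approaches 1 and the measurement dominates.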
using the techniques developed in Sections 3-4. The result is the a posteriori estimate of the observable state.
(ii) The BLUE estimator in Section 5 is used to obtain the a posteriori estimate of the hidden state from the a posteriori estimate of the observable state.
(iii) The a posteriori estimates of the observable and hidden portions of the state are composed to produce the a posteriori estimate of the entire state.

The actual implementation produces the final result directly without going through these steps, as shown in Figure 4(d), but these incremental steps are useful for understanding how all this works.

6.2.1 Example: 2D state. Figure 5 illustrates these steps for a two-dimensional problem in which the state-vector has two components, and only the first component can be measured directly. We use the simplified notation below to focus on the essentials.

• a priori state estimate: x_i = (h_i, c_i)^T
• covariance matrix of a priori estimate: Σ_i = [σ_h^2, σ_hc; σ_ch, σ_c^2]
• a posteriori state estimate: x_o = (h_o, c_o)^T
• measurement: z
• variance of measurement: r^2

The three steps discussed above for obtaining the a posteriori estimate involve the following calculations, shown pictorially in Figure 5.

(i) The a priori estimate of the observable state is h_i. The a posteriori estimate is obtained from Equation 14.
h_o = h_i + (σ_h^2/(σ_h^2 + r^2)) (z - h_i) = h_i + K_h (z - h_i)

(ii) The a priori estimate of the hidden state is c_i. The a posteriori estimate is obtained from Equation 29:

c_o = c_i + (σ_hc/σ_h^2) (σ_h^2/(σ_h^2 + r^2)) (z - h_i) = c_i + (σ_hc/(σ_h^2 + r^2)) (z - h_i) = c_i + K_c (z - h_i)

(iii) Putting these together, we get

(h_o, c_o)^T = (h_i, c_i)^T + (K_h, K_c)^T (z - h_i)

As a step towards generalizing this result in Section 6.2.2, it is useful to rewrite this expression using matrices. Define H = [1 0] and R = r^2. Then x_o = x_i + K(z - H x_i), where K = Σ_i H^T (H Σ_i H^T + R)^{-1}.

[Figure 5 shows this construction pictorially: the measurement z (variance r^2) moves the observable coordinate from h_i to h_o = h_i + K_h(z - h_i) and the hidden coordinate from c_i to c_o = c_i + K_c(z - h_i), taking the estimate from x_i (covariance Σ_i) to x_o (covariance Σ_o).]

Figure 5: Computing the a posteriori estimate when part of the state is not observable

6.2.2 General case. In general, suppose that the observable portion of a state estimate x is given by H x, where H is a full row-rank matrix. Let C be a basis for the orthogonal complement of H (in the 2D example in Section 6.2.1, H = (1 0) and C = (0 1)). The covariance between C x and H x is easily shown to be C Σ H^T, where Σ is the covariance matrix of x. The three steps for computing the a posteriori estimate from the a priori estimate in the general case are the following.

(i) The a priori estimate of the observable state is H x_{t+1|t}. The a posteriori estimate is obtained from Equation 26:

H x_{t+1|t+1} = H x_{t+1|t} + H Σ_{t+1|t} H^T (H Σ_{t+1|t} H^T + R_{t+1})^{-1} (z_{t+1} - H x_{t+1|t})

Let K_{t+1} = Σ_{t+1|t} H^T (H Σ_{t+1|t} H^T + R_{t+1})^{-1}. The a posteriori estimate of the observable state can be written in terms of K_{t+1} as follows:

H x_{t+1|t+1} = H x_{t+1|t} + H K_{t+1} (z_{t+1} - H x_{t+1|t})    (32)

(ii) The a priori estimate of the hidden state is C x_{t+1|t}. The a posteriori estimate is obtained from Equation 29:

C x_{t+1|t+1} = C x_{t+1|t} + (C Σ_{t+1|t} H^T)(H Σ_{t+1|t} H^T)^{-1} H K_{t+1} (z_{t+1} - H x_{t+1|t})
             = C x_{t+1|t} + C K_{t+1} (z_{t+1} - H x_{t+1|t})    (33)

(iii) Putting the a posteriori estimates (32) and (33) together,

[H; C] x_{t+1|t+1} = [H; C] x_{t+1|t} + [H; C] K_{t+1} (z_{t+1} - H x_{t+1|t})

Since [H; C] is invertible, it can be canceled from the left and right hand sides, giving the equation

x_{t+1|t+1} = x_{t+1|t} + K_{t+1} (z_{t+1} - H x_{t+1|t})    (34)

To compute Σ_{t+1|t+1}, the covariance of x_{t+1|t+1}, note that x_{t+1|t+1} = (I - K_{t+1} H) x_{t+1|t} + K_{t+1} z_{t+1}. Since x_{t+1|t} and z_{t+1} are uncorrelated, it follows from Lemma 2.2 that

Σ_{t+1|t+1} = (I - K_{t+1} H) Σ_{t+1|t} (I - K_{t+1} H)^T + K_{t+1} R_{t+1} K_{t+1}^T

Substituting the value of K_{t+1} and simplifying, we get

Σ_{t+1|t+1} = (I - K_{t+1} H) Σ_{t+1|t}    (35)

Figure 4(d) puts all this together. Note that if the entire state is observable, H = I and the computations in Figure 4(d) reduce to those of Figure 4(c), as expected.

6.3 An example
To illustrate the concepts discussed in this section, we use a simple example in which a car starts from the origin at time t=0 and moves right along the x-axis with an initial speed of 100 m/sec and a constant deceleration of 5 m/sec^2 until it comes to rest.

The state-vector of the car has two components, one for the distance from the origin d(t) and one for the speed v(t). If time is discretized in steps of 0.25 seconds, the difference equation for the dynamics of the system is easily shown to be the following:

[d_{n+1}; v_{n+1}] = [1, 0.25; 0, 1] [d_n; v_n] + [0.03125; 0.25] (-5)    (36)

where [d_0; v_0] = [0; 100].

The gray lines in Figure 6 show the evolution of velocity and position with time according to this model. Because of uncertainty in modeling the system dynamics, the actual evolution of the velocity and position will be different in practice. The red lines in Figure 6 show one trajectory for this evolution, corresponding to a Gaussian noise term with covariance [4, 2; 2, 4] in Equation 31 (because this noise term is random, there are many trajectories for the evolution, and
[Figure 6 contains two plots of the car example: the left panel plots velocity against t and the right panel plots distance against t, for t from 0 to 20 seconds. Each panel shows the ground truth, the model-only prediction, and the estimated (a posteriori) trajectory; the velocity panel also shows the noisy measurements.]

Figure 6: Estimates of the car's state over time
we are just showing one of them). The red lines correspond to "ground truth" in our experiments.

Suppose that only the velocity is directly observable. The green points in Figure 6 show the noisy measurements at different time-steps, assuming the noise is modeled by a Gaussian with variance 8. The blue lines show the a posteriori estimates of the velocity and position. It can be seen that the a posteriori estimates track the ground truth quite well even when the ideal system model (the gray lines) is inaccurate and the measurements are noisy.

6.4 Discussion
Equation 34 shows that the a posteriori state estimate is a linear combination of the a priori state estimate (x_{t+1|t}) and the estimate of the observable state from the measurement (z_{t+1}). Note however that, unlike in Sections 3 and 4, the Kalman gain is not a dimensionless value if the entire state is not observable.

Although the optimality results of Sections 3-5 were used in the derivation of this equation, it does not follow that Equation 34 is the optimal unbiased linear estimator for combining these two estimates. However, this is easy to show using the MSE minimization technique that we have used repeatedly in this paper: assume that the a posteriori estimator is of the form K_1 x_{t+1|t} + K_2 z_{t+1}, and find the values of K_1 and K_2 that produce an unbiased estimator with minimum MSE. In fact, this is the point of departure for standard presentations of this material. The advantage of the presentation given here is that it exposes the general statistical concepts that underlie Kalman filtering.

Most presentations in the literature also begin by assuming that the noise terms w_t in the state evolution equation and v_t in the measurement are Gaussian. Some presentations [1, 9] use properties of Gaussians to derive the results in Sections 3-5, although as we have seen, these results do not depend on the distributions being Gaussian.

Gaussians however enter the picture in a deeper way if one considers nonlinear estimators. It can be shown that if the noise terms are not Gaussian, there may be nonlinear estimators whose MSE is lower than that of the linear estimator presented in Figure 4(d), but that if the noise is Gaussian, the linear estimator is as good as any unbiased nonlinear estimator (that is, the linear estimator is a minimum variance unbiased estimator (MVUE)). This result is proved using the Cramér-Rao lower bound [20]. The presentation in this paper is predicated on the belief that this property is not needed to understand what Kalman filtering does.

7 CONCLUSIONS
Kalman filtering is a classic state estimation technique used widely in engineering applications. Although there are many presentations of this technique in the literature, most of them are focused on particular applications like robot motion, which can make it difficult to understand how to apply Kalman filtering to other problems. In this paper, we have shown that two statistical ideas - fusion of uncertain estimates and unbiased linear estimators for correlated variables - provide the conceptual underpinnings for Kalman filtering. By combining these ideas, standard results on Kalman filtering in linear systems can be derived in an intuitive and straightforward way that is simpler than other presentations of this material in the literature.

We believe the approach presented in this paper makes it easier to understand the concepts that underlie Kalman filtering and to apply it to new problems in computer systems.

ACKNOWLEDGMENTS
This research was supported by NSF grants 1337281, 1406355, and 1618425, and by DARPA contracts FA8750-16-2-0004 and FA8650-15-C-7563. The authors would like to thank Ivo Babuska (UT Austin), Gianfranco Bilardi (Padova), Augusto Ferrante (Padova), Jayadev Misra (UT Austin), Scott Niekum
(UT Austin), Charlie van Loan (Cornell), Etienne Vouga (UT Austin) and Peter Stone (UT Austin) for their feedback on this paper.

REFERENCES
[1] Tim Babb. 2015. How a Kalman filter works, in pictures. http://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/.
[2] A. V. Balakrishnan. 1987. Kalman Filtering Theory. Optimization Software, Inc., Los Angeles, CA, USA.
[3] Allen L. Barker, Donald E. Brown, and Worthy N. Martin. 1994. Bayesian Estimation and the Kalman Filter. Technical Report. Charlottesville, VA, USA.
[4] K. Bergman. 2009. Nanophotonic Interconnection Networks in Multicore Embedded Computing. In 2009 IEEE/LEOS Winter Topicals Meeting Series. 6–7. https://doi.org/10.1109/LEOSWT.2009.4771628
[5] Liyu Cao and Howard M. Schwartz. 2004. Analysis of the Kalman Filter Based Estimation Algorithm: An Orthogonal Decomposition Approach. Automatica 40, 1 (Jan. 2004), 5–19. https://doi.org/10.1016/j.automatica.2003.07.011
[6] Charles K. Chui and Guanrong Chen. 2017. Kalman Filtering: With Real-Time Applications (5th ed.). Springer Publishing Company, Incorporated.
[7] R. L. Eubank. 2005. A Kalman Filter Primer (Statistics: Textbooks and Monographs). Chapman & Hall/CRC.
[8] Geir Evensen. 2006. Data Assimilation: The Ensemble Kalman Filter. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[9] Rodney Faragher. 2012. Understanding the basis of the Kalman filter via a simple and intuitive derivation. IEEE Signal Processing Magazine 128 (September 2012).
[10] Mohinder S. Grewal and Angus P. Andrews. 2014. Kalman Filtering: Theory and Practice with MATLAB (4th ed.). Wiley-IEEE Press.
[11] Anne-Kathrin Hess and Anders Rantzer. 2010. Distributed Kalman Filter Algorithms for Self-localization of Mobile Devices. In Proceedings of the 13th ACM International Conference on Hybrid Systems: Computation and Control (HSCC '10). ACM, New York, NY, USA, 191–200. https://doi.org/10.1145/1755952.1755980
[12] Connor Imes and Henry Hoffmann. 2016. Bard: A Unified Framework for Managing Soft Timing and Power Constraints. In International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS).
[13] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann. 2015. POET: A Portable Approach to Minimizing Energy Under Soft Real-time Constraints. In 21st IEEE Real-Time and Embedded Technology and Applications Symposium. 75–86. https://doi.org/10.1109/RTAS.2015.7108419
[14] Rudolph Emil Kalman. 1960. A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME–Journal of Basic Engineering 82, Series D (1960), 35–45.
[15] Steven M. Kay. 1993. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
[16] Anders Lindquist and Giorgio Picci. 2017. Linear Stochastic Systems. Springer-Verlag.
[17] Kaushik Nagarajan, Nicholas Gans, and Roozbeh Jafari. 2011. Modeling Human Gait Using a Kalman Filter to Measure Walking Distance. In Proceedings of the 2nd Conference on Wireless Health (WH '11). ACM, New York, NY, USA, Article 34, 2 pages. https://doi.org/10.1145/2077546.2077584
[18] Eduardo F. Nakamura, Antonio A. F. Loureiro, and Alejandro C. Frery. 2007. Information Fusion for Wireless Sensor Networks: Methods, Models, and Classifications. ACM Comput. Surv. 39, 3, Article 9 (Sept. 2007). https://doi.org/10.1145/1267070.1267073
[19] Raghavendra Pradyumna Pothukuchi, Amin Ansari, Petros Voulgaris, and Josep Torrellas. 2016. Using Multiple Input, Multiple Output Formal Control to Maximize Resource Efficiency in Architectures. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 658–670.
[20] C. R. Rao. 1945. Information and the Accuracy Attainable in the Estimation of Statistical Parameters. Bulletin of the Calcutta Mathematical Society 37 (1945), 81–89.
[21] Éfren L. Souza, Eduardo F. Nakamura, and Richard W. Pazzi. 2016. Target Tracking for Sensor Networks: A Survey. ACM Comput. Surv. 49, 2, Article 30 (June 2016), 31 pages. https://doi.org/10.1145/2938639
[22] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. 2005. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press.
[23] Greg Welch and Gary Bishop. 1995. An Introduction to the Kalman Filter. Technical Report. Chapel Hill, NC, USA.
8 APPENDIX: RELEVANT RESULTS FROM STATISTICS

8.1 Basic probability theory
For a continuous random variable x, a probability density function (pdf) is a function p(x) whose value provides a relative likelihood that the value of the random variable will equal x. The integral of the pdf within a range of values is the probability that the random variable will take a value within that range.

If g(x) is a function of x with pdf p(x), the expectation E[g(x)] is defined as the following integral:

E[g(x)] = ∫_{-∞}^{∞} g(x) p(x) dx

By definition, the mean µ_x of a random variable x is E[x]. The variance of a discrete random variable x measures the variability of the distribution. For the set of possible values of x, the variance (denoted by σ_x^2) is defined by σ_x^2 = E[(x - µ_x)^2]. If all outcomes are equally likely, then σ_x^2 = Σ(x_i - µ_x)^2 / n. The standard deviation σ_x is the square root of the variance.

The covariance between random variables x_1 : p_1 ∼ (µ_1, σ_1^2) and x_2 : p_2 ∼ (µ_2, σ_2^2) is the expectation E[(x_1 - µ_1)(x_2 - µ_2)]. This can also be written as E[x_1 x_2] - µ_1 µ_2. Two random variables are uncorrelated (or not correlated) if their covariance is zero.

Covariance and independence of random variables are different but related concepts. Two random variables are independent if knowing the value of one of the variables does not give us any information about the value of the other one. This is written formally as p(x_1 | x_2 = a_2) = p(x_1). Independent random variables are uncorrelated, but random variables can be uncorrelated even if they are not independent. Consider a random variable u : U that is uniformly distributed over the unit circle, and consider random variables x : [-1, 1] and y : [-1, 1] that are the x and y coordinates of points in U. Given a value for x, there are only two possible values for y, so x and y are not independent. However, it is easy to show that they are not correlated.

8.2 Matrix derivatives
If f(X) is a scalar function of a matrix X, the matrix derivative ∂f(X)/∂X is defined as the matrix

[∂f(X)/∂X(1,1) ... ∂f(X)/∂X(1,n); ... ; ∂f(X)/∂X(n,1) ... ∂f(X)/∂X(n,n)]

Lemma 8.1. Let X be a m × n matrix, a be a m × 1 vector, and b be a n × 1 vector. Then

∂(a^T X b)/∂X = a b^T    (37)

∂(a^T X^T X b)/∂X = X (a b^T + b a^T)    (38)

Proof. We sketch the proofs of both parts below.

• Equation 37: In this case, f(X) = a^T X b, so ∂f(X)/∂X(i,j) = a(i) b(j) = (a b^T)(i,j).

• Equation 38: In this case, f(X) = a^T X^T X b = (X a)^T (X b), which is equal to Σ_{i=1}^m (Σ_{k=1}^n X(i,k) a(k)) (Σ_{k=1}^n X(i,k) b(k)). Therefore

∂f(X)/∂X(i,j) = a(j) Σ_{k=1}^n X(i,k) b(k) + b(j) Σ_{k=1}^n X(i,k) a(k).

It is easy to see that this is the same value as the (i,j)th element of X(a b^T + b a^T). □

8.3 Proof of Theorem 4.1
Theorem 4.1, which is reproduced below for convenience, can be proved using matrix derivatives.

Theorem. Let pairwise uncorrelated estimates x_i (1 ≤ i ≤ n), drawn from distributions p_i(x) = (µ_i, Σ_i), be fused using the linear model y_A(x_1, .., x_n) = Σ_{i=1}^n A_i x_i, where Σ_{i=1}^n A_i = I. Then MSE(y_A) is minimized for

A_i = (Σ_{j=1}^n Σ_j^{-1})^{-1} Σ_i^{-1}.

Proof. To use the Lagrange multiplier approach of Section 3.2 directly, we can convert the constraint Σ_{i=1}^n A_i = I into a set of m^2 scalar equations (for example, the first equation would be A_1(1,1) + A_2(1,1) + .. + A_n(1,1) = 1), and then introduce m^2 Lagrange multipliers, which can be denoted by λ(1,1), ..., λ(m,m).

This obscures the matrix structure of the problem, so it is better to implement this idea implicitly. Let Λ be an m×m matrix in which each entry is one of the scalar Lagrange multipliers we would have introduced in the approach described above. Analogous to the inner product of vectors, we can define the inner product of two matrices as <A, B> = trace(A^T B) (it is easy to see that <A, B> is Σ_{i=1}^m Σ_{j=1}^m A(i,j) B(i,j)). Using this notation, we can formulate the optimization problem using Lagrange multipliers as follows:

f(A_1, ..., A_n) = E[Σ_{i=1}^n (x_i - µ_i)^T A_i^T A_i (x_i - µ_i)] + <Λ, Σ_{i=1}^n A_i - I>

Taking the matrix derivative of f with respect to each A_i and setting each derivative to zero to find the optimal values of A_i gives us the equation E[2 A_i (x_i - µ_i)(x_i - µ_i)^T] + Λ = 0.
This equation can be written as 2 A_i Σ_i + Λ = 0, which implies

A_1 Σ_1 = A_2 Σ_2 = ... = A_n Σ_n = -Λ/2

Using the constraint that the sum of all A_i equals the identity matrix I gives us the desired expression for A_i:

A_i = (Σ_{j=1}^n Σ_j^{-1})^{-1} Σ_i^{-1}    □
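The unit-circle example in Section 8.1 (coordinates that are uncorrelated but not independent) can also be checked with a short simulation; the sample size and random seed below are arbitrary.

```python
# Monte Carlo illustration of the unit-circle example in Section 8.1:
# for (x, y) uniform on the unit circle, x and y are dependent (given x,
# y can take only two values), yet their covariance is zero.
import math
import random

random.seed(1)
n = 200_000
xs, ys = [], []
for _ in range(n):
    theta = random.uniform(0.0, 2.0 * math.pi)  # uniform point on the circle
    xs.append(math.cos(theta))
    ys.append(math.sin(theta))

mx = sum(xs) / n
my = sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# The sample covariance is close to zero, as the text claims.
assert abs(cov) < 0.01
```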