IN APPLIED MATHEMATICS
A series of lectures on topics of current research interest in applied mathematics under the direction of
the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and
published by SIAM.
GARRETT BIRKHOFF, The Numerical Solution of Elliptic Equations
D. V. LINDLEY, Bayesian Statistics, A Review
R. S. VARGA, Functional Analysis and Approximation Theory in Numerical Analysis
R. R. BAHADUR, Some Limit Theorems in Statistics
PATRICK BILLINGSLEY, Weak Convergence of Measures: Applications in Probability
J. L. LIONS, Some Aspects of the Optimal Control of Distributed Parameter Systems
ROGER PENROSE, Techniques of Differential Topology in Relativity
HERMAN CHERNOFF, Sequential Analysis and Optimal Design
J. DURBIN, Distribution Theory for Tests Based on the Sample Distribution Function
SOL I. RUBINOW, Mathematical Problems in the Biological Sciences
P. D. LAX, Hyperbolic Systems of Conservation Laws and the Mathematical Theory of Shock
Waves
I. J. SCHOENBERG, Cardinal Spline Interpolation
IVAN SINGER, The Theory of Best Approximation and Functional Analysis
WERNER C. RHEINBOLDT, Methods of Solving Systems of Nonlinear Equations
HANS F. WEINBERGER, Variational Methods for Eigenvalue Approximation
R. TYRRELL ROCKAFELLAR, Conjugate Duality and Optimization
SIR JAMES LIGHTHILL, Mathematical Biofluiddynamics
GERARD SALTON, Theory of Indexing
CATHLEEN S. MORAWETZ, Notes on Time Decay and Scattering for Some Hyperbolic Problems
F. HOPPENSTEADT, Mathematical Theories of Populations: Demographics, Genetics and Epidemics
RICHARD ASKEY, Orthogonal Polynomials and Special Functions
L. E. PAYNE, Improperly Posed Problems in Partial Differential Equations
S. ROSEN, Lectures on the Measurement and Evaluation of the Performance of Computing Systems
HERBERT B. KELLER, Numerical Solution of Two Point Boundary Value Problems
J. P. LASALLE, The Stability of Dynamical Systems - Z. ARTSTEIN, Appendix A: Limiting Equations
and Stability of Nonautonomous Ordinary Differential Equations
D. GOTTLIEB AND S. A. ORSZAG, Numerical Analysis of Spectral Methods: Theory and Applications
PETER J. HUBER, Robust Statistical Procedures
HERBERT SOLOMON, Geometric Probability
FRED S. ROBERTS, Graph Theory and Its Applications to Problems of Society
JURIS HARTMANIS, Feasible Computations and Provable Complexity Properties
ZOHAR MANNA, Lectures on the Logic of Computer Programming
ELLIS L. JOHNSON, Integer Programming: Facets, Subadditivity, and Duality for Group and Semi-Group Problems
SHMUEL WINOGRAD, Arithmetic Complexity of Computations
J. F. C. KINGMAN, Mathematics of Genetic Diversity
MORTON E. GURTIN, Topics in Finite Elasticity
THOMAS G. KURTZ, Approximation of Population Processes
(continued on inside back cover)
Probabilistic
Expert Systems
Glenn Shafer
Rutgers University
Newark, New Jersey
Contents
Preface
4.4 Review articles
4.5 Other sources
Index
Preface
have added a brief chapter on resources, which gives information on software and
includes an annotated bibliography. I have also added some exercises that will
help the reader begin to explore the problem of generalizing from probability to
broader domains of recursive computation.
The resulting monograph should be useful to scholars and students in artificial
intelligence, operations research, and the various branches of applied statistics
that use probabilistic methods. Probabilistic expert systems are now used in
areas ranging from diagnosis (in medicine, software maintenance, and space exploration) and auditing to tutoring, and the computational methods described
here are basic to nearly all implementations in all these areas.
I wish to thank Lonnie Winnrich, who organized the conference in North
Dakota, as well as the other participants. They made the week very pleasant
and productive for me. I also wish to thank the many students and colleagues,
at the University of Kansas and around the world, who helped me learn about
expert systems in the late 1980s and early 1990s. Foremost among them is
Prakash P. Shenoy, my colleague in the School of Business at the University
of Kansas from 1984 to 1992. I am grateful for his steadfast friendship and
indispensable collaboration.
Augustine Kong and A. P. Dempster, who joined with Shenoy and me in
the early 1980s in the study of join-tree computation for belief functions, were
also important in the development of the ideas reported here. Section 3.1 is inspired by an unpublished memorandum by Kong. Other colleagues and students
with whom I collaborated particularly closely during this period include Khalid
Mellouli, Debra K. Zarley, and Rajendra P. Srivastava.
Special thanks are due Niven Lianwen Zhang, Chingfu Chang, and the late
George Kryrollos, all of whom made useful comments on the 1992 draft of the
monograph.
I would also like to acknowledge the friendship and encouragement of many
other scholars whose work is reported here, especially A. P. Dawid, Finn V.
Jensen, Steffen L. Lauritzen, Judea Pearl, and David Spiegelhalter. The field of
probabilistic expert systems has benefited not only from their energy, intellect,
and vision, but also from their generosity and good humor.
Finally, at an even more personal level, I would like to thank my wife, Nell
Irvin Painter, who has supported this and my other scholarly work through thick
and thin.
CHAPTER 1
Multivariate Probability
This chapter reviews the basic ingredients of the theory of multivariate probability: marginals, conditionals, and expectations. These will be familiar topics
for many readers, but our approach will take us down some relatively unexplored paths. One of these paths opens when we develop an explicit notation
for marginalization. This notation allows us to recognize properties of marginalization that are shared by many types of recursive computation. Another path
opens when we distinguish among probability distributions on the basis of how
they are stored. We distinguish between tabular distributions, which are simply tables of probabilities, and algorithmic distributions, which are algorithms
for computing probabilities. A parametric distribution is a special kind of algorithmic distribution; it consists of a few numerical parameters and a relatively simple algorithm, usually a formula, for computing probabilities from those
parameters.
The most complex topic in this chapter is conditional probability. Our purposes require that we understand conditional probability from several viewpoints,
and we rely on some careful terminology to keep the viewpoints distinct. We
distinguish between conditional probabilities in general, which can stand on
their own, without reference to any prior probability distribution, and posterior probabilities, which are conditional probabilities obtained by conditioning
a probability distribution on observations. And we distinguish two kinds of
tables of conditional probabilities: conditionals and posterior distributions. A
conditional consists of many probability distributions for a set of variables (the
conditional's head), one for each configuration of another set of variables (its
tail). A posterior distribution is a single probability distribution consisting of
posterior probabilities.
In the next chapter, we study how to construct a probability distribution
by multiplying conditional probabilities or, more precisely, by multiplying conditionals. When we multiply the conditionals in an appropriate order, each
multiplication produces a larger marginal of the final distribution. This means
that each conditional is a continuer for the final distribution; it continues it from
a smaller to a larger set of variables. The concept of a continuer will help us
minimize complications arising from the presence of zero probabilities, which are
unavoidable in expert systems, where much of our knowledge is in the form of
rules that do not admit exceptions. Continuers will also help us, in Chapter 3,
to understand architectures for recursive computation.

TABLE 1.1
A discrete tabular probability distribution for three variables.

                     female                 male
               Dem    ind    Rep      Dem    ind    Rep
young          .08    .16    .08      .02    .04    .02
middle-aged    .05    .05    .05      .00    .00    .00
old            .05    .05    .05      .10    .10    .10

This chapter is about multivariate probability, not about probability in general. Not all probability models are multivariate. The chapter concludes with a
brief explanation of why multivariate models are sometimes inadequate.
1.1. Probability distributions.
The quickest way to orient those not familiar with multivariate probability is
to give an example. Table 1.1 gives a probability distribution for three variables:
Age, Sex, and Party. Notice that the numbers are nonnegative and add to one.
This is what it takes to be a discrete probability distribution.
We will write $\Omega_X$ for the set of possible values of a variable X, and we will
write $\Omega_x$ for the set of configurations of a set of variables x. We call $\Omega_X$ and $\Omega_x$
the frames for X and x, respectively. In general, $\Omega_x$ is the Cartesian product of
the frames of the individual variables: $\Omega_x = \prod_{X \in x} \Omega_X$. In Table 1.1, we assume
that
$$\Omega_{\text{Age}} = \{\text{young}, \text{middle-aged}, \text{old}\},$$
$$\Omega_{\text{Sex}} = \{\text{male}, \text{female}\},$$
and
$$\Omega_{\text{Party}} = \{\text{Democrat}, \text{independent}, \text{Republican}\}.$$
Thus the frame $\Omega_{\text{Age},\text{Sex},\text{Party}}$ consists of eighteen configurations:
(young,male,Democrat),(old,male,independent),...
and Table 1.1 gives a probability for each of them. In general, as in this example,
a discrete probability distribution for x gives a probability to every element of
$\Omega_x$; abstractly, it is a nonnegative function on $\Omega_x$ whose values add to one.
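As an illustration, the frames and Table 1.1 can be written down directly. The following is a minimal sketch in Python, not part of the original text; the names `FRAMES`, `frame`, and `is_distribution` are hypothetical, and exact fractions are used so that the check that the entries add to one is not disturbed by rounding.

```python
from fractions import Fraction
from itertools import product

# Frames (sets of possible values) for the variables of Table 1.1.
FRAMES = {
    "Age": ("young", "middle-aged", "old"),
    "Sex": ("female", "male"),
    "Party": ("Democrat", "independent", "Republican"),
}

def frame(variables):
    """Configurations of a set of variables: the Cartesian product of their frames."""
    return list(product(*(FRAMES[v] for v in variables)))

# Entries of Table 1.1 in hundredths, listed with Party varying fastest,
# then Sex, then Age (the order in which `frame` enumerates configurations).
HUNDREDTHS = [8, 16, 8, 2, 4, 2,   5, 5, 5, 0, 0, 0,   5, 5, 5, 10, 10, 10]
VARIABLES = ("Age", "Sex", "Party")
P = {c: Fraction(n, 100) for c, n in zip(frame(VARIABLES), HUNDREDTHS)}

def is_distribution(table, variables):
    """A discrete distribution is a nonnegative function on the frame that sums to one."""
    return (set(table) == set(frame(variables))
            and all(p >= 0 for p in table.values())
            and sum(table.values()) == 1)

print(len(frame(VARIABLES)))            # 18 configurations
print(is_distribution(P, VARIABLES))    # True
```

Storing the table as a dictionary keyed by configuration keeps the "function on the frame" reading of the text explicit; an array indexed by position would serve equally well.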
If we add together the numbers for males and females in Table 1.1, we get
marginal probabilities for Age and Party, as in Table 1.2. Adding further, we
get marginal probabilities for Age, as in Table 1.3.
Some readers may be puzzled by the name "marginal." The name is derived
from the example of a bivariate table, where it is convenient and conventional
to write the sums of the row and columns in the margins. In Table 1.4, for
for each configuration c of w. Here $x \setminus w$ consists of the variables in x but not in
w, and c.d is the configuration of x that we get by combining the configuration c
of w and the configuration d of $x \setminus w$. For example, if x = {Age,Sex,Party} and
w = {Age,Party}, then $x \setminus w$ = {Sex}; if c = (old,Democrat) and d = (male),
then c.d = (old,male,Democrat).
The arrow notation emphasizes the variables that remain when we marginalize. Sometimes we use instead a notation that emphasizes the variables we sum
out: $P^{-y}$ is the marginal obtained when we sum out the variables in y. Thus
when $x = w \cup y$, where w and y are disjoint sets of variables, and P is a probability distribution on x, both $P^{\downarrow w}$ and $P^{-y}$ will represent P's marginal on w.
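The two notations can be sketched in code. The example below assumes that marginalization is the usual operation of adding, for each configuration c of w, the values of the table over all configurations d of the remaining variables; the function names `marginal` and `sum_out` are hypothetical, and the entries are those of Table 1.1.

```python
from fractions import Fraction
from itertools import product

VARIABLES = ("Age", "Sex", "Party")
AGES = ("young", "middle-aged", "old")
SEXES = ("female", "male")
PARTIES = ("Democrat", "independent", "Republican")

# Table 1.1 in hundredths, Party varying fastest, then Sex, then Age.
HUNDREDTHS = [8, 16, 8, 2, 4, 2,   5, 5, 5, 0, 0, 0,   5, 5, 5, 10, 10, 10]
P = {c: Fraction(n, 100)
     for c, n in zip(product(AGES, SEXES, PARTIES), HUNDREDTHS)}

def marginal(table, variables, keep):
    """Arrow notation: the marginal of `table` on the variables in `keep`."""
    positions = [variables.index(v) for v in keep]
    result = {}
    for config, value in table.items():
        c = tuple(config[i] for i in positions)   # the part of the configuration on w
        result[c] = result.get(c, 0) + value      # add over the variables not in w
    return result

def sum_out(table, variables, drop):
    """Minus notation: the marginal obtained by summing out the variables in `drop`."""
    keep = tuple(v for v in variables if v not in drop)
    return marginal(table, variables, keep)

age_sex = marginal(P, VARIABLES, ("Age", "Sex"))   # keep Age and Sex
age = marginal(P, VARIABLES, ("Age",))             # keep Age only
same = sum_out(P, VARIABLES, {"Party"}) == age_sex
print(age_sex[("young", "female")], age[("old",)], same)
```

With w = {Age,Sex} and y = {Party}, the two calls return the same table, which is the point of the sentence above: the arrow and minus notations name one marginal in two ways.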
Though we are concerned primarily with probability distributions, any
numerical function $f$ on a set of variables x has a marginal $f^{\downarrow w}$ for every subset
w of x. (A numerical function is one that takes real numbers as values. We will consider only
numerical functions in this monograph.) The function $f$ need not be nonnegative or sum to one. If w is not
empty, then $f^{\downarrow w}$ is a function on w.

In order to understand the equation $(fg)^{\downarrow x} = f\,g^{\downarrow x \cap y}$, we must recognize that the product $fg$ is a function
on $x \cup y$. Its value for a configuration c of $x \cup y$ is given by $(fg)(c) = f(c^{\downarrow x})\,g(c^{\downarrow y})$, where
$c^{\downarrow x}$ is the result of dropping from c the values for variables not in x. For example, if $f$ is a
function on {Age,Party} and $g$ is a function on {Sex,Party}, then $(fg)(\text{old}, \text{male}, \text{Democrat}) = f(\text{old}, \text{Democrat})\,g(\text{male}, \text{Democrat})$.
FIG. 1.1. Removing $y \setminus x$ from y leaves $x \cap y$; removing $y \setminus x$ from $x \cup y$ leaves x.
Property 1. If $f$ is a function on y, and u and v are disjoint subsets
of y, then
$$(f^{-u})^{-v} = f^{-(u \cup v)}.$$
Property 2. If $f$ is a function on x, and $g$ is a function on y, then
$$(fg)^{-(y \setminus x)} = f\,g^{-(y \setminus x)}.$$
This version of Property 2 makes it clear that we are summing out the same
variables on both sides of the equation $(fg)^{\downarrow x} = f\,g^{\downarrow x \cap y}$. Summing these
variables out of $fg$, which is a function on $x \cup y$, leaves the variables in x, but
summing them out of $g$, which is a function on y, leaves the variables in $x \cap y$
(see Figure 1.1).
The second version of Property 2 also suggests the following generalization:
Property 3.
We leave it to the reader to derive this property also from equation (1.2).
As we will see in Chapter 3, Properties 1 and 2 are responsible for the possibility of recursively computing marginals of probability distributions given as
products of tables. These properties also hold and justify recursive computation
in other domains, where we work with different objects and different meanings
for marginalization and multiplication. Because of their generality, we call Properties 1 and 2 axioms; Property 1 is the transitivity axiom, and Property 2 is the
combination axiom.
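The two axioms can be checked numerically on small tables. The sketch below is illustrative only: the two-valued frames and the integer entries of `f` and `g` are hypothetical (one entry is negative, since the functions need not be probability distributions), and the helper names `make_table`, `marginal`, and `multiply` are not from the text.

```python
from itertools import product

# Hypothetical two-valued frames, kept small so the check is easy to follow.
FRAMES = {"Age": ("young", "old"), "Sex": ("female", "male"), "Party": ("Dem", "Rep")}

def make_table(variables, values):
    """A table is a pair (variables, {configuration: number})."""
    configs = product(*(FRAMES[v] for v in variables))
    return tuple(variables), dict(zip(configs, values))

def marginal(table, keep):
    """Sum out every variable of the table that is not in `keep`."""
    variables, values = table
    kept = tuple(v for v in variables if v in keep)
    positions = [variables.index(v) for v in kept]
    result = {}
    for config, value in values.items():
        key = tuple(config[i] for i in positions)
        result[key] = result.get(key, 0) + value
    return kept, result

def multiply(f, g):
    """Pointwise product on the union of the two domains."""
    (f_vars, f_vals), (g_vars, g_vals) = f, g
    variables = f_vars + tuple(v for v in g_vars if v not in f_vars)
    result = {}
    for config in product(*(FRAMES[v] for v in variables)):
        c = dict(zip(variables, config))
        result[config] = (f_vals[tuple(c[v] for v in f_vars)]
                          * g_vals[tuple(c[v] for v in g_vars)])
    return variables, result

f = make_table(("Age", "Party"), [1, 2, 3, 4])     # a function on x = {Age, Party}
g = make_table(("Sex", "Party"), [5, -6, 7, 8])    # a function on y = {Sex, Party}
fg = multiply(f, g)

# Transitivity: marginalizing in two steps equals marginalizing in one step.
transitivity_holds = (marginal(marginal(fg, {"Age", "Party"}), {"Age"})
                      == marginal(fg, {"Age"}))
# Combination: the marginal of fg on x equals f times the marginal of g on x ∩ y.
combination_holds = (marginal(fg, {"Age", "Party"})
                     == multiply(f, marginal(g, {"Party"})))
print(transitivity_holds, combination_holds)
```

The combination check is where the computational saving lies: the right-hand side never forms the table on all three variables.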
The definition of marginalization, equation (1.2), together with the proofs
of Properties 1, 2, and 3, can be adapted to the continuous case by replacing
summation with integration. We leave this to the reader. We also leave aside
complications that arise if infinities are allowed, that is, if the sum or integral is over an
infinite frame or an unbounded function. Our primary interest is in distributions
given by tables, and here the frames are both discrete and finite.
1.3. Conditionals.
Table 1.5 gives conditional probabilities for Party given Age and Sex. We call
these numbers conditional probabilities because they are nonnegative and each
group of three (the three probabilities for Party given each Age-Sex configuration) sums to one. In other words, the marginal for {Age,Sex}, Table 1.6, consists
of ones.
We call Table 1.5 as a whole a conditional. We call {Party} its head, and
we call {Age,Sex} its tail. In general, a conditional is a nonnegative function Q
on the union of two disjoint sets of variables, its head h and its tail t, with the
property that $Q^{\downarrow t} = 1_t$, where $1_t$ is the function on t that is identically equal to
one.

TABLE 1.5
A conditional for Party given Age and Sex.

                     female                 male
               Dem    ind    Rep      Dem    ind    Rep
young          1/4    1/2    1/4      1/4    1/2    1/4
middle-aged    1/3    1/3    1/3      1/5    1/5    3/5
old            1/3    1/3    1/3      1/3    1/3    1/3

TABLE 1.6
The marginal of Table 1.5 on its tail.

               female    male
young             1        1
middle-aged       1        1
old               1        1

Two special cases deserve mention. If t is empty, then Q is a probability
distribution for h. If h is empty, then $Q = 1_t$. We are interested in conditionals
not for their own sake but because we can multiply them together to construct
probability distributions. This is the topic of the next chapter.
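The defining property of a conditional, that its marginal on the tail is identically one, is easy to test. The following sketch uses the entries of Table 1.5; the names `tail_marginal` and `is_conditional` are hypothetical.

```python
from fractions import Fraction

PARTIES = ("Democrat", "independent", "Republican")

# Table 1.5: for each (Age, Sex) configuration of the tail,
# the probabilities for Democrat, independent, Republican.
ROWS = {
    ("young", "female"): "1/4 1/2 1/4",
    ("young", "male"): "1/4 1/2 1/4",
    ("middle-aged", "female"): "1/3 1/3 1/3",
    ("middle-aged", "male"): "1/5 1/5 3/5",
    ("old", "female"): "1/3 1/3 1/3",
    ("old", "male"): "1/3 1/3 1/3",
}
Q = {(age, sex, party): Fraction(value)
     for (age, sex), row in ROWS.items()
     for party, value in zip(PARTIES, row.split())}

def tail_marginal(q):
    """Marginal of the conditional on its tail {Age, Sex}: sum out the head {Party}."""
    result = {}
    for (age, sex, _party), value in q.items():
        result[(age, sex)] = result.get((age, sex), 0) + value
    return result

def is_conditional(q):
    """Nonnegative, and the marginal on the tail is identically one."""
    return (all(value >= 0 for value in q.values())
            and all(total == 1 for total in tail_marginal(q).values()))

print(is_conditional(Q))   # True
```

The dictionary returned by `tail_marginal(Q)` corresponds to Table 1.6.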
Frequently, we are interested only in a subtable of a conditional. In Table 1.5,
for example, we might be interested only in the conditional probabilities for
females, the subtable shown in Table 1.7. We call such a subtable a slice. In
general, if $f$ is a table on x and c is a configuration of a subset w of x, then we
write $f|_{w=c}$ for the table on $x \setminus w$ given by
$$f|_{w=c}(d) = f(c.d)$$
for each configuration d of $x \setminus w$, and we call $f|_{w=c}$ the slice of $f$ on w = c. We leave it to the reader to verify the
following proposition.
PROPOSITION 1.1. Suppose Q is a conditional with head h and tail t, and
suppose $w \subseteq t$. Then $Q|_{w=c}$ is a conditional with head h and tail $t \setminus w$.
Table 1.7 illustrates Proposition 1.1; it is a conditional with {Party} as its
head and {Age} as its tail.
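Slicing amounts to selecting the entries that agree with the fixed configuration and dropping the fixed variables. The sketch below slices Table 1.5 on Sex = female and checks the claim of Proposition 1.1 for that case; the function name `slice_table` is hypothetical.

```python
from fractions import Fraction

VARIABLES = ("Age", "Sex", "Party")
PARTIES = ("Democrat", "independent", "Republican")
ROWS = {  # Table 1.5: (Age, Sex) -> probabilities for Democrat, independent, Republican
    ("young", "female"): "1/4 1/2 1/4",
    ("young", "male"): "1/4 1/2 1/4",
    ("middle-aged", "female"): "1/3 1/3 1/3",
    ("middle-aged", "male"): "1/5 1/5 3/5",
    ("old", "female"): "1/3 1/3 1/3",
    ("old", "male"): "1/3 1/3 1/3",
}
Q = {(age, sex, party): Fraction(value)
     for (age, sex), row in ROWS.items()
     for party, value in zip(PARTIES, row.split())}

def slice_table(table, variables, fixed):
    """Return the remaining variables and the slice of `table` on the
    configuration `fixed` (a dict mapping the variables in w to their values)."""
    rest = tuple(v for v in variables if v not in fixed)
    result = {}
    for config, value in table.items():
        c = dict(zip(variables, config))
        if all(c[v] == wanted for v, wanted in fixed.items()):
            result[tuple(c[v] for v in rest)] = value
    return rest, result

rest, female = slice_table(Q, VARIABLES, {"Sex": "female"})

# Proposition 1.1 for this case: the slice is a conditional with head {Party}
# and tail {Age}, so each Age row still sums to one.
row_sums = {}
for (age, _party), value in female.items():
    row_sums[age] = row_sums.get(age, 0) + value
print(rest, row_sums)
```

The table `female` corresponds to Table 1.7.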
We will sometimes find it convenient to generalize the notation for slicing by
allowing the variables whose values we fix to include variables that are outside
the domain of the table and hence have no effect on the result. In general, if $f$ is
a table on x, w is a set of variables, and c is a configuration of w, then we write
$f|_{w=c}$ for the table on $x \setminus w$ given by
$$f|_{w=c}(d) = f(c^{\downarrow w \cap x}.d)$$
for each configuration d of $x \setminus w$.
TABLE 1.7
The slice of Table 1.5 on Sex = female.

               Dem    ind    Rep
young          1/4    1/2    1/4
middle-aged    1/3    1/3    1/3
old            1/3    1/3    1/3

TABLE 1.8
The marginal of Table 1.1 for Age and Sex.

               female    male
young           .32       .08
middle-aged     .15       .00
old             .15       .30

1.4. Continuation.
or
PROPOSITION 1.3.
and
Moreover,
4. A probability distribution is its own unique continuer from the empty set
to its domain.
Proof. Statement 1 follows directly from the definition of marginalization,
equation (1.2).
To prove statement 2, we substitute $kg$ for $f$ in equation (1.9), obtaining
$(kg)^{\downarrow x} = (kg)^{\downarrow w} Q$. By the combination axiom, this becomes $k g^{\downarrow x} = k g^{\downarrow w} Q$, or
$g^{\downarrow x} = g^{\downarrow w} Q$.
Again by the combination axiom, $P = kf$ implies $P^{\downarrow \emptyset} = k f^{\downarrow \emptyset}$. Since P
is a probability distribution, $P^{\downarrow \emptyset} = 1$, whence $k = 1/f^{\downarrow \emptyset}$. So equation (1.10)
holds. Since $f^{\downarrow \emptyset}$ is a positive number, equation (1.10) is the unique solution of
equation (1.11); P is the unique continuer of $f$ from $\emptyset$ to x.
To prove statement 4, substitute P for $f$ in equation (1.11) and again apply
the combination axiom.
Equations (1.6) and (1.9) do not require that Q be a function on x. They
require only that Q's domain, say v, should satisfy $x = w \cup v$ or, equivalently,
$x \setminus w \subseteq v \subseteq x$. In some cases (when the right-hand side of equation (1.8) does
not depend on all the coordinates of c), there is a continuer with a domain v
that is smaller than x. The situation is illustrated in Figure 1.2, where we have
written $u_1$ for $w \setminus v$, $u_2$ for $w \cap v$, and $u_3$ for $v \setminus w$. We may say, in this situation,
that $u_2$ is sufficient for the continuation from w to x; the other variables in w,
those in $u_1$, can be neglected.
If the function $f$ that we are continuing is a probability distribution, then
the idea of sufficiency can be elaborated in terms of the meaning of the probabilities. If we give the probabilities an objective interpretation, then we can say
that once the configuration of $u_2$ is determined, the configuration of $u_1$ will not
affect the determination of the configuration of $u_3$. If we give the probabilities a
subjective interpretation, then we can say that once we know the configuration
of $u_2$, information about the configuration of $u_1$ will not affect our beliefs about
the configuration of $u_3$.
The philosophy of probability that underlies this monograph is neither strictly
objective nor strictly subjective. Instead, it is constructive. We see a probability
distribution as something we deliberately construct in order to make predictions.
Though these predictions may be the best we can do, we need not be fully committed to them as beliefs. And though they should be evaluated empirically,
they need not individually represent stable frequencies. In terms of this constructive interpretation, sufficiency simply means adequacy for prediction. Once
the configuration of $u_2$ is specified, we ignore information about $u_1$ when we
predict $u_3$.
Instead of saying that $u_2$ is sufficient for the continuation from w to x, we
may say that $u_3$ is independent of $u_1$ given $u_2$. The concept of conditional independence thus defined is mathematically interesting. Its properties include the
symmetry suggested by Figure 1.2: if $u_3$ is independent of $u_1$ given $u_2$, then $u_1$
is independent of $u_3$ given $u_2$ (see Dawid [27], Pearl [8], or Appendix F of Shafer
[9]). Conditional independence is an important concept for both the objective
and subjective interpretations of probability. In the objective interpretation, a
1.5. Posterior distributions.
Suppose the probability distribution P on x expresses our beliefs about the values
of the variables in x. And suppose we now observe the values of the variables
in a subset w of x; we observe that w has the configuration c. How should this
change our beliefs about the remaining variables, the variables in $x \setminus w$?
The standard answer is that we should change our beliefs by conditioning P
on w = c. This means that we should change our belief that $x \setminus w = d$ from
$P^{\downarrow x \setminus w}(d)$ to
$$\frac{P(c.d)}{P^{\downarrow w}(c)}. \qquad (1.13)$$
We call this number P's posterior probability for d given c. It exists only if
$P^{\downarrow w}(c) > 0$, but we may suppose that if $P^{\downarrow w}(c)$ is zero we will not observe
w = c.
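Conditioning can be sketched as selecting the entries of P that agree with the observation and dividing by their total. The example below assumes the posterior probability for d given c is $P(c.d)/P^{\downarrow w}(c)$, as described above, and uses the entries of Table 1.1; the function name `posterior` is hypothetical.

```python
from fractions import Fraction
from itertools import product

VARIABLES = ("Age", "Sex", "Party")
AGES = ("young", "middle-aged", "old")
SEXES = ("female", "male")
PARTIES = ("Democrat", "independent", "Republican")

# Table 1.1 in hundredths, Party varying fastest, then Sex, then Age.
HUNDREDTHS = [8, 16, 8, 2, 4, 2,   5, 5, 5, 0, 0, 0,   5, 5, 5, 10, 10, 10]
P = {c: Fraction(n, 100)
     for c, n in zip(product(AGES, SEXES, PARTIES), HUNDREDTHS)}

def posterior(table, variables, observed):
    """Posterior distribution on the unobserved variables, given the
    observation `observed` (a dict mapping the variables in w to their values)."""
    position = {v: i for i, v in enumerate(variables)}
    consistent = {c: p for c, p in table.items()
                  if all(c[position[v]] == value for v, value in observed.items())}
    norm = sum(consistent.values())        # the marginal probability of the observation
    if norm == 0:
        raise ValueError("the observation has probability zero; no posterior exists")
    rest = [v for v in variables if v not in observed]
    return {tuple(c[position[v]] for v in rest): p / norm
            for c, p in consistent.items()}

given_old = posterior(P, VARIABLES, {"Age": "old"})
print(given_old[("female", "Democrat")])   # 1/9
```

Observing Age = middle-aged and Sex = male, which has probability zero in Table 1.1, raises the error instead of returning a table; this mirrors the remark that the posterior exists only when the observation has positive probability.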
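Equation (1.13) defines a whole probability distribution, a distribution on
$x \setminus w$ that we may designate by $P^{x \setminus w \mid w=c}$:
$$P^{x \setminus w \mid w=c}(d) = \frac{P(c.d)}{P^{\downarrow w}(c)}$$
for each configuration d of $x \setminus w$.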
Equation (1.15) says that $P^{\mid w=c}$ is equal to the product of P and the function
on w that assigns the value $1/P^{\downarrow w}(c)$ to the configuration c and the value 0 to
all other configurations. It follows that $P^{\mid w=c}$ is proportional to the product
of P and a function on w that assigns 1 to c and 0 to all other configurations.
This point is sufficiently important to merit being stated in symbols. To this
end, we write $I_{w=c}$ for the function on w that assigns 1 to c and 0 to all other
configurations:
$$I_{w=c}(c') = \begin{cases} 1 & \text{if } c' = c, \\ 0 & \text{otherwise.} \end{cases}$$
where the $f_i$ are tables of reasonable size, but the number of variables involved
altogether is too large to allow the actual computation and storage of the table P.
(It will not be difficult to compute the value of P for a particular configuration,
at least if we know the constant of proportionality. But there may be too many
configurations for us to compute the value of P for all of them.) In this situation,
as we will see, we can often work from the factorization to find marginals for P,
even though we cannot compute P itself. We may also be interested in computing
marginals for posteriors of P, and therefore we will be interested in transforming
Suppose
and
1.6. Expectation.
Most readers will be familiar with the idea of the expectation of a function
V on x with respect to a probability distribution P on x. This is a number,
usually denoted by $E_P(V)$. In the discrete case, it is obtained by multiplying
corresponding values of P and V and adding the products. Thus
$$E_P(V) = \sum_{c} P(c)\,V(c).$$
by
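In the discrete case the expectation is a single sum of products. The sketch below computes it for Table 1.1 with a hypothetical function V, the indicator of Party = Democrat, so that the expectation is simply the probability of being a Democrat; the names `expectation` and `is_democrat` are illustrative.

```python
from fractions import Fraction
from itertools import product

AGES = ("young", "middle-aged", "old")
SEXES = ("female", "male")
PARTIES = ("Democrat", "independent", "Republican")

# Table 1.1 in hundredths, Party varying fastest, then Sex, then Age.
HUNDREDTHS = [8, 16, 8, 2, 4, 2,   5, 5, 5, 0, 0, 0,   5, 5, 5, 10, 10, 10]
P = {c: Fraction(n, 100)
     for c, n in zip(product(AGES, SEXES, PARTIES), HUNDREDTHS)}

def expectation(distribution, function):
    """Multiply corresponding values of P and V and add the products."""
    return sum(p * function(config) for config, p in distribution.items())

def is_democrat(config):
    """Hypothetical V: 1 for configurations with Party = Democrat, 0 otherwise."""
    _age, _sex, party = config
    return 1 if party == "Democrat" else 0

print(expectation(P, is_democrat))   # 3/10
```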
1.7.
The probability distributions we have been studying are tabular. A tabular distribution is a table that gives a probability for each configuration. We will find
it useful to distinguish tabular distributions from algorithmic distributions. An
algorithmic distribution consists of an algorithm, together possibly with some
numerical information, that enables us to compute the probabilities of individual configurations. Algorithmic distributions can involve more or less complex
algorithms and more or less numerical information. At one extreme are distributions such as the Poisson, which are specified by a single number (the
mean in the case of the Poisson) and a simple formula. At another extreme
are the posterior distributions that arise in Bayesian statistics, which may involve many numbers and complicated algorithms. In the next few chapters,
we will be concerned with an intermediate case; we define a distribution for
a large number of variables as the product of many tables of numbers, each
involving only a few variables. Here there are many numbers but a simple
algorithm: multiply.
The line between tabular and algorithmic distributions cuts across the line
between discrete and continuous distributions. A continuous distribution, like a
discrete distribution, can be either tabular or algorithmic. In the tabular case, we
store the values of the density at a sufficiently large number of configurations. In
the algorithmic case, we store instead a formula or algorithm that enables us to
compute the value of the density at any configuration. To some extent, the line
also cuts across the line between numerical and categorical variables. (Variables
like Age, Sex, and Party are called categorical, because they have categories
(e.g., young, old, and middle-aged) rather than numbers as possible values.)
Distributions for categorical variables are usually tabular, but distributions for
numerical variables can be tabular or algorithmic.
When an algorithmic distribution involves only a few numbers, we call the
numbers parameters, and we call the distribution parametric. The distributions
with names (Poisson, multinomial, Gaussian, and so on) are parametric.
The terms tabular, parametric, and algorithmic can be applied to conditionals
and other functions as well as to distributions. These terms can help us keep track
of complications involved in finding marginals and continuers of distributions and
in multiplying conditionals. Figure 1.3 shows the main points. When we compute
marginals, we generally stay in the same class of distributions; a marginal of a
table is a table, a marginal of a Gaussian is a Gaussian, and so on. A continuer
or posterior for a tabular distribution is tabular, but only in a few cases (such as
the multinomial and the Gaussian) do continuers or posteriors stay in the same
parametric family as their distributions. Multiplication usually takes us out, of
the class of tabular distributions. Given a collection of tables for the same small
set of variables, we can perform the multiplication to obtain a new table, but
given tables for many different small sets of variables, the size of the frame for
all the variables may prevent us from computing and storing the product we
may have to settle for thinking of the multiplication as an algorithm that allows
us to find the probability for a particular configuration when we want it.
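The distinction can be made concrete with two small sketches. The first is a parametric distribution, the Poisson, stored as one number and a formula. The second is a product of small tables treated as an algorithm: the chain of fifty binary variables and the numbers in `initial` and `transition` are hypothetical, chosen only to show that a single configuration can be evaluated even though the full table could not reasonably be stored.

```python
import math
from itertools import product

def poisson_pmf(k, mean):
    """Parametric distribution: a single parameter and a simple formula."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

# Hypothetical factors for a chain X1, X2, ..., Xn of binary variables:
# a table for X1 and one table giving each variable from its predecessor.
initial = {0: 0.5, 1: 0.5}
transition = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}

def chain_probability(config):
    """Probability of one configuration, found by multiplying table entries."""
    probability = initial[config[0]]
    for previous, current in zip(config, config[1:]):
        probability *= transition[(previous, current)]
    return probability

n = 50
frame_size = 2 ** n                           # far too many configurations to tabulate
p_all_zero = chain_probability((0,) * n)      # yet any single one is cheap to compute

# For a short chain the product can still be tabulated in full.
small_table = {c: chain_probability(c) for c in product((0, 1), repeat=3)}
print(frame_size, p_all_zero, sum(small_table.values()))
```

The storage for the chain is six numbers regardless of n, which is the sense in which "there are many numbers but a simple algorithm" scales where a table does not.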
FIG. 1.3. The effect of computation.
1.8. A limitation.
Though the multivariate framework for probability is widely used, it has its
limitations. A principal limitation is that it requires every variable to have a
value no matter how matters come out. This is often appropriate in statistical
work; in our example, every individual has an age and a sex, and we invent the
category "independent" so that every individual will have a party affiliation. It is
less appropriate in expert-system work, where the meaningfulness of a variable
often depends on the values of other variables. A particular medical test or
procedure only has a result if it is carried out, and we carry it out only for
some patients. A particular phoneme has a certain characteristic in the seventh
millisecond only if it lasts that long, and sometimes it may not. "Number of
pregnancies" is applicable only to women, not to men and children. We can
pretend that these variables always have values, but when there are many of
them, this is computationally awkward as well as artificial.
It is one thing to recognize this limitation and another to correct it. The
multivariate framework is flexible as well as expressive, and the obvious alternatives lack much of its flexibility. A tree, for example, allows us to represent
some variables as being meaningful only if others have certain values but allows access to the variables only in a certain order. Consequently, most work in
probability, both theory and application, is carried out within the multivariate
framework, and extensions to the framework are developed and used on a fairly
ad hoc basis.
The graphical models that we will study in the following chapters are squarely
within the multivariate framework. For some ideas about going beyond it, see
Dempster [16] and Chapter 16 of Shafer [9].
Exercises.
EXERCISE 1.1. Derive the three properties of marginalization listed in Section 1.2
from equation (1.2).
EXERCISE 1.2. Here are some familiar problems, each with its own concept
of combination and its own concept of marginalization. Discuss, in each case,
how to formalize the problems so that the axioms of transitivity and combination
are satisfied.
1. Systems of equations (or, more generally, systems of constraints
on numerical variables) are combined by pooling and marginalized
(we usually say "reduced") by eliminating variables.
2. Linear programming problems can be combined by adding (or
perhaps multiplying) their objective functions and pooling their constraints. They can be reduced by maximizing their objective functions
over variables that are eliminated.
3. Discrete belief functions are combined by Dempster's rule and
marginalized by restricting the events for which beliefs are demanded.
(One formalization is provided by Shafer, Shenoy, and Mellouli [45]
and another by Shenoy and Shafer [48].)
In which of these problems do continuers exist?
EXERCISE 1.3. Fix a set of variables X, and consider all pairs of the form
(f, V), where f is a strictly positive table on some subset x of X, and V is an
arbitrary table on the same set of variables x. Call x the domain of (f, V).
Define multiplication for such pairs by setting
Show that these operations satisfy the axioms of transitivity and combination.
(Compare equation (1.22).) This example, suggested to the author by Robert
Cowell, is relevant to computation in decision theory, where f may represent
a probability distribution and V may represent a utility function.
EXERCISE 1.4. Consider a function f on a set of variables x, together with a
collection $\{h_X\}_{X \in x}$ of functions on the individual variables in x. For each subset
w of x, let f^w be the marginal on w of the function obtained by multiplying f
by the hx for X not in w. In symbols,
CHAPTER 2
Construction Sequences
Under certain conditions on the heads and tails of a sequence of conditionals, the
product of the conditionals will be a probability distribution. We call a sequence
of conditionals satisfying these conditions a construction sequence.
As we will see, the conditionals in a construction sequence are continuers for
the probability distribution obtained by multiplying them together. Initial segments of the sequence produce marginals of this probability distribution. Thus
the construction sequence represents a step-by-step construction of the probability distribution.
After constructing a probability distribution, we may want to find a marginal
for it or one of its posteriors. This may be difficult computationally, especially
if the joint frame of all the variables is too large to permit us to carry out the
multiplication of the conditionals. Were we able to carry out this multiplication,
we could store the resulting table and work directly with it to find marginals.
But if we are obliged to keep the probability distribution stored as a product of
tables, then we must look for less direct methods.
In some cases, as we will see in this chapter, a computationally inexpensive
adaptation of a construction sequence will produce a construction sequence for
the marginal we desire. To obtain the marginal for the variables in an initial
segment of a construction sequence, we need only omit the later factors from the
construction sequence. To obtain the posterior for later variables given values
of the variables in an initial segment, we need only slice the later factors. If the
construction sequence is a chain, then we can find a construction sequence for
the variables in a final segment by a simple forward propagation. The general
case, however, requires the more general methods that we will study in the next
chapter, methods that apply to any distribution stored as a product of tables,
whether or not the tables form a construction sequence.
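The two manipulations described above, omitting later factors and slicing later factors, can be tried on a small example. The sketch below assumes a three-step construction sequence: a distribution for Age with probabilities .40, .15, .45, a conditional for Sex given Age, and the conditional for Party given Age and Sex from Table 1.5. The names `Q1`, `Q2`, and `Q3` and the choice of this particular sequence are assumptions made for illustration; their product reproduces Table 1.1.

```python
from fractions import Fraction as F
from itertools import product

AGES = ("young", "middle-aged", "old")
SEXES = ("female", "male")
PARTIES = ("Democrat", "independent", "Republican")

# Assumed factors: a distribution for Age, a conditional for Sex given Age,
# and the conditional for Party given Age and Sex (Table 1.5).
Q1 = dict(zip(AGES, (F(2, 5), F(3, 20), F(9, 20))))
Q2 = dict(zip(product(AGES, SEXES),
              (F(4, 5), F(1, 5), F(1), F(0), F(1, 3), F(2, 3))))
Q3 = dict(zip(product(AGES, SEXES, PARTIES), map(F, (
    "1/4 1/2 1/4  1/4 1/2 1/4  "
    "1/3 1/3 1/3  1/5 1/5 3/5  "
    "1/3 1/3 1/3  1/3 1/3 1/3").split())))

# Multiplying the whole sequence gives the joint distribution.
P = {(a, s, p): Q1[a] * Q2[a, s] * Q3[a, s, p]
     for a, s, p in product(AGES, SEXES, PARTIES)}

# Marginal for the initial segment {Age, Sex}: omit the later factor Q3.
initial_segment = {(a, s): Q1[a] * Q2[a, s] for a, s in product(AGES, SEXES)}
summed_out = {(a, s): sum(P[a, s, p] for p in PARTIES)
              for a, s in product(AGES, SEXES)}

# Posterior for {Sex, Party} given Age = old: slice the later factors on Age = old.
sliced = {(s, p): Q2["old", s] * Q3["old", s, p]
          for s, p in product(SEXES, PARTIES)}
conditioned = {(s, p): P["old", s, p] / Q1["old"]
               for s, p in product(SEXES, PARTIES)}

print(initial_segment == summed_out, sliced == conditioned)
```

In both comparisons the left-hand table is computed from the factors alone, without forming the joint table P; the joint is built here only to confirm the results.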
If each new conditional in a construction sequence involves a single new variable, then the most essential qualitative aspects of the construction sequence
can be represented by a directed acyclic graph (DAG). Such graphs have been
widely used for knowledge acquisition for probabilistic expert systems, and on
the theoretical side, they have been studied as a representation of conditional independence relations (Pearl [8]). Here we emphasize the value of DAGs for representing alternative construction sequences, that is, construction sequences that use the
same conditionals but order them differently. By bringing these alternative orderings into the picture, a DAG enlarges the number of marginals and posteriors
that we can find by simple manipulations. In the general case, where each new
conditional is allowed to involve more than one new variable, we can similarly
indicate alternative orderings with a bubble graph, which is slightly more general
than a DAG.

TABLE 2.1
$Q_1$, a probability distribution for Age. (This is a conditional with an empty tail and with Age as its head.)

young          .40
middle-aged    .15
old            .45

TABLE 2.2
$Q_2$, a conditional with Age as its tail and Sex as its head.

               female    male
young            4/5      1/5
middle-aged      1        0
old              1/3      2/3

TABLE 2.3
$Q_1 Q_2$, a probability distribution for Age and Sex.

               female    male
young           .32       .08
middle-aged     .15       .00
old             .15       .30

2.1. Multiplying conditionals.
Table 2.1 gives a probability distribution $Q_1$ for Age (its single column adds to
one), and Table 2.2 gives a conditional $Q_2$ for Sex given Age (each row adds to
one). When we multiply these two tables, we get Table 2.3, which qualifies as a
probability distribution for Age and Sex (its six entries add to one). Notice that
$Q_1$ is a marginal of this probability distribution and hence $Q_2$ is a continuer.
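The multiplication of Tables 2.1 and 2.2 can be checked numerically. The following is a minimal sketch, assuming the tables are stored as Python dictionaries; the names `Q1`, `Q2`, `Q1Q2`, and `marginal_age` are illustrative choices, not notation from the text.

```python
from fractions import Fraction as F

# Table 2.1: Q1, a probability distribution for Age.
Q1 = {"young": F(40, 100), "middle-aged": F(15, 100), "old": F(45, 100)}

# Table 2.2: Q2, a conditional for Sex given Age, keyed by (age, sex).
Q2 = {
    ("young", "female"): F(4, 5), ("young", "male"): F(1, 5),
    ("middle-aged", "female"): F(1), ("middle-aged", "male"): F(0),
    ("old", "female"): F(1, 3), ("old", "male"): F(2, 3),
}

# Pointwise product: (Q1 Q2)(age, sex) = Q1(age) * Q2(age, sex).
Q1Q2 = {(age, sex): Q1[age] * p for (age, sex), p in Q2.items()}

# Marginal of the product on Age: sum Sex out.
marginal_age = {
    age: sum(p for (a, _), p in Q1Q2.items() if a == age) for age in Q1
}

for (age, sex), p in Q1Q2.items():
    print(f"{age:12s} {sex:6s} {float(p):.2f}")
```

The printed entries are those of Table 2.3, and summing Sex out of the product returns `Q1`, which is the sense in which $Q_1$ is a marginal of the product.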
We need not carry out the numerical multiplication in order to see that the
product $Q_1 Q_2$ is a probability distribution. We can instead perform an abstract
computation:
CONSTRUCTION SEQUENCES
Here we have first broken the summation into a summation over Sex followed
by a summation over Age. Since $Q_1$ does not involve Sex, it can be factored out
of the first summation, leaving $Q_2$, which sums to one over Sex because it is a
conditional. This leaves us with the sum of $Q_1$ over Age, which is one because
$Q_1$ is a probability distribution.
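Written out, the abstract computation described here takes the following form. This is a reconstruction from the verbal description, under the assumption that the sums run over the frames of Age and Sex; the exact typography of equation (2.1) is not recoverable from the text.

```latex
\[
\sum_{\mathrm{Age},\,\mathrm{Sex}} Q_1 Q_2
  = \sum_{\mathrm{Age}} \sum_{\mathrm{Sex}} Q_1 Q_2
  = \sum_{\mathrm{Age}} Q_1 \Bigl( \sum_{\mathrm{Sex}} Q_2 \Bigr)
  = \sum_{\mathrm{Age}} Q_1
  = 1.
\]
```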
Consider more generally any two conditionals $Q_1$ and $Q_2$. Write $t_i$ for the
tail, $h_i$ for the head, and $d_i$ for the domain of $Q_i$. (Recall that $d_i = t_i \cup h_i$.) Our
example generalizes to the following proposition.
PROPOSITION 2.1. Suppose $t_1$ is empty, $t_2$ is contained in $d_1$, and $h_2$ is
disjoint from $d_1$.
1. The product $Q_1 Q_2$ is a probability distribution on $d_1 \cup d_2$.
2. The conditional $Q_1$ is $Q_1 Q_2$'s marginal on $d_1$.
3. The conditional $Q_2$ continues $Q_1 Q_2$ from $d_1$ to $d_1 \cup d_2$.
Proof. Since we do not have symbols for individual variables, we will not use
summations like those in equation (2.1); instead, we will use our notation for
marginalization. We prove statement 1 by writing
Here we have used both the transitivity and the combination axioms.
Since $Q_1$ has an empty tail, it is a probability distribution. By the combination axiom,
FIG. 2.1. Left: the first tail is empty. The second tail is contained in the first domain,
and the second head is disjoint from the first domain. Right: two more head-tail pairs have
been added. Each time, the new tail is contained in the existing domain, and the new head is
disjoint from it.
and we say that the construction sequence represents this probability distribution. The restrictions on the head-tail structure of a construction sequence are
illustrated in Figure 2.1.
Statement 2 of Proposition 2.2 indicates one way that we can exploit a construction sequence. If we are interested only in the variables in $d_1 \cup \cdots \cup d_i$ and
not in the remaining variables, those in $h_{i+1} \cup \cdots \cup h_n$, then we can simply
omit the last $n - i$ conditionals from the construction sequence: $Q_1, \ldots, Q_i$ is a
construction sequence for the marginal probability distribution on $d_1 \cup \cdots \cup d_i$.
Another way to exploit a construction sequence is to fix the values of variables
we have observed. If these variables appear at the beginning of the construction
sequence, then this produces a construction sequence for the posterior distribution.
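Both ways of exploiting a construction sequence can be illustrated with the two-factor sequence of Tables 2.1 and 2.2. The sketch below assumes dictionary-based tables; the helper `slice_on_age` and the brute-force checks are illustrative names introduced here, not part of the text.

```python
from fractions import Fraction as F

# Construction sequence Q1, Q2 from Tables 2.1 and 2.2.
Q1 = {"young": F(2, 5), "middle-aged": F(3, 20), "old": F(9, 20)}
Q2 = {
    ("young", "female"): F(4, 5), ("young", "male"): F(1, 5),
    ("middle-aged", "female"): F(1), ("middle-aged", "male"): F(0),
    ("old", "female"): F(1, 3), ("old", "male"): F(2, 3),
}
joint = {(a, s): Q1[a] * p for (a, s), p in Q2.items()}

# Omitting the later factor Q2 leaves Q1, which should equal the
# marginal on Age computed by brute force from the joint.
brute_marginal = {
    a: sum(p for (a2, _), p in joint.items() if a2 == a) for a in Q1
}

def slice_on_age(conditional, age):
    """Fix Age = age in a conditional keyed by (age, sex)."""
    return {s: p for (a, s), p in conditional.items() if a == age}

# Slicing the later factor on the observed value Age = old should equal
# the posterior for Sex computed by brute force from the joint.
sliced = slice_on_age(Q2, "old")
norm = sum(p for (a, _), p in joint.items() if a == "old")
brute_posterior = {s: p / norm for (a, s), p in joint.items() if a == "old"}

print(brute_marginal == Q1, sliced == brute_posterior)
```

In this small case no renormalization is needed after slicing, because the sliced factor is already a probability distribution for Sex.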
PROPOSITION 2.3. Suppose $Q_1, \ldots, Q_n$ is a construction sequence. Suppose
$1 < i < n$. Write $d$ for $\bigcup_{j=1}^{n} h_j$, the domain of $Q_1 \cdots Q_n$, and write $t$ for $\bigcup_{j=1}^{i} h_j$,
the domain of $Q_1 \cdots Q_i$. Suppose $c$ is a configuration of $t$. Then
2.2.
TABLE 2.4
A conditional for Party given Age.

Age            Dem    Ind    Rep
young          1/4    1/2    1/4
middle-aged    1/3    1/3    1/3
old            1/3    1/3    1/3
Notice that if we use instead the conditional $Q'_3$ given by Table 2.4, then we
obtain the same probability distribution $P_{\mathrm{Age},\mathrm{Sex},\mathrm{Party}}$.
Like equation (2.3), equations (2.4) and (2.5) represent one-new-variable-at-a-time construction sequences.
When one new variable is added at a time, the head-tail structure of the
construction sequence can be represented by a directed acyclic graph (DAG for
short). This graph has the variables as nodes, and it has arrows to $X_i$ from
each element of $t_i$, for $i = 2, \ldots, n$. We call this graph directed because the
links between the nodes are arrows, and we call it acyclic because there are no
cycles following the arrows.5 (Since the arrows we draw to each $X_i$ are all from
$X_j$ with $j < i$, any path following the arrows always goes in the direction of
increasing indices; it cannot cycle back to a smaller index.) Figure 2.2 shows
DAGs for the construction sequences represented by equations (2.3), (2.4), and
(2.5), respectively. Figure 2.3 shows the DAG for the more complex construction
sequence represented by the equation
The middle graph in Figure 2.2 and the graph in Figure 2.3 both have cycles,
but not cycles following the arrows. The cycle $X_1, X_3, X_4, X_1$ in Figure 2.3, for
example, goes against an arrow on its last step.
5 Some authors prefer the name acyclic directed graph in order to emphasize that only
directed cycles are forbidden; a path that does not always follow the arrows is allowed to be a
cycle. But the name directed acyclic graph and the acronym DAG are strongly established in
the literature.
A belief net is a finite DAG with variables as nodes, together with, for each
node X, a conditional that has X as its head and X's immediate predecessors
in the DAG as its tail. We have just explained how a construction sequence
determines a belief net. It is also true that the conditionals in a belief net can
always be ordered so as to form a construction sequence. This follows from the
following lemma.
LEMMA 2.1. The nodes of a finite DAG can always be ordered so that each
variable's immediate predecessors in the DAG precede it in the ordering. In other
words, we can find an ordering $X_1, \ldots, X_n$ such that the immediate predecessors
of $X_i$ in the DAG are a subset of $\{X_1, \ldots, X_{i-1}\}$. (In particular, $X_1$ has no
predecessors in the DAG.)
Proof. The simplest proof is by induction on n, the number of variables in
the DAG. There is at least one node in the DAG that has no successors; if
every node had a successor, then we could form a cycle by going from each node
to a successor until (because there are only finitely many nodes) we repeated
ourselves. If we choose a node with no successors as Xn, and if we then remove
this node and the arrows to it, then we obtain a DAG with only $n - 1$ nodes
which, by the inductive hypothesis, has an ordering $X_1, \ldots, X_{n-1}$ satisfying the
condition. The ordering $X_1, \ldots, X_n$ then also satisfies the condition.
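The inductive proof translates directly into a procedure: repeatedly pick a node with no successors, place it last, and remove it. The following is a minimal sketch under the assumption that the DAG is given as a dictionary mapping each node to its set of immediate predecessors; the function name and the four-node example are hypothetical and do not correspond to any figure in the text.

```python
def construction_ordering(predecessors):
    """Order the nodes so that each node's immediate predecessors precede it.

    predecessors maps each node to the set of its immediate predecessors.
    The input is assumed to be acyclic; with a directed cycle, no node
    without successors exists and next() raises StopIteration.
    """
    remaining = {node: set(preds) for node, preds in predecessors.items()}
    reversed_order = []
    while remaining:
        # Nodes that still have a successor among the remaining nodes.
        has_successor = set()
        for preds in remaining.values():
            has_successor |= preds
        # Choose a node with no successors; it goes last among those remaining.
        last = next(node for node in remaining if node not in has_successor)
        reversed_order.append(last)
        del remaining[last]
    return reversed_order[::-1]

# Hypothetical example: X1 -> X2, X1 -> X3, X2 -> X4, X3 -> X4.
dag = {"X1": set(), "X2": {"X1"}, "X3": {"X1"}, "X4": {"X2", "X3"}}
order = construction_ordering(dag)
print(order)
```

The choice of which successor-free node to remove is arbitrary, which is one way to see why a DAG usually has more than one such ordering.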
We may call an ordering of the nodes of a DAG that satisfies the conditions
of Lemma 2.1 a DAG construction ordering. Unless a DAG is merely a chain,
it has more than one DAG construction ordering. The DAG in Figure 2.3, for
example, has five:
A variety of other names are also in use, including Bayesian network and graphical model.
Every DAG construction ordering for the DAG of a belief net gives, of course,
an ordering of its conditionals that is a construction sequence for the probability distribution represented by the belief net. Thus the five DAG construction
orderings we just listed produce five construction sequences for the probability distribution in equation (2.6): five ways to permute the $Q_i$ and still have a
construction sequence.
We can talk about a belief net representing a probability distribution, without
reference to any particular construction sequence: a belief net represents a probability distribution P if P is equal to the product of the conditionals attached
to its DAG. We can also talk about a DAG by itself representing a probability
distribution: a DAG represents P if by attaching appropriate conditionals we
can make it into a belief net representing P, i.e., if P factors into conditionals
in the way indicated by the DAG.
Considered abstractly, a belief net represents a probability distribution more
concisely than a construction sequence does. It provides the same conditionals,
but it refrains from ordering them completely. For this reason, belief nets are
considered more fundamental than construction sequences in much of the literature on probabilistic expert systems. As a practical matter, however, belief nets
arise from a step-by-step construction that provides a complete ordering, and
we usually preserve this ordering when we store a belief net. Moreover, as we
will see in the next section, there is no practical advantage in considering only
construction sequences that introduce one new variable at a time. So in this
monograph, we take construction sequences as fundamental, and we treat belief
nets as secondary tools: tools that help us see alternative orderings for particular one-new-variable-at-a-time construction sequences. In small problems, where
we can actually draw the DAG, it enables us to see alternative orderings at a
glance. In larger problems, the idea of the DAG reminds us of the existence of
alternative orderings.
Marginals and posteriors. From a computational point of view, the alternative construction sequences that we can discern by studying a DAG are important
because they broaden the application of Propositions 2.2 and 2.3. Since we can
apply these propositions to any construction sequence consistent with the DAG,
we can obtain construction sequences for a much larger class of marginals and
posteriors than we can obtain by working with a single construction sequence.
Propositions 2.2 and 2.3 are concerned with initial segments of a construction
sequence. We may also talk about initial segments of a DAG. We say that a set
w of nodes of a DAG is an initial segment of the DAG if all the immediate
predecessors of each element of w are also in w.
LEMMA 2.2. A set w of nodes in a finite DAG is an initial segment of the
DAG if and only if the DAG has a DAG construction ordering $X_1, \ldots, X_n$ such
that
$w = \{X_1, \ldots, X_k\}$
for some k.
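The definition of an initial segment is easy to test mechanically. The sketch below assumes the same dictionary representation of a DAG (each node mapped to its set of immediate predecessors); the function name and the example DAG are hypothetical.

```python
def is_initial_segment(predecessors, w):
    """Return True if every immediate predecessor of a node in w is also in w."""
    w = set(w)
    return all(predecessors[node] <= w for node in w)

# Hypothetical example: X1 -> X2, X1 -> X3, X2 -> X4, X3 -> X4.
dag = {"X1": set(), "X2": {"X1"}, "X3": {"X1"}, "X4": {"X2", "X3"}}

print(is_initial_segment(dag, {"X1", "X2"}))  # predecessors of X2 are inside
print(is_initial_segment(dag, {"X2"}))        # X1 is missing
```

A set that passes this test can be listed first in some DAG construction ordering, which is the content of the lemma.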
We call a DAG a chain if its nodes can be ordered, as in Figure 2.4, so that
the first has no immediate predecessors in the DAG and each of the others has
its predecessor in the ordering as its only immediate predecessor in the DAG.
Notice that a chain has only one DAG construction ordering: $X_1, \ldots, X_n$ is the
unique DAG construction ordering for the chain $X_1 \to \cdots \to X_n$.
We call a belief net a belief chain if its DAG is a chain. Thus a belief chain
consists of a chain $X_1 \to \cdots \to X_n$ and corresponding conditionals $Q_1, \ldots, Q_n$.
The first conditional has $X_1$ as its head and an empty tail; the $i$th conditional
has $X_i$ as its head and $X_{i-1}$ as its tail. The idea of forward propagation in such
a chain is based on the following lemma.
LEMMA 2.3. In a belief chain,
FIG. 2.6. The state graph for the Markov chain in Figure 2.5.
state graph for the Markov chain of Figure 2.5.) In general, we cannot draw a
state graph for a belief chain because the successive variables may have different
frames. Even if the frames are the same, the possible transitions or at least their
probabilities will vary.
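Forward propagation in a belief chain works even when successive variables have different frames. The following is a minimal sketch, assuming each conditional is stored as a dictionary keyed by (previous value, next value); the function name is hypothetical, and the two-node chain Age to Sex from Tables 2.1 and 2.2 serves as the example.

```python
from fractions import Fraction as F

def forward_propagate(prior, conditionals):
    """Return the marginals for X1, ..., Xn of a belief chain.

    prior maps each value of X1 to its probability; conditionals[k] maps
    (value of X_{k+1}, value of X_{k+2}) to a conditional probability.
    """
    marginals = [prior]
    for Q in conditionals:
        previous = marginals[-1]
        current = {}
        # Multiply the previous marginal by the conditional and sum out
        # the previous variable.
        for (x_prev, x_next), p in Q.items():
            current[x_next] = current.get(x_next, 0) + previous[x_prev] * p
        marginals.append(current)
    return marginals

Q1 = {"young": F(2, 5), "middle-aged": F(3, 20), "old": F(9, 20)}
Q2 = {
    ("young", "female"): F(4, 5), ("young", "male"): F(1, 5),
    ("middle-aged", "female"): F(1), ("middle-aged", "male"): F(0),
    ("old", "female"): F(1, 3), ("old", "male"): F(2, 3),
}
marginals = forward_propagate(Q1, [Q2])
print({sex: float(p) for sex, p in marginals[1].items()})
```

Each step touches only one conditional and the marginal before it, so the cost grows with the length of the chain rather than with the size of the joint distribution.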
In recent years, considerable use has been made of belief nets of a type
slightly more general than Markov chains: hidden Markov models. To form a
hidden Markov model, we begin with a Markov chain, say $X_1 \to \cdots \to X_n$, and
from each node $X_i$ we add an arrow to a new node, say $Y_i$, so as to obtain a
DAG as in Figure 2.7. All the $Y_i$ have the same frame (possibly different from the
frame for the $X_i$) and the same conditional. In applications, the $Y_i$ are observed,
while the $X_i$ are not; the Markov chain $X_1 \to \cdots \to X_n$ is hidden. We are
interested in finding posterior probabilities for the $X_i$. We may, for example,
want to find the most likely configuration of $X_1, \ldots, X_n$. Since the $Y_i$ do not
form an initial segment of the belief net, we cannot use Proposition 2.4 to find
posterior probabilities for the Xi. But efficient methods for finding posterior
probabilities (and for finding most likely configurations) have been developed in
the literature on hidden Markov models, and these methods, as it turns out, are
special cases of more general methods that we will study in Chapter 3.
Figure 2.7 represents only the simplest type of hidden Markov model; in
practice, the model is elaborated in various ways. One common elaboration
involves attaching more than one observable variable to each $X_i$. There may be
a fixed number of observable variables for each $X_i$, or this number itself may be
an observable variable. In speech recognition, for example, each Xi represents
Bubble graphs.
Though the visual clarity of belief nets is very attractive, there is no practical reason to limit ourselves to construction sequences involving only one new
variable at a time. All the computational ideas we considered in the preceding
section generalize to the general case, and we can also generalize the graphical
representation itself.
The simplest graphical representation of a general construction sequence is
the bubble graph. This graph has a node for each conditional. This node, called
a bubble, contains all the variables in the head and has an arrow to it from each
variable in the tail. Figure 2.8 shows a bubble graph for a construction sequence
for ten variables:
A bubble graph is acyclic in the same sense that a DAG is acyclic: we cannot
go in a cycle following the arrows. Moreover, a bubble graph, like a DAG,
permits us to pick out alternative construction orderings for the nodes i.e.,
alternative construction sequences for the probability distribution. In Figure 2.8,
for example, the bubbles can be ordered in seven different ways:
And hence there are seven ways of ordering the conditionals to form a construction sequence:
Marginals and posteriors. In the general case, as in the one-new-variable-at-a-time case, we can exploit alternative construction sequences to find prior
marginals for initial segments or posterior marginals given initial segments, and
we can propagate forward in chains to find prior marginals for final segments.
The idea of initial segments is defined for bubble graphs just as for DAGs,
and Proposition 2.4 continues to hold. Translating this proposition into a direct statement about alternative construction sequences, we get the following
generalization of Proposition 2.5.
PROPOSITION 2.6. Suppose $Q_1, \ldots, Q_n$ is a construction sequence for P.
Suppose $i_1, \ldots, i_k$ is a sequence of distinct integers between 1 and n such that $t_{i_1}$
is empty and $t_{i_j}$ is contained in $h_{i_1} \cup \cdots \cup h_{i_{j-1}}$ for $j = 2, \ldots, k$. Write w for
$h_{i_1} \cup \cdots \cup h_{i_k}$.
1. $Q_{i_1}, \ldots, Q_{i_k}$ is a construction sequence for $P^{\downarrow w}$.
2. Suppose c is a configuration of w. Suppose we modify the sequence
$Q_1, \ldots, Q_n$ by deleting each $Q_{i_j}$ and by slicing each of the other conditionals
on w = c. Then the result is a construction sequence for P's posterior given
w = c.
A construction sequence $Q_1, \ldots, Q_n$ is a construction chain if each $t_i$ is contained in $h_{i-1}$ for $i = 2, \ldots, n$. Figure 2.9 shows a bubble graph for a construction
chain: the bubbles are ordered, and each bubble has arrows only from variables
in the preceding bubble.
Lemma 2.3 generalizes as follows.
LEMMA 2.4. Suppose $Q_1, \ldots, Q_n$ is a construction chain. Then
We are particularly interested in the marginal of this posterior for the variable
N, which corresponds to an overall judgment that the financial statement is fairly
stated. Since the observed variables do not form an initial segment of the bubble
graph, we cannot find this marginal using the methods we have studied in this
chapter. Instead, we must use the methods of the next chapter, which apply to
arbitrary factorizations.
2.4.
There are a number of alternatives to the bubble graph for representing the head-tail structure of construction sequences, including chain graphs (Wermuth and
Lauritzen [50]) and valuation networks (Shenoy [47]). Figure 2.14 shows a chain
graph and Figure 2.15 shows a valuation network corresponding to the bubble
graph of Figure 2.12. Both types of graph have uses beyond that of representing
construction sequences. In the chain graph for a construction sequence, all the
variables in each head are linked with each other, but by omitting some of these
links, we can represent additional conditional independence relations. By varying
the shape of the relational nodes and its arrows in a valuation network, we can
represent a wide variety of relations.
Another more complex graphical representation has been developed by Heckerman [30] under the name similarity network. A similarity network is a tool for
knowledge acquisition; it allows someone constructing a probability distribution
to allow certain variables in a construction sequence to be sufficient for other
variables given some values for earlier variables but not given other values for
these earlier variables.
Exercises.
EXERCISE 2.1. The idea of a construction sequence for a probability distribution generalizes to the idea of a construction sequence for a conditional. In
this generalization, we no longer require that the first tail be empty and that each
new tail be contained in the existing domain. We require only that each new head
be disjoint from the existing domain.
Consider first two conditionals $Q_1$ and $Q_2$. Under the hypothesis that $h_2$ is
disjoint from $d_1$ (Figure 2.16), prove the following statements:
1. The product $Q_1 Q_2$ is a conditional with head $h_1 \cup h_2$ and domain $d_1 \cup d_2$.
2. The product $Q_1 1_{t_2}$ is $Q_1 Q_2$'s marginal on $d_1 \cup t_2$.
3. The conditional $Q_2$ continues $Q_1 Q_2$ from $d_1 \cup t_2$ to $d_1 \cup d_2$.
Then consider a sequence of conditionals $Q_1, \ldots, Q_n$. Under the hypothesis that
$h_i$ is disjoint from $d_1 \cup \cdots \cup d_{i-1}$ for $i = 2, \ldots, n$, prove the following statements:
1. The product $Q_1 \cdots Q_n$ is a conditional with head $h_1 \cup \cdots \cup h_n$
and domain $d_1 \cup \cdots \cup d_n$.
2. For $i = 2, \ldots, n$, $Q_1 \cdots Q_{i-1} 1_{(d_1 \cup \cdots \cup d_n) \setminus (h_i \cup \cdots \cup h_n)}$ is the
marginal of $Q_1 \cdots Q_n$ on $(d_1 \cup \cdots \cup d_n) \setminus (h_i \cup \cdots \cup h_n)$.
FIG. 2.16. Here we ask only that the second head be disjoint from the first domain.
CHAPTER 3
The clusters of variables involved in the tables are shown in Panel 1 of Figure 3.1.
Let us imagine summing the variables out in the reverse of the order in which
they are numbered, keeping track as we go of the new clusters we create.
Summing $X_7$ out yields
FIG. 3.2.
comes was created. The node to which an arrow points always includes all the
variables in the node from which the arrow comes, except the variable that was
summed out. For any particular variable X, any node n containing X must be
connected to the node n' created when X is summed out, because the tables
created as we go downward from n continue to contain X until it is summed out.
It follows that all the nodes containing X are connected in the tree (i.e., form
a subtree), and this is equivalent to the tree being a join tree.
The join tree that we construct in this way is interesting because it can be
interpreted as a picture of the computations involved in the variable-by-variable
summing out. We interpret a node x as a register that can store a table for its
variables, and we interpret an arrow from x to y as an instruction to sum out a
variable from x's table and multiply y's table by the result.
We begin by putting tables in the storage registers; in Figure 3.2, for example,
we put the table $f_1$ in 23, the table $f_2$ in 57, the product $f_3 f_4$ in 1234, and the
table $f_5$ in 146. We put tables of ones in the other three nodes. The number
beside each arrow tells us which variable to sum out of the table in the node
preceding the arrow. Figure 3.3 shows the summations we perform when we
follow these instructions.
We summed the variables out in the reverse of the order in which they were
numbered: 7, 6, 5, 4, 3, 2. Figures 3.2 and 3.3 make it clear, however, that
this order can be varied to some extent without changing the join tree or the
computations performed. The only constraint is that we sum out of a given node
only after the node has absorbed messages from all nodes with arrows pointing
to it. Only the three nodes 23, 57, and 146 can begin the computation, 1245 can
act after 57, 124 can act after 1245 and 146, and so on.
We do not need the numbers beside the arrows in Figure 3.2. These numbers
tell us which variable to sum out, but we can also find this information by
comparing the node sending the message to the node receiving it. The sender
always sums out the variable it has that its neighbor does not have. In other
words, it marginalizes to its intersection with the neighbor.
The final result of the computation is $f^{\downarrow X_1}$, the marginal of $f$ for $X_1$. If we
continue by summing $X_1$ out of this table, then we obtain $f^{\downarrow \emptyset}$, the marginal of
$f$ on the empty set. Figure 3.2 can be extended to include this final summation;
we simply add $\emptyset$ as a node, with an arrow to it from 1.
3.2.
Rule 1. Each node waits to send its message to its neighbor nearer
to r until it has received messages from all its other neighbors.
Rule 2. When a node is ready to send its message, it computes the
message by summing out of its current table any variables it has but
the neighbor to whom it is sending the message does not have. (This
was always a single variable in Figure 3.1, but it could be several
variables or none.) In other words, it marginalizes its current table
to its intersection with the neighbor.
Rule 3. When a node receives a message, it replaces its current table
with the product of that table and the message.
Eventually, all the nodes except r will have sent messages, and r will have received a message from each of its neighbors and will have multiplied its original
table by all these messages.
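Rules 1 to 3 can be sketched in a few lines of code. The following is a minimal illustration, assuming each table is stored as a pair (variables, dictionary from value tuples to numbers) and each join-tree node is a sorted tuple of variable names; the three-node tree, its numbers, and all function names are hypothetical and are not taken from the figures.

```python
from itertools import product

def multiply(f, g, frames):
    """Pointwise product of two tables on the union of their variables."""
    (fv, ft), (gv, gt) = f, g
    variables = tuple(sorted(set(fv) | set(gv)))
    table = {}
    for values in product(*(frames[v] for v in variables)):
        assign = dict(zip(variables, values))
        table[values] = (ft[tuple(assign[v] for v in fv)]
                         * gt[tuple(assign[v] for v in gv)])
    return variables, table

def marginalize(f, keep):
    """Sum out of a table every variable that is not in keep."""
    fv, ft = f
    kept = tuple(v for v in fv if v in keep)
    table = {}
    for values, number in ft.items():
        key = tuple(val for v, val in zip(fv, values) if v in keep)
        table[key] = table.get(key, 0) + number
    return kept, table

def propagate_inward(tables, neighbors, root, frames):
    """Elementary architecture: pass messages toward root, return its table."""
    current = dict(tables)

    def collect(node, parent):
        for other in neighbors[node]:
            if other != parent:
                collect(other, node)  # Rule 1: outer neighbors report first
                # Rule 2: marginalize to the intersection with the receiver.
                message = marginalize(current[other], set(other) & set(node))
                # Rule 3: the receiver multiplies the message into its table.
                current[node] = multiply(current[node], message, frames)

    collect(root, None)
    return current[root]

# Hypothetical join tree AB - BC - CD on binary variables.
frames = {v: (0, 1) for v in "ABCD"}
AB, BC, CD = ("A", "B"), ("B", "C"), ("C", "D")
tables = {
    AB: (AB, {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}),
    BC: (BC, {(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 3}),
    CD: (CD, {(0, 0): 1, (0, 1): 1, (1, 0): 2, (1, 1): 5}),
}
neighbors = {AB: [BC], BC: [AB, CD], CD: [BC]}

root_table = propagate_inward(tables, neighbors, BC, frames)
full = multiply(multiply(tables[AB], tables[BC], frames), tables[CD], frames)
brute_force = marginalize(full, {"B", "C"})
print(root_table == brute_force)
```

The recursion is only one way to respect Rule 1; any schedule in which a node sends after hearing from its other neighbors gives the same table on the root.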
Here is the proposition we need to prove.
PROPOSITION 3.1. At the end of the algorithm just described, the table on r
will be $\varphi^{\downarrow r}$, the marginal on r of the product of the initial tables.
Proof. Imagine for the moment that the nodes are peeled away from the join
tree as they send their messages, so that in the end only r remains. Thus a single
step of the algorithm consists of three parts: (1) a node t computes the marginal
of its table to $b \cap t$, (2) the neighbor b multiplies this marginal into its current
table, and (3) the node t is removed from the tree. This allows us to state the
following lemma.
LEMMA 3.1. After each step, the product of the tables that remain is the
marginal to the variables that remain of the product of the tables before the step.
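To see that Lemma 3.1 is true, write $N_1$ for the set of nodes in the tree before
the step, $N_2$ for the set of nodes in the tree after the step, and $\psi_x$ for the table
in node x before the step. Thus the product of the tables before the step is
$\prod_{x \in N_1} \psi_x$, and the product of the tables after the step is $(\prod_{x \in N_2} \psi_x)\,\psi_t^{\downarrow b \cap t}$ (see
Figure 3.4). Since the tree is a join tree, $b \cap t = (\cup N_2) \cap t$. So we find, using the
combination axiom, that
\[
\Bigl(\prod_{x \in N_1} \psi_x\Bigr)^{\downarrow \cup N_2}
= \Bigl(\Bigl(\prod_{x \in N_2} \psi_x\Bigr)\psi_t\Bigr)^{\downarrow \cup N_2}
= \Bigl(\prod_{x \in N_2} \psi_x\Bigr)\psi_t^{\downarrow (\cup N_2) \cap t}
= \Bigl(\prod_{x \in N_2} \psi_x\Bigr)\psi_t^{\downarrow b \cap t},
\]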
which is a restatement of Lemma 3.1. Lemma 3.1, together with the transitivity
axiom, yields the next lemma.
LEMMA 3.2. After each step, the product of the tables that remain is the
marginal to the variables that remain of the product of the initial tables.
FIG. 3.4. The loaded join tree before and after t sends its inward message to b.
At the end of the algorithm, we have only one table, the table on the root,
and so we obtain Proposition 3.1 as a special case of Lemma 3.2.
We can gain some further insight into the algorithm by noting that when a
node b receives a message from a neighbor t, it is also receiving, indirectly, information from the nodes on the other side of t. After any step (message-passing
and multiplication) in the algorithm, we can identify the nodes from which a
given node b has received information, either directly or indirectly. These nodes,
together with b itself, form a subtree, which we may call b's information
branch at that point (see Figure 3.5). The steps we have taken within this subtree are the same as the steps we would have taken had we implemented the
algorithm on it alone, with b as the root. So as a corollary of Proposition 3.1,
we have the following proposition.
PROPOSITION 3.2. After each step, the table on a given node b will be the
marginal on b of the product of the initial tables in b's current information branch.
This is a generalization of Proposition 3.1, because at the end of the algorithm, the root's information branch is the whole tree.
In the course of explaining our algorithm, we have found ourselves talking
about the nodes of the join tree as storage registers and even as individual
processors. Each node can store tables for a certain set of variables, multiply
such tables, and marginalize them. In effect, we have made the join tree, together
with the algorithm, into an architecture for marginalization. We call it the
elementary architecture. In the next few sections, we consider some alternative
architectures, based on the same join tree, that are able to compute marginals
for all the nodes, not merely for a single root node.
Join-tree architectures are potentially applicable to any instance of the general problem of computing marginals of a function given as a product of tables,
as in equation (3.1), but in order to apply a join-tree architecture to such a problem, we first find a join tree that covers the product, one that includes for each
factor a node containing the domain of that factor. (If we want the marginal for a
FIG. 3.5. The dashed arrows are those over which messages have already been sent. The
circled subtree is b's information branch at this point.
cluster of variables that is not the domain of one of the factors, then we must
make sure that the join tree also has a node containing this cluster.) Once we
have such a join tree, we place each factor in a node containing its domain. If
a node x receives more than one factor, we multiply them together, and we also
multiply by $1_x$ if necessary in order to obtain a table that involves all the variables in x. If a node x does not receive a factor, we simply assign it the table $1_x$.
If the join tree has more than one node containing the domain of a particular
factor, we can put the factor in whichever of these nodes we please. In Figure 3.2,
for example, we have two different nodes that can accept a table on 124. To
minimize computation, we should choose the node with the smaller frame size,
but this is a minor consideration.
The choice of the join tree is much more important. We want a join-tree
cover with nodes small enough to permit computation. If such a join-tree cover
does not exist, we will have to turn to alternative methods for marginalization,
such as Markov-chain Monte Carlo.
As we noted at the beginning of the chapter, there are heuristics that do
produce reasonable choices for join-tree covers. Some of these heuristics do
involve choosing an order for eliminating (summing out) the variables. This not
only produces a join-tree cover; it also determines a placement of the factors in
the join tree: each factor goes as close as possible to the root.
3.3.
The elementary architecture allows us to find the marginal for an arbitrary root
of a join tree. If we then want to find the marginal for another node, we can
use the same join tree, but we must repeat the algorithm using the new node
FIG. 3.6. The partial Shafer-Shenoy architecture. Like the elementary architecture, it
finds the marginal for a single root node. In each separator, we have indicated the set of
variables involved in the messages that will be stored there; this is always the intersection of
the two neighboring nodes.
as the root. This usually involves a great deal of duplication. In Figure 3.4, for
example, most of the steps for computing the marginal on w will be the same as
those for computing the marginal on r.
The Shafer-Shenoy architecture provides one way to eliminate much of this
duplication. In this architecture, each node sends messages in all directions. It
is allowed to send its message to a particular neighbor as soon as it has messages
from all its other neighbors. In order that the computations for a message in one
direction should not interfere with those for a message in another direction, a
node no longer replaces its table each time it receives a message. Instead, it keeps
its initial table, stores the incoming messages, and performs multiplications only
as needed for computing outgoing messages.
As a first step in describing the Shafer-Shenoy architecture, we will describe
a partial version, in which, as in the elementary architecture, messages are propagated only to a single root r. Figure 3.6 shows this partial architecture. The
squares on the arrows in this figure are called separators; they contain storage
registers for storing the messages sent in the direction of the arrows. As in the
elementary architecture, we begin with a table $\varphi_x$ on each node x and we want
to find $\varphi^{\downarrow r}$ for a particular node r, where $\varphi$ is the product of the $\varphi_x$. The storage
registers in the separators are initially empty.
Here are the rules for propagation in the partial Shafer-Shenoy architecture:
Rule 1. Each node waits to send its message to its neighbor nearer to
r until it has received messages from all its other neighbors. (More
precisely, it waits until messages have been received by the separators
between it and these other neighbors.)
FIG. 3.7. The full Shafer-Shenoy architecture. The arrow in each storage register indicates the direction of the message to be stored there.
FIG. 3.8. One order in which messages might be sent in the full Shafer-Shenoy architecture.
If the computations are performed serially, there will necessarily be one node,
such as 1 in Figure 3.8, that is the first to receive messages from all its neighbors.
This node can be considered the root. The propagation consists of a pass inward
to the root and another pass back outward. It is not necessary, however, to
specify the root in advance. If the computations are performed in parallel (a
possibility suggested when we talk as if the nodes were individual processors),
then which node is the first to receive all its messages will depend on the pace
of the computations for the different nodes farther out in the tree, and it is even
possible that two nodes will tie for first. This happens in Figure 3.9, where
the computations proceed in parallel and in synchrony, and 124 and 12 receive
messages from each other simultaneously on the third step of the computation.
By comparing Figures 3.6 and 3.8, we can understand better why the Shafer-Shenoy
architecture stores so many messages. The elementary architecture uses
and discards each message when it is sent. But what would happen if we were
to follow the inward pass of the elementary architecture with an outward pass?
In the case of Figures 3.6 and 3.8, this means that after 1 absorbed the message
from 12, it would send a message back to 12. By the usual rule, the message back
would simply be its current table, which was obtained by multiplying its original
table by the message (no marginalization is needed, because the intersection of
1 with 12 is simply 1). Intuitively, this is wrong, because it forces 12 to
absorb again the message it just sent, effectively counting it twice. The Shafer-Shenoy architecture sends instead only the original table, uncontaminated with
the message from 12. It is able to do this because it has kept both its original
table and the message. The same thing happens at each further step on the
outward pass. Node 12, for example, since it still has both its original table and
the messages from 23 and 1, is able to send a message back to 124 that is not
contaminated with the message it received from 124.
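The double counting can be seen numerically on a two-node fragment. The sketch below assumes a tree with only the nodes 1 and 12 (omitting the other neighbors in the figures) and hypothetical integer tables; all names are illustrative.

```python
# Hypothetical original tables: phi1 on node 1 (variable X1) and
# psi12 on node 12 (variables X1, X2), both binary.
phi1 = {0: 2, 1: 3}
psi12 = {(0, 0): 1, (0, 1): 4, (1, 0): 2, (1, 1): 5}

# Inward message from 12 to 1: the marginal of psi12 on X1.
inward = {x1: sum(p for (a, _), p in psi12.items() if a == x1) for x1 in phi1}

# Elementary architecture: node 1 absorbs the message into its table.
current1 = {x1: phi1[x1] * inward[x1] for x1 in phi1}

# Target: the marginal on node 12 of the product of the initial tables,
# which here is the product itself because 12 contains both variables.
correct = {(x1, x2): phi1[x1] * p for (x1, x2), p in psi12.items()}

# Naive outward pass: node 1 sends back its current table.
naive = {(x1, x2): p * current1[x1] for (x1, x2), p in psi12.items()}

# Shafer-Shenoy outward pass: node 1 sends its original table combined
# with messages from its other neighbors (there are none in this fragment).
outward = dict(phi1)
shafer_shenoy = {(x1, x2): p * outward[x1] for (x1, x2), p in psi12.items()}

print(naive == correct, shafer_shenoy == correct)
```

The naive result differs from the target by exactly one extra factor of the inward message, which is the sense in which that message is counted twice.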
Roughly speaking, the Shafer-Shenoy architecture computes marginals for
all the nodes at about three times the price for a single marginal. We double
the computation because we compute two messages instead of one for each link,
and then we increase it by about the same amount again when we do the final
multiplications to get the marginal for each node. This contrasts with repeating the elementary architecture for each node, which multiplies the amount of
computation for a single marginal by the number of nodes.
Unfortunately, the Shafer-Shenoy architecture is still rather wasteful in its
demand for multiplication. Each node computes a message for each of its neighbors only once (in contrast to what happens if we use the elementary architecture
over and over), but the multiplication a node performs to compute the message
to one neighbor still duplicates much of the multiplication it performs to compute the message to another. In Figure 3.7, for example, node 124 will multiply
its original table by the message from 1245 once when it sends its message to
146 and again when it sends a message to 12. With yet more storage, we could
reduce this remaining duplication somewhat, but it is more effective to take another tack. Instead of trying to keep the message a node sends on the inward
pass from being included in the message it gets back, we can allow for the message's later return by dividing it out of the node's current table as it is sent.
This is the tack taken by the Lauritzen-Spiegelhalter architecture.
3.4.
FIG. 3.10. Rules for the Lauritzen-Spiegelhalter architecture. The message, In or Out, is
always the marginal of the sender's current table to the sender's intersection with the receiver.
LEMMA 3.4. At the end of the inward pass, the table on r is $\varphi^{\downarrow r}$.
Now consider the outward pass. On the outward pass, each node except the
root receives just one message: the message from its inward neighbor. The root
itself sends messages but does not receive any. So the table on the root does not
change, and each of the other tables changes exactly once, when it is multiplied
by the message from its inward neighbor. Since the propagation moves outward
from the root, Proposition 3.5 follows by induction from Lemma 3.4 together
with the following lemma.
LEMMA 3.5. Suppose w has $\varphi^{\downarrow w}$ as its table when it sends its message to outward neighbor x. Then after absorbing the message, x will have $\varphi^{\downarrow x}$ as its table.
To prove Lemma 3.5, we need a formula for the message w sends to x.
LEMMA 3.6. If w has $\varphi^{\downarrow w}$ as its table when it sends its message to outward
neighbor x, then the message it sends is the product of the Shafer-Shenoy messages in both directions: $m_{w\to x} m_{x\to w}$.
To prove Lemma 3.6, we note that by its hypothesis and equation (3.3), the
table on w is
marginal of $\varphi$ on w as its table before sending the message, and it computes the
message by marginalizing this table to $w \cap x$.
Using continuers. The alert reader will have noticed that we glossed over the
problem of zero probabilities in our description of the Lauritzen-Spiegelhalter
architecture. If the table $m_{x\to w}$ has zero values, then we will not be able to
perform the division in equation (3.4). Fortunately, it is not really necessary to
perform this division. The reasoning with which we proved Proposition 3.5 will
work if we can find a continuer, say $Q_{x\cap w\to x}$, of $\varphi_x \prod_{n\in\mathcal{N}_x\setminus\{w\}} m_{n\to x}$ from $x \cap w$
to x, for we can use $Q_{x\cap w\to x}$ as x's table after it has sent its message inward
to w, and this will have the same effect as the division. When the message
$m_{w\to x} m_{x\to w}$ comes back, we obtain
as our table on x, so that Lemma 3.3 and Proposition 3.5 still hold.
The requirement that continuers should exist makes the Lauritzen-Spiegelhalter architecture slightly less general than the Shafer-Shenoy architecture, which allows negative entries in the tables $\varphi_x$. Continuers may fail to exist
when negative values are allowed. But if the product of the $\varphi_x$ is proportional
to a probability distribution, then we can take it for granted that all the entries
are nonnegative, because dropping minus signs will not change the product.
And, in this case, continuers exist by Proposition 1.1.
Notice the other implication of Proposition 1.1: we can choose the continuers
to be conditionals. More precisely, we can choose the continuer $Q_{x\cap w\to x}$ to be a
conditional with head $x \setminus w$ and tail $x \cap w$.
When we look beyond probability to other problems satisfying the transitivity and combination axioms (see the exercises at the end of Chapter 1 and at the
end of this chapter), we find that the Shafer-Shenoy and Lauritzen-Spiegelhalter
architectures have overlapping but distinct ranges of application. The Shafer-Shenoy architecture works whenever there are no restrictions on multiplication
and marginalization, even if continuers do not exist. The Lauritzen-Spiegelhalter
architecture, on the other hand, can sometimes work under restrictions on
multiplication or marginalization that prevent the use of the Shafer-Shenoy
architecture.
The new construction sequence. One interesting feature of the Lauritzen-Spiegelhalter architecture is that the product of the tables on the nodes remains
equal to $\varphi$ during the inward pass. This is clear when we divide: each time we
divide one of the tables by a message, we multiply another by the same message,
so the product does not change. It is equally clear in terms of continuers: each
time we factor a table into a marginal and a continuer and remove the continuer
from the node, we add it as a factor in another node.
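Suppose we always choose the continuers to be conditionals. Then at the
end of the inward pass, we have transformed the original factorization of $\varphi$,
$\varphi = \prod_{x\in\mathcal{N}} \varphi_x$, into a new factorization,
$$\varphi = \varphi^{\downarrow r} \prod_{x\in\mathcal{N}\setminus\{r\}} Q_{x\cap w(x)\to x}, \tag{3.6}$$
where w(x) is x's inward neighbor. This new factorization, as it turns out, can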
be interpreted as a construction sequence.
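In order to make the interpretation as a construction sequence precise, let us
take one more step, continuing the inward pass, as it were, from r to the empty
set $\emptyset$. In other words, we factor the marginal $\varphi^{\downarrow r}$ into the product of $\varphi^{\downarrow\emptyset}$ and
a continuer from $\emptyset$ to r. Since $\varphi$ is proportional to a probability distribution
P, $\varphi^{\downarrow\emptyset} \neq 0$, and hence the continuer is unique; it is the marginal $P^{\downarrow r}$. So
equation (3.6) becomes
$$\varphi = \varphi^{\downarrow\emptyset}\, P^{\downarrow r} \prod_{x\in\mathcal{N}\setminus\{r\}} Q_{x\cap w(x)\to x}. \tag{3.7}$$
If we imagine a node $\emptyset$ added to the join tree, with an arrow to it from r,
then at the end of the inward pass, we have the factors on the right-hand side
of equation (3.7) on the nodes of the tree (see Figure 3.12).
FIG. 3.12.
By Proposition 1.3, the probability distribution P is equal to $\varphi/\varphi^{\downarrow\emptyset}$. So
equation (3.7) tells us that
$$P = P^{\downarrow r} \prod_{x\in\mathcal{N}\setminus\{r\}} Q_{x\cap w(x)\to x}. \tag{3.8}$$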
It is the conditionals on the right-hand side of this equation that can be arranged
in a construction sequence for P. Indeed, suppose $x_1, \dots, x_m$ is an ordering of
the nodes of the join tree that moves outward from the root, i.e., such that $x_1$
is the root and each later $x_i$ is an outward neighbor of one of $x_1, \dots, x_{i-1}$. (Such
orderings exist in any tree.) Write $Q_i$ for $Q_{x_i\cap w(x_i)\to x_i}$, for $i = 2, \dots, m$. Then
we have the following lemma.
LEMMA 3.7. $P^{\downarrow r}, Q_2, \dots, Q_m$ is a construction sequence for P.
Proof. Equation (3.8) says that P is the product of $P^{\downarrow r}, Q_2, \dots, Q_m$, and
the union of their heads is clearly equal to the domain of P. So to prove
the lemma, we need only show that the head of each conditional is disjoint from
the domain of the preceding ones. But this is an obvious property of join trees:
whenever we order the nodes in a sequence moving outward from a root, the
intersection of each node $x_i$ with the preceding nodes is always contained in its
inward neighbor $w(x_i)$, and hence $x_i \setminus w(x_i)$ is disjoint from $x_1 \cup \dots \cup x_{i-1}$.
Lemma 3.7 says that at the end of the inward pass, the tables on the nodes
are conditionals, and any outward sequence is a construction sequence.
The outward pass of the Lauritzen-Spiegelhalter architecture can be understood in terms of the construction sequences produced by the inward pass. Consider, for example, the action of the outward pass on the path going outward
from the root r to a particular node x (Figure 3.13). It is evident that the
conditionals along this path form a construction chain for the marginal of P
on the variables involved, and the propagation outward in this chain is forward
propagation in the sense of Chapter 2.
3.5.
FIG. 3.14. The inward and outward action of the Aalborg architecture between x and its
inward neighbor w. Here $\psi_x$ and $\psi_w$ are the tables on x and w, respectively, just before x
computes its message to w, and $\psi'_x$ and $\psi'_w$ are the tables just before w computes a message
to send back. The table on w may have changed one or more times as a result of messages
from other outward neighbors and its own inward neighbor.
Since we are more interested in this marginal than in the Shafer-Shenoy message,
we store it in the separator after we forward its quotient by the old message.
The action of the separator on the inward pass seems different from its action
on the outward pass, but Figure 3.15 shows how to describe it in a way that
makes it similar. Instead of beginning with the separator empty, we begin with
it containing $1_{w\cap x}$, a table of ones. Since In is the same as In/$1_{w\cap x}$, we can
say that here too the separator is sending forward a quotient rather than merely
sending forward the message it receives. Thus we have the uniform action shown
in Figure 3.16; the separator always stores New but sends forward New/Old.
In summary, the Aalborg architecture uses a rooted tree with a separator
between each pair of nodes. Initially, each node x has a table $\varphi_x$, and each
separator has a table of ones. The propagation follows these rules:
Rule 1. Each nonroot node waits to send its message to a given
neighbor until it has received messages from all its other neighbors.
Rule 2. The root waits to send messages to its neighbors until it has
received messages from them all.
Rule 3. When a node is ready to send its message to a particular
neighbor, it computes the message by marginalizing its current table
to its intersection with this neighbor, and then it sends the message
to the separator between it and the neighbor.
Rule 4. When a separator receives a message New from one of its two
nodes, it divides the message by its current table Old, sends the quotient New/Old on to the other node, and then replaces Old with New.
Rule 5. When a node receives a message, it replaces its current table
with the product of that table and the message.
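Rules 1 through 5 can be sketched in code. The following is a minimal illustration, not the book's implementation: it assumes binary variables, strictly positive tables (so ordinary division suffices), tables stored as dictionaries from configurations to numbers, and a tree given by a hypothetical `parent` map. The helper names (`restrict`, `marginalize`, `absorb`, `aalborg`) and the example tree and numbers are all invented for the sketch.

```python
from itertools import product

def restrict(config, from_vars, to_vars):
    """Project a configuration of from_vars onto the subset to_vars."""
    return tuple(config[from_vars.index(v)] for v in to_vars)

def marginalize(variables, table, to_vars):
    """Sum a table on `variables` down to the subset `to_vars`."""
    out = {}
    for config, value in table.items():
        key = restrict(config, variables, to_vars)
        out[key] = out.get(key, 0.0) + value
    return out

def absorb(variables, table, msg_vars, msg):
    """Rule 5: multiply a node's table by a message on a subset of its variables."""
    return {c: v * msg[restrict(c, variables, msg_vars)] for c, v in table.items()}

def aalborg(node_vars, tables, parent, root):
    """Run one inward and one outward pass; return node and separator tables."""
    tables = {x: dict(t) for x, t in tables.items()}
    children = {x: [] for x in node_vars}
    for x, w in parent.items():
        children[w].append(x)
    # One separator per nonroot node x, between x and its inward neighbor.
    sep_vars = {x: tuple(v for v in node_vars[x] if v in node_vars[parent[x]])
                for x in parent}
    # Each separator starts with a table of ones.
    seps = {x: {c: 1.0 for c in product((0, 1), repeat=len(sep_vars[x]))}
            for x in parent}

    def send(sender, receiver, s):
        new = marginalize(node_vars[sender], tables[sender], sep_vars[s])  # Rule 3
        quotient = {c: new[c] / seps[s][c] for c in new}                   # Rule 4
        seps[s] = new                                                      # store New
        tables[receiver] = absorb(node_vars[receiver], tables[receiver],
                                  sep_vars[s], quotient)                   # Rule 5

    def inward(x):
        # A node sends inward only after all its outward neighbors have (Rule 1).
        for child in children[x]:
            inward(child)
            send(child, x, child)

    def outward(x):
        # The root has heard from everyone (Rule 2); messages now flow back out.
        for child in children[x]:
            send(x, child, child)
            outward(child)

    inward(root)
    outward(root)
    return tables, seps

# Hypothetical join tree: root "ab" with outward neighbors "bc" and "bd".
node_vars = {"ab": ("A", "B"), "bc": ("B", "C"), "bd": ("B", "D")}
parent = {"bc": "ab", "bd": "ab"}
tables = {
    "ab": {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0},
    "bc": {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0},
    "bd": {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 3.0},
}
final, seps = aalborg(node_vars, tables, parent, "ab")
```

In this sketch the recursion enforces the waiting conditions of Rules 1 and 2 by construction rather than by having nodes check their inboxes; the final tables can be compared against marginals obtained by multiplying out the full product and summing.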
FIG. 3.15. If we suppose that the separator begins with a table of ones, then the inward
action is the same as the outward.
FIG. 3.16. The uniform action of the Aalborg architecture: When u sends New to its
neighbor v, the message is intercepted by the separator, which divides it by Old and passes the
quotient on.
Rules 1 and 2 force the propagation to move in to the root and then back out.
At the end of the propagation, the tables on all the nodes and separators are
marginals of $\varphi$, where $\varphi = \prod_{x\in\mathcal{N}} \varphi_x$.
Dealing with zeros. We have again been making the simplifying assumption
that there are no negative or zero values in the $\varphi_x$, so that division is always
possible. Now let us relax this to the assumption that there are no negative
values, which is sufficient for continuers to exist.
When zeros are not allowed in the table Old, the quotient New/Old is the
unique solution $\psi$ of the equation $\text{Old}\cdot\psi = \text{New}$. As it turns out, this equation
can still be solved when we allow zeros; the solution is not unique, but it does
not matter what solution we use. So there are two ways we can proceed. We
can stop talking about division and talk instead about solving the equation
$\text{Old}\cdot\psi = \text{New}$. Or we can extend the definition of division by picking out a
particular solution of the equation $\text{Old}\cdot\psi = \text{New}$ and calling it the quotient
New/Old.
We will explore both approaches. First, let us see what happens when we
drop talk about division. Since division appears only in Rule 4, all we need to
do is replace that rule with the following rule:
Rule 4'. When a separator containing Old receives a new message,
say New, it solves the equation
$$\text{Old}\cdot\psi = \text{New} \tag{3.9}$$
for $\psi$ and sends $\psi$ on to its other node. It then discards Old and
stores New in its place.
As the following proposition shows, this works; it is always possible to solve
equation (3.9), and doing so produces the result we want.
PROPOSITION 3.6. If there are no negative values in the initial tables on the
nodes, then propagation under Rules 1, 2, 3, 4', and 5 will result in each node and
separator containing its marginal of $\varphi$.
Proof. Since the propagation proceeds inward just as in the elementary architecture, the root will have its marginal at the end of the inward pass. So we can
prove the proposition by induction on the outward pass. Suppose propagation
to w on the outward pass has resulted in the table $\varphi^{\downarrow w}$ on w, and let us show
that the next step will produce $\varphi^{\downarrow x}$ on w's outward neighbor x. On the inward
pass, x had sent in $m_{x\to w}$, and w now sends back $\varphi^{\downarrow x\cap w}$, or $m_{x\to w} m_{w\to x}$. So
equation (3.9) can be rewritten as
or
Equation (3.11) obviously has a solution, but it may have more than one. We
need to show that any solution will produce the marginal on x when it multiplies
the table now on x. To this end, let $Q_{x\cap w\to x}$ be a Lauritzen-Spiegelhalter continuer for x. The current table on x is $Q_{x\cap w\to x} m_{x\to w}$, so the result of multiplying
it by any solution of equation (3.10) is
In the case at hand, we want to divide one table by another of the same
size, but with an eye to further developments, let us consider a more general
situation, where we want to divide one table by another of the same or possibly
smaller size. Say we want to divide a table B on y by a table A on x, where
$x \subseteq y$. We will show how to do so under the assumption that whenever an entry
in A is zero, everything in the corresponding row in B is zero, i.e.,
or, equivalently,
We will say that A supports B when this condition is met. Given a table A on
x that supports a table B on y, we define a table B/A on y by
Here we have set the value of the quotient equal to zero when the value of
the denominator is zero. Any other number would do just as well for our immediate
purpose, but zero will prove convenient later.
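The extended division just described can be sketched as follows. This is an illustrative reading of the prose definition, assuming tables stored as dictionaries from configuration tuples to numbers; the function names `project`, `supports`, and `extended_divide` and the example tables are hypothetical.

```python
def project(config, from_vars, to_vars):
    """Project a configuration of from_vars onto the subset to_vars."""
    return tuple(config[from_vars.index(v)] for v in to_vars)

def supports(a_vars, a, b_vars, b):
    """True if, wherever A is zero, every corresponding entry of B is zero."""
    return all(value == 0 or a[project(c, b_vars, a_vars)] != 0
               for c, value in b.items())

def extended_divide(b_vars, b, a_vars, a):
    """Divide B on y by A on x (x a subset of y), setting the quotient to 0
    wherever the denominator is 0. Requires that A supports B."""
    if not supports(a_vars, a, b_vars, b):
        raise ValueError("A does not support B")
    quotient = {}
    for c, value in b.items():
        denom = a[project(c, b_vars, a_vars)]
        quotient[c] = value / denom if denom != 0 else 0.0
    return quotient

# Hypothetical example: A on (X,), B on (X, Y); A is zero at X = 1,
# and so is the whole corresponding row of B.
A = {(0,): 2.0, (1,): 0.0}
B = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 0.0}
Q = extended_divide(("X", "Y"), B, ("X",), A)

# Multiplying the quotient back by A recovers B in this example.
recovered = {c: Q[c] * A[project(c, ("X", "Y"), ("X",))] for c in Q}
```

Here `Q` is one particular solution of the equation A times the unknown equals B; on the row where A is zero any value would satisfy the equation, and the sketch follows the text in choosing zero.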
This extended definition of division immediately yields the following lemma.
LEMMA 3.8. If A supports B, then
LEMMA 3.9.
1.
2.
3.
4.
5.
6.
At each step, we change the table on one node and on one separator. The table
on the node is multiplied by New/Old, and the table on the separator is changed
from Old to New, i.e., it also is multiplied by New/Old. Since the table on the
node is multiplied by the same factor as the table on the separator, the ratio
This is the Aalborg formula. In words, the function whose marginals we want is
always the ratio of the product of the tables on the nodes to the product of the
tables on the separators.
The Aalborg formula still holds even if zero entries are allowed in our tables,
but the reasoning with which we established it holds only if we plug a couple of
holes.
First, we must check that $\prod_{s\in\mathcal{S}} U_s$ always supports $\prod_{x\in\mathcal{N}} T_x$, so that the
ratio (3.16) is defined. To check this, we write x(s) for the outward neighbor
of the separator s. Since $U_s$, if it is not equal to $1_s$, is a marginal of $T_{x(s)}$, $U_s$
supports $T_{x(s)}$ (statement 4 of Lemma 3.9). Hence $\prod_{s\in\mathcal{S}} U_s$ supports $\prod_{s\in\mathcal{S}} T_{x(s)}$
(statement 3) and also $T_r \prod_{s\in\mathcal{S}} T_{x(s)}$ (statement 2), which is equal to $\prod_{x\in\mathcal{N}} T_x$.
Second, we must check that multiplying the top and bottom of the ratio (3.16) by New/Old will not change it. This follows from statements 6 and 8
of Lemma 3.9, together with the fact that New/Old supports the numerator. We
know that New/Old supports the numerator because New is a marginal of one
of its factors, and by statement 1 of Lemma 3.9, New/Old supports whatever
New supports.
There is one point of notation that should be clarified in connection with the
Aalborg formula. For simplicity, we have been using a notation that identifies
each node x with a set of variables. We could also identify each separator with a
set of variables; we could say that the separator s between the nodes u and v is
equal to $u \cap v$. It is better, however, to assume that the names of the separators
are distinct from the sets of variables involved, for two or more separators might
involve the same set of variables. (We might have one pair of neighboring nodes
$u_1$ and $v_1$ and another pair $u_2$ and $v_2$ with $u_1 \cap v_1 = u_2 \cap v_2$.) It would burden
our notation unnecessarily for us to introduce distinct symbols for the separator
and its set of variables, but the distinction should be kept in mind, even when,
as will happen shortly, we write as if they are the same.
Loading the separators. Though we have presented the Aalborg architecture
under the assumption that the tables on the separators are initially tables of
ones, this assumption too can be relaxed. Suppose we put nonnegative tables
$T_x$ and $U_s$ on the nodes and separators in such a way that the table on each
separator supports the tables on the neighboring nodes. Then the denominator
in equation (3.16) supports the numerator. If we set the quotient equal to $\varphi$ and
propagate by the Aalborg rules, then we have the following proposition.
PROPOSITION 3.8. At the end of the propagation, the tables on the nodes and
separators will be the corresponding marginals of $\varphi$.
Proof. By statements 5 and 6 of Lemma 3.9,
where x(s) is the outward neighbor of the separator s. This suggests that we
compare propagation with $U_s$ on s and $T_x$ on x to propagation with $1_s$ on s, $T_r$
on r, and $T_{x(s)}/U_s$ on x(s). Call the former the loaded propagation (because the
separators are loaded at the beginning) and the latter the adjusted propagation
(because the tables on the nodes are adjusted). We know that the adjusted
propagation results in the marginals of $\varphi$ on all the nodes and separators; let us
show that the loaded propagation gives the same results.
For the moment, we reserve $T_x$ and $U_s$ for the initial tables in the loaded
propagation; we write $T_x^{\text{loaded}}$ and $U_s^{\text{loaded}}$ for the current tables in the loaded
propagation and $T_x^{\text{adjusted}}$ and $U_s^{\text{adjusted}}$ for the current tables in the adjusted
propagation. Initially,
and
These equations will hold throughout the inward pass, for if they hold before an
inward step, they hold after it. To see this, write $M_{x(s)\to s}$ for the message from
the inward loaded message from x(s) is multiplied by $U_s$ in comparison with the
inward adjusted message. Since this is the new table for s, equation (3.20) will
still hold. But the loaded propagation divides $U_s$ out before sending the message
on to the neighbor w; hence the message multiplied into w is the same in the two
propagations, and the relation between $T_w^{\text{loaded}}$ and $T_w^{\text{adjusted}}$ (equation (3.19) or
(3.21)) will also be unaffected.
Since the root has the same table at the end of the inward pass in the two
propagations, it sends the same messages back out. So we can complete the
proof by induction on the outward pass. We need only show that if the message
from w back out to s is the same in the two propagations, then the table on x(s)
will end up the same. But if we write $M_{w\to s}$ for the message from w back to s,
then the table we get on x(s) in the loaded propagation is
3.6.
The three major architectures we have studied in this chapter (the Shafer-Shenoy, Lauritzen-Spiegelhalter, and Aalborg architectures) move inward in a
tree and then back outward. How should we organize or program this movement? This is a very general question, for many computations are tree recursive.
But we should take a moment to consider it.
We have described each of the three architectures by giving, along with rules
for what the nodes do, rules for when they are allowed to do it. The simplicity
of this description made it convenient for the theoretical understanding we have
been seeking, but at the programming level, it suggests rather expensive control
FIG. 3.17. After COLLECT is called outward from the root, messages move inward.
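The control flow in this caption can be sketched with two recursive procedures. The following is only a guess at the general shape, since the surrounding definitions are not reproduced here: the functions `collect` and `distribute` below are hypothetical stand-ins that record the order of messages instead of computing tables, and the adjacency list is an assumed layout consistent with the node names used earlier in the chapter (1, 12, 23, 124, 1245, 146), with node 1 as root.

```python
def collect(tree, node, caller, log):
    """Ask every neighbor except the caller to collect, then report inward."""
    for neighbor in tree[node]:
        if neighbor != caller:
            collect(tree, neighbor, node, log)
    if caller is not None:
        log.append((node, caller))  # inward message: node -> caller

def distribute(tree, node, caller, log):
    """Send a message to every neighbor except the caller, then recurse."""
    for neighbor in tree[node]:
        if neighbor != caller:
            log.append((node, neighbor))  # outward message: node -> neighbor
            distribute(tree, neighbor, node, log)

# Assumed join tree, given as an adjacency list.
tree = {
    "1": ["12"],
    "12": ["1", "23", "124"],
    "23": ["12"],
    "124": ["12", "1245", "146"],
    "1245": ["124"],
    "146": ["124"],
}

inward_log, outward_log = [], []
collect(tree, "1", None, inward_log)      # calls travel outward, messages inward
distribute(tree, "1", None, outward_log)  # messages travel outward
```

In this sketch the calls to `collect` spread outward from the root, but a node appends its message only after the recursive calls on its outward neighbors have returned, so the logged messages move inward. No node needs to poll for permission to send; the call stack does the scheduling.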
FIG. 3.18.
3.7.
Join-tree propagation may or may not succeed in finding marginals of a particular product of tables. It will not succeed if the belief net is so highly connected that no feasible join-tree cover exists. In this case, we may be able to
use approximate rather than exact methods. Presently, the most widely used
approximate methods are Gibbs sampling and its cousins, methods now collectively called "Markov-chain Monte Carlo." These methods were proposed
for probabilistic expert systems by Pearl [43], but they have been less successful for expert systems than for vision (Geman and Geman [29]) and Bayesian
statistics (Besag et al. [13]). The small or zero conditional probabilities often
encountered in expert systems, where a priori knowledge is stronger, tend to
violate the conditions that allow the Markov-chain methods to converge. A recent candidate to fill the gap left by the weakness of Markov-chain methods for
expert systems is mean-field theory, also borrowed from statistical physics (Saul
et al. [44]).
In this chapter, we have discussed only the problem of finding marginals
of probability distributions given as products of tables. In principle, join-tree
propagation is applicable to finding marginals in any other problem in which the
transitivity and combination axioms are satisfied. (Examples are given in the
exercises.) There are, however, problems in which the axioms are satisfied but
the operations are not feasible. Join-tree propagation depends on marginaliza-
EXERCISE 3.7. What constraints must be imposed on the placement of conditionals in the nodes of a join tree in order for the results of Shafer-Shenoy
computations to remain within the partial semigroup of conditionals? (See Exercise 2.5.) Explore conditions on the existence of continuers that allow the
Lauritzen-Spiegelhalter architecture to work in this context.
EXERCISE 3.8. In some problems, the mathematical objects that one combines can be embedded in a larger class that comes closer to being a group, so
that the division required by the Aalborg architecture is possible. Discuss the
extent to which this is possible in the examples considered in Exercise 1.2.
CHAPTER
4.1.
Meetings.
Software.
CHAPTER 4
The most thorough implementation of the Shafer-Shenoy architecture is Pulcinella. Developed by the IRIDIA research group in Brussels, it handles belief
functions, categorical judgments, and possibility measures as well as probabilities. It is implemented in Common Lisp and is distributed free. Information is
available from IRIDIA's Web site:
http://iridia.ulb.ac.be/pulcinella/
Further information on these and other packages, some commercial and some
free, is available at a Web site maintained by Russell Almond:
http://bayes.stat.washington.edu/almond/belief.html
4.3.
Books.
There are now many excellent books on probabilistic expert systems and related
topics.
[1] Bertele, Umberto, and Francesco Brioschi (1972). Nonserial Dynamic Programming. Academic Press. New York. A readable treatment of join-tree computation for decomposable dynamic programming problems.
[2] Diestel, R. (1990). Graph Decompositions. Clarendon Press. Oxford. A general perspective on decompositions of the type exemplified
by join trees, with hints at the diversity of the applied problems that
inspire these decompositions.
[3] Jensen, Finn V. (1996). An Introduction to Bayesian Networks.
University College Press. London. An engaging and readable introduction to probabilistic networks, with an emphasis on construction
and computation within the Aalborg architecture.
[4] Judd, J. Stephen (1990). Neural Network Design and the Complexity of Learning. MIT Press. Cambridge. This interesting and
readable book demonstrates the relevance of join-tree ideas to the
problem of learning in neural networks.
[5] Lauritzen, Steffen L. (1996). Graphical Models. Oxford University Press. London. A superb treatment of probabilistic networks as
models for data, this book marries probabilistic expert systems with
up-to-date statistical methodology. Relatively comprehensive, it covers undirected as well as directed graphs, and continuous (normal)
as well as discrete probability distributions. Its greatest originality
lies in its treatment of mixed cases: chain graphs, which combine
directed and undirected graphs, and models with both discrete and
continuous variables.
[6] Neapolitan, E. (1990). Probabilistic Reasoning in Expert Systems.
John Wiley. New York. This readable book covered the state of the
Review articles.
discussion). Statistical Science. 10, pp. 1-66. A review of Markov-chain Monte Carlo methods, with an emphasis on Bayesian statistical
problems.
[14] Buntine, Wray (1996). A guide to the literature on learning
graphical models. IEEE Transactions on Knowledge and Data Engineering. An excellent review of the problem of selecting graphical
models for probabilistic expert systems on the basis of data.
[15] Charniak, Eugene (1991). Bayesian networks without tears. AI
Magazine. Winter 1991, pp. 50-63. A nontechnical introduction
to belief nets, especially useful for students with limited interest in
mathematical probability theory.
[16] Dempster, A. P. (1971). An overview of multivariate data analysis. Journal of Multivariate Analysis. 1, pp. 316-346. This classic
article includes a discussion of the limitations of the multivariate
framework, limitations still not overcome in the main body of work
in statistics and probabilistic expert systems.
[17] Neal, Radford M. (1993). Probabilistic inference using Markov
chain Monte Carlo methods. Technical Report. Department of Computer Science. University of Toronto. In contrast to Besag et al., this
review emphasizes probabilistic expert systems.
[18] Rabiner, L. R. (1989). A tutorial on hidden Markov models and
selected applications in speech recognition. Proceedings of the IEEE.
77, pp. 257-286. Still one of the best introductions to hidden Markov
models.
[19] Spiegelhalter, David J., A. Philip Dawid, Steffen L. Lauritzen,
and Robert G. Cowell (1993). Bayesian analysis in expert systems
(with discussion). Statistical Science. 8, pp. 219-283. Currently
the best brief overview of the state of the art of probabilistic expert
systems.
[20] Tatman, J. A., and Ross Shachter (1990). Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man,
and Cybernetics. 20, pp. 365-379. This article reviews influence diagrams, which generalize belief nets by including nodes for decisions,
and shows how dynamic programming can be understood within the
framework of influence diagrams.
[21] Xu, Hong, and Robert Kennes (1994). Steps towards an efficient implementation of Dempster-Shafer theory. Advances in the
Dempster-Shafer Theory of Evidence. R. R. Yager, M. Fedrizzi, and
J. Kacprzyk, eds. John Wiley. New York. Pp. 153-174. This article
reviews various ways of making the Shafer-Shenoy architecture as
efficient as possible for belief functions.
4.5.
Other sources.
This is not a comprehensive bibliography of the very extensive work on probabilistic expert systems, but it contains the articles and dissertations that have
most engaged the author's attention.
[22] Beeri, Catriel, Ronald Fagin, David Maier, and Mihalis Yannakakis (1983). On the desirability of acyclic database schemes.
Journal of the Association for Computing Machinery. 30, pp. 479-513. This very widely cited paper first introduced the idea of a join
tree into the literature on relational databases. It is also responsible
for the name "join tree."
[23] Cano, Jose, Miguel Delgado, and Serafin Moral (1993). An axiomatic framework for propagating uncertainty in directed acyclic
networks. International Journal of Approximate Reasoning. 8, pp.
253-280. This article extends the axioms for join-tree computation,
discussed in Chapter 1 and in Shenoy and Shafer [48], to computation within directed acyclic graphs, in the style developed in Pearl's
Probabilistic Reasoning in Intelligent Systems [8].
[24] Cooper, Gregory F., and Edward Herskovits (1992). A Bayesian
method for the induction of probabilistic networks from data. Machine Learning. 9, pp. 309-347. An influential exposition of a
straightforward Bayesian approach to choosing and parametrizing a
DAG from data for a given set of variables. The method developed
in this article can be contrasted with the non-Bayesian methods developed in Spirtes, Glymour, and Scheines's Causation, Prediction,
and Search [11].
[25] Cowell, Robert G., and A. Philip Dawid (1992). Fast retraction
of evidence in a probabilistic expert system. Statistics and Computing. 2, pp. 37-40. Using out-marginalization (see Exercise 1.4),
this article gives a quick join-tree algorithm for adjusting marginal
probabilities to allow for the omission of previously included observations. The algorithm allows efficient computation of statistics for
monitoring the performance of a belief net.
[26] Cox, David R., and Nanny Wermuth (1993). Linear dependencies
represented by chain graphs (with discussion). Statistical Science. 8,
pp. 204-283. Taking DAGs and chain graphs as a starting point,
this article discusses a wide variety of graphical representations of
multivariate probability distributions.
[27] Dawid, A. Philip (1980). Conditional independence for statistical
operations. Annals of Statistics. 8, pp. 598-617. This pioneering
article studies general properties of conditional independence that
were later studied as axioms by Judea Pearl.
[49] Srivastava, Rajendra P., and Glenn Shafer (1992). Belief-function formulas for audit risk. The Accounting Review. 67, pp.
249-283. This article discusses the propagation of evidence for financial audits, using belief functions rather than probabilities.
[50] Wermuth, Nanny, and Steffen L. Lauritzen (1990). On substantive research hypotheses, conditional independence graphs, and
graphical chain models (with discussion). Journal of the Royal Statistical Society, Series B. 52, pp. 21-50. This wide-ranging article
includes a good introduction to the uses of chain graphs.
[51] Xu, Hong, and Philippe Smets (1996). Reasoning in evidential
networks with conditional belief functions. International Journal of
Approximate Reasoning. 14, pp. 158-185. This article adds a concept
of conditionals to the theory of belief functions and shows how they
can be implemented in join-tree computation.
[52] Zhang, Nevin Lianwen, Runping Qi, and David Poole (1994). A
computational theory of decision networks. International Journal of
Approximate Reasoning. 11, pp. 83-158. This article extends join-tree computation to influence diagrams and even to slightly more
general networks; forgetting is allowed.