
STRUCTURAL CONTROL AND HEALTH MONITORING

Struct. Control Health Monit. 2010; 17:825–847


Published online in Wiley Online Library
(wileyonlinelibrary.com). DOI: 10.1002/stc.424

Bayesian system identification based on probability logic

James L. Beck*,†

Division of Engineering and Applied Science, California Institute of Technology, Mail Code 104-44,
Pasadena, CA 91125, U.S.A.

SUMMARY

Probability logic with Bayesian updating provides a rigorous framework to quantify modeling uncertainty
and perform system identification. It uses probability as a multi-valued propositional logic for plausible
reasoning where the probability of a model is a measure of its relative plausibility within a set of models.
System identification is thus viewed as inference about plausible system models and not as a quixotic quest
for the true model. Instead of using system data to estimate the model parameters, Bayes’ Theorem is used
to update the relative plausibility of each model in a model class, which is a set of input–output probability
models for the system and a probability distribution over this set that expresses the initial plausibility of
each model. Robust predictive analyses informed by the system data use the entire model class with the
probabilistic predictions of each model being weighed by its posterior probability. Additional robustness to
modeling uncertainty comes from combining the robust predictions of each model class in a set of
candidates for the system, where each contribution is weighed by the posterior probability of the model
class. This application of Bayes’ Theorem automatically applies a quantitative Ockham’s razor that
penalizes the data-fit of more complex model classes that extract more information from the data. Robust
analyses involve integrals over parameter spaces that usually must be evaluated numerically by Laplace’s
method of asymptotic approximation or by Markov Chain Monte Carlo methods. An illustrative
application is given using synthetic data corresponding to a structural health monitoring benchmark
structure. Copyright © 2010 John Wiley & Sons, Ltd.

KEY WORDS: system identification; probability logic; Bayesian updating; robust predictions; model
assessment

1. INTRODUCTION
The usual goal of system identification is to use experimental data from a system to improve
mathematical models of its input–output behavior so that they make more accurate predictions
of the system response (output) to a prescribed, or uncertain, excitation (input). A common

*Correspondence to: James L. Beck, Division of Engineering and Applied Science, California Institute of Technology,
Mail Code 104-44, Pasadena, CA 91125, U.S.A.
†E-mail: jimbeck@caltech.edu

Received 2 February 2010
Revised 13 July 2010
Accepted 17 September 2010

approach is to take a parameterized model of a system and then use data from the system to
estimate the value of the uncertain parameter vector in the model. This may be done in several
ways: by maximizing the posterior PDF (probability density function) from Bayes' Theorem to
get the MAP (maximum a posteriori) estimate; by maximizing the likelihood function to get the
MLE (maximum likelihood estimate), which is equivalent to the MAP estimate under a uniform
prior over the parameter space (or some sub-region of it); or by LS (least-squares) output
matching, which is equivalent to the MLE under a joint Gaussian probability model for the
combined prediction and measurement errors, defined as the difference between the
measured and model outputs.
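These equivalences can be checked numerically. The sketch below (an illustrative one-parameter linear model with made-up numbers, not from the paper) confirms on a parameter grid that the LS estimate, the MLE under a Gaussian error model, and the MAP estimate under a uniform prior all coincide:

```python
import numpy as np

# Hypothetical one-parameter model: x = theta * u + e, with e modeled as
# zero-mean Gaussian of known standard deviation sigma (an assumption).
rng = np.random.default_rng(0)
u = np.linspace(0.0, 1.0, 50)
x_meas = 2.0 * u + 0.1 * rng.standard_normal(50)
sigma = 0.1

theta_grid = np.linspace(1.0, 3.0, 2001)

# Least-squares output-matching objective for each candidate theta.
sse = np.array([np.sum((x_meas - th * u) ** 2) for th in theta_grid])

# Under the Gaussian error model, the negative log-likelihood equals
# sse / (2 sigma^2) plus a constant, so its minimizer is the LS minimizer.
neg_log_like = sse / (2.0 * sigma ** 2)

theta_ls = theta_grid[np.argmin(sse)]
theta_mle = theta_grid[np.argmin(neg_log_like)]
# A uniform prior only adds a constant to the negative log-posterior,
# so the MAP estimate coincides with the MLE.
theta_map = theta_grid[np.argmin(neg_log_like)]
```

All three grid estimates land on the same point, near the value 2.0 used to generate the synthetic data.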
There are several conceptual and computational difficulties with parameter estimation:

(1) There are really no true values of the parameters to estimate because any chosen model
gives only an approximation of the real system behavior, which raises questions about
the basis for choosing just one model in the parameterized set of models that
corresponds to the estimated parameter value and using that model to make system
predictions.
(2) The parameter estimate (ML or LS) is often not unique, especially for complex dynamic
systems, and so then there are multiple models with multiple corresponding system
predictions. The common procedure of arbitrarily fixing some of the parameter values so
that the remaining ones can be uniquely estimated may severely bias the predictions.
A more principled approach should be taken that includes all the multiple predictions in
an appropriate way. This issue of non-uniqueness is often referred to as the problem of
identifiability. This has been well studied for modeling of structural dynamic systems in a
probabilistic setting (e.g. [1–7]). For a general system, Beck and Katafygiotis [2,3] define
global system identifiability, local system identifiability and system unidentifiability based
on the data in terms of whether the set of MLEs consists of a single point, discrete points
or a continuum of points in the continuous-valued parameter space, respectively.
Although a MAP estimate is usually unique because of the regularization of the
optimization provided by the prior probability distribution, if the problem is
unidentifiable there are many models around the MAP one that are almost as plausible
and so their predictions, which may differ from those of the MAP model, especially at
unobserved degrees of freedom, should not be lightly dismissed.
(3) No model of the system is expected to give perfect predictions so it is important to
explicitly quantify the uncertain prediction errors. Although the construction of the
likelihood for the data requires the selection of a probability model for the prediction
errors, it is important that this be done in a defensible, principled way.

A rigorous probabilistic framework for system identification that overcomes these difficulties
is available that is based on probability logic (e.g. [2,3,8,9]). In probability logic, the probability
P[b|c] of a proposition (statement) b, given the information in proposition c, is viewed as a
measure of how plausible b is based on c. This is consistent with the Bayesian point of view that
probability represents a degree of belief in a proposition. The axioms of probability logic, which
can be derived by extending binary Boolean propositional logic, provide a calculus for plausible
reasoning in the case of incomplete information [9,10–13]. Kolmogorov’s statement of the
axioms for a probability measure, which are commonly taken as the foundation for applying
probability and which are neutral with respect to the interpretation of probability, can be
viewed as a special case where the propositions refer to uncertain membership of an object in
a set [9].
There are several key ideas in applying probability logic to quantifying modeling uncertainty.
The first is that it gives a meaning for the probability of a model as a measure of the relative
plausibility of the model within a proposed set of models, conditional on specified information.
The second key idea is the fundamental concept of a stochastic system (or Bayesian) model
class. A model class consists of a set of predictive I/O (input–output) probability models for a
system together with a chosen probability distribution, the prior, over this set that quantifies the
initial relative plausibility of each I/O probability model in the set. Any deterministic model of a
system that involves uncertain parameters can be used to construct a model class for the system
by stochastic embedding [9] in which the Principle of Maximum Information Entropy plays an
important role [13,14].
The third key idea is that the fundamental probability models defining a model class are
viewed as representing a state of plausible knowledge about the system conditional on the
available information and not as inherent properties of the system. They quantify uncertainty
when there is missing information, allowing the fruitful application of the probability axioms
without invoking the vague and unprovable belief in 'inherent randomness', which does not
provide an operational meaning for the probability of a model. All probabilistic predictions of
the system I/O behavior are conditional on the chosen fundamental probability models for the
model class.
The fourth key idea is to use the updated posterior probability distribution for the model
class, based on system data and Bayes’ Theorem, to perform a posterior robust predictive
analysis in which the probabilistic predictions of all the models in the model class are used,
weighted by their posterior probability according to the Total Probability Theorem. For
continuous-valued parameters this involves an integral over the parameter space that usually
cannot be evaluated analytically, nor evaluated numerically in a straightforward way if the
number of parameters is not very small. It can be approximated to any desired accuracy by
using samples from the posterior probability distribution that are generated by a Markov Chain
Monte Carlo method, such as the Metropolis-Hastings, Gibbs sampler or Hybrid Monte Carlo
algorithms. Furthermore, a justification for parameter estimation under certain conditions can
be given by applying Laplace’s method of asymptotic approximation to the robust predictive
integral. This shows that parameter estimation can be viewed as giving an approximation to
robust predictive analysis provided the model class is globally identifiable based on the data (i.e.
there is a unique MLE).
Since there is always uncertainty in which model class to choose to represent a system, one
can also choose a set of candidate model classes and calculate their posterior probability based
on the data by applying Bayes’ Theorem at the model class level. This leads to the fifth key idea
of using the Total Probability Theorem to produce posterior hyper-robust predictive analyses
that combine appropriately the robust predictions of each model class in the set of candidates by
weighing them by the posterior probability of the corresponding model class. If any of the
weighed contributions are negligible compared with the others, the corresponding model
classes can be dropped from the list of candidates, thereby giving a rigorous method for model
class selection. Furthermore, an information-theoretic interpretation shows that the posterior
probability of each model class depends on the difference between a measure of the
average data-fit of the model class and the amount of information extracted from the data by
the model class.


2. PROBABILITY LOGIC

2.1. Basic ideas


In probability logic, probability is viewed as a multi-valued conditional logic for plausible
reasoning that extends binary Boolean propositional logic to the case of incomplete
information. The probability P[b|c] is interpreted as the degree of plausibility of proposition b
based on the information in proposition c. We need not know that c is true—it is conditionally
asserted; for example, in system analysis it may state a model for the dynamics of a system but
this does not imply that the model is assumed to be an exact description of the system. If c gives
complete information about b, we know that either c implies that b is true or c implies that b is
false, that is, its negation ¬b (read 'not b') is true. This is the domain of Boolean logic, and in
probability logic these cases are taken as the extreme values of the degree of plausibility:
P[b|c] = 1 if c implies b and P[b|c] = 0 if c implies ¬b. Since the information about b given by c
may be incomplete, the degree of plausibility will, in general, lie between these values, so
P[b|c] ∈ [0,1], and it generalizes 'c implies b' to the case where we are not sure whether b is true,
given c. If the propositions a and b are logically equivalent given c, then they are assigned the
same degree of plausibility because if a is true, then b is true, and if a is false, then b is false.
This interpretation of probability as a measure of relative plausibility of the various
possibilities conditional on available information is not well known in engineering where there is
a wide-spread belief that probability only applies to aleatory uncertainty (inherent randomness
in nature) and not to epistemic uncertainty (missing information). The physicist E.T. Jaynes [13]
noted that the assumption of inherent randomness is an example of what he called the
Mind-Projection Fallacy: our uncertainty is ascribed to an inherent property of nature, or, more
generally, our models of reality are confused with reality. From a pragmatic point of view,
regardless of one’s philosophical position on this issue, all one can say is that there is not
sufficient information to make perfect predictions or deductions. Furthermore, the issue of
whether the probability axioms apply to plausible reasoning when our information is incomplete
was settled rigorously by the physicist R.T. Cox in a seminal paper in 1946 [10].
Cox asked: if one takes this interpretation for P[b|c], what is the appropriate calculus for
evaluating P[¬b|c], P[a & b|c] and P[a or b|c], which correspond, respectively, to the degree of
plausibility based on c that b is not true, that both a and b are true, and that either a or b
(or both) are true. To derive this calculus, Cox [10,11] first postulated that universal
functions C: [0,1] × [0,1] → [0,1] (conjunction function) and N: [0,1] → [0,1] (negation function)
exist such that:

$$P[a \,\&\, b \mid c] = C(P[a \mid b \,\&\, c],\; P[b \mid c])$$
$$P[\neg b \mid c] = N(P[b \mid c])$$

One can readily present arguments to confirm the reasonableness of this postulate in terms of
the implied meanings of these functions (e.g. see [9]). Cox then showed that consistency with Boolean logic
implies that these universal functions must have the form:

$$C(x, y) = f^{-1}(f(x)\, f(y)) \quad \forall x, y \in [0,1]$$
$$N(y) = f^{-1}(1 - f(y)) \quad \forall y \in [0,1]$$

where f: [0,1] → [0,1] is a continuous, strictly increasing function with f(0) = 0, f(1) = 1.


Using De Morgan's Law from Boolean logic, ¬(a or b) is logically equivalent to (¬a & ¬b),
and so a universal disjunction function D: [0,1] × [0,1] → [0,1] can be derived:

$$P[a \text{ or } b \mid c] = D(P[a \mid \neg b \,\&\, c],\; P[b \mid c])$$
$$D(x, y) = N(C(N(x), N(y))) \quad \forall x, y \in [0,1]$$
Because of the arbitrariness of f in Cox’s theorem, there are many calculi for plausible
reasoning. It is not surprising that there is this arbitrariness in quantifying plausibility since
there is no natural scale for it. Our brains are superb at performing plausible reasoning but we
do not have the inherent ability to quantify how much more plausible one possible outcome is
compared to another; rather, our feeling is one of qualitative intensity. However, this
arbitrariness in the plausibility calculus is of no real consequence since the plausibility measures
are all homomorphic and so all of the calculi are equivalent in content to the case where f is
taken as the identity function on [0,1]. Therefore, without loss of generality, we are led to the
minimal set of probability logic axioms:
For any propositions a, b and c,
$$\text{P1: } P[b \mid c] \ge 0; \qquad \text{P2: } P[\neg b \mid c] = 1 - P[b \mid c]$$
$$\text{P3: } P[a \,\&\, b \mid c] = P[a \mid b \,\&\, c]\, P[b \mid c]$$
Notice that these axioms are essentially a theorem for degree of plausibility based on the axioms
of Boolean logic and Cox’s postulate that universal conjunction and negation functions exist.
Axioms P1-P3 imply the following [9]:
$$\text{P4: (a) } P[b \mid b \,\&\, c] = 1; \quad \text{(b) } P[\neg b \mid b \,\&\, c] = 0; \quad \text{(c) } P[b \mid c] \in [0,1]$$
$$\text{P5: (a) } P[a \mid c \,\&\, (a \Rightarrow b)] \le P[b \mid c \,\&\, (a \Rightarrow b)]; \quad \text{(b) } P[a \mid c \,\&\, (a \Leftrightarrow b)] = P[b \mid c \,\&\, (a \Leftrightarrow b)]$$
$$\text{P6: } P[a \text{ or } b \mid c] = P[a \mid c] + P[b \mid c] - P[a \,\&\, b \mid c]$$

P7: If proposition c states that one, and only one, of propositions b_1, ..., b_N is true, then:

(a) Marginalization Theorem: $P[a \mid c] = \sum_{n=1}^{N} P[a \,\&\, b_n \mid c]$

(b) Total Probability Theorem: $P[a \mid c] = \sum_{n=1}^{N} P[a \mid b_n \,\&\, c]\, P[b_n \mid c]$

(c) Bayes' Theorem: for k = 1, ..., N,

$$P[b_k \mid a \,\&\, c] = \frac{P[a \mid b_k \,\&\, c]\, P[b_k \mid c]}{\sum_{n=1}^{N} P[a \mid b_n \,\&\, c]\, P[b_n \mid c]}$$
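As a small numerical sketch of P7 (with made-up plausibilities, purely for illustration), the Total Probability Theorem supplies the denominator in Bayes' Theorem:

```python
# Three mutually exclusive, exhaustive propositions b_1, b_2, b_3 given c.
prior = [0.5, 0.3, 0.2]          # P[b_n | c]
likelihood = [0.1, 0.4, 0.7]     # P[a | b_n & c]

# P7(b): P[a | c] = sum_n P[a | b_n & c] P[b_n | c]
evidence = sum(l * p for l, p in zip(likelihood, prior))

# P7(c): updated plausibility of each b_k once a is known.
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
```

Here b_3, the best 'explainer' of a, gains plausibility (0.2 → 0.14/0.31 ≈ 0.45) while b_1 loses it, and the posterior plausibilities still sum to one.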

2.2. Kolmogorov's axioms for probability measures


Most readers will be more familiar with the probability axioms as stated by Kolmogorov [15] for
a probability measure P(A) on subsets A of a finite set X as:
$$\text{K1: } P(A) \ge 0 \;\; \forall A \subset X; \qquad \text{K2: } P(X) = 1$$
$$\text{K3: } P(A \cup B) = P(A) + P(B) \;\; \forall A, B \subset X \text{ with } A \text{ and } B \text{ disjoint}$$
These axioms in themselves do not imply any particular interpretation of probability; this
must be imposed by other considerations. Furthermore, they can be derived from the
probability logic axioms by giving an interpretation in terms of degree of plausibility.


Let us use the terminology stochastic variable for a variable whose value is uncertain and for
which a probability distribution is used to quantify the relative plausibility of its possible values.
There is no invocation of randomness here; the stochastic variable could be a constant but
unknown parameter in a model. Denote the stochastic variable by x and let X be the set of its
possible values. For any subset A of X, let P(A) denote P[x ∈ A | p], where proposition p states
that x ∈ X and also specifies the probability model for x, which quantifies the relative degree of
plausibility of each value of x in X. Then probability logic axiom P1 implies axiom K1, where
b = 'x ∈ A' and c = p, while axiom K2 follows from P(X) = P[x ∈ X | p] = 1, where the last equation
follows from P4(a) since p states that x ∈ X. Also, axiom K3 can be derived from P6 [9].
If we use the abbreviated notation P(A|B) = P[x ∈ A | x ∈ B & p], then axiom P3 gives:

$$P(A \cap B) = P[x \in A \cap B \mid p] = P[x \in A \,\&\, x \in B \mid p] = P[x \in A \mid x \in B \,\&\, p]\, P[x \in B \mid p] = P(A \mid B)\, P(B)$$

Therefore, if P(B) = P[x ∈ B | p] ≠ 0, then:

$$P(A \mid B) = P(A \cap B)/P(B)$$
In Kolmogorov’s approach this appears as a definition of conditional probability but in the
probability logic approach all probabilities are conditional from the outset and so the
corresponding result appears as axiom P3.
The probability logic axioms therefore provide a calculus for uncertainty quantification of
stochastic variables. In contrast to the relative frequency interpretation of Kolmogorov’s
axioms, the axioms do not need to be restricted to variables that correspond to ‘random’
physical quantities, a restriction that eliminates their application to model parameters and hence
to robust system predictions that account for modeling uncertainty (the relative frequency and
probability logic approaches to statistics have been contrasted in several publications, e.g.
[12,13,16]). The various axioms and derived theorems of probability logic can be expressed in
terms of propositions about stochastic variables x ∈ X, where X is discrete (i.e. it is a countable
set) or X is continuous (i.e. it is a region in R^d), where the latter case is approached using limits of
the former case (e.g. [9,13]). For each such variable, a probability model must first be selected to
describe the relative plausibility of each of its possible values, using, for example, the Principle
of Maximum Information Entropy [13,14]. Then the various probability axioms and derived
theorems can be used to propagate the uncertainty and explore the consequences of the
probability models. Note that the notation P(A), unlike its equivalent P[x ∈ A | p], does not make
it explicit that the probability depends on the choice of the probability model for the possible
values of x ∈ X, which is a probability function in the case that X is discrete and a PDF, denoted
by p, in the case that X is continuous, i.e. p(x | p) dx = P[x ∈ dx | p]. The probability logic approach
involves frequent use of the theorems quoted in P7, where for continuous-valued stochastic
variables the probabilities are replaced by PDFs and the sums are replaced by integrals.

3. STOCHASTIC SYSTEM MODEL CLASSES


3.1. Definition of a model class
In modeling the I/O behavior of a system, one cannot expect any chosen deterministic model to
make perfect predictions and the prediction errors of such a model will be uncertain. This
motivates the introduction of a stochastic system (or Bayesian) model class M that consists of
two fundamental probability models to describe the uncertain I/O behavior of the system: a set
of possible I/O probability models {p(x | u, θ, M) : θ ∈ Θ ⊂ R^{N_p}} and a prior probability model
p(θ | M) dθ that expresses the initial probability of each model p(x | u, θ, M); that is, the prior gives
a measure of the initial relative plausibility of each I/O probability model in the set. Here, u and
x denote the system input and output; for a dynamic system, these vectors consist of discretized
time histories of the excitation and corresponding system response. Each I/O probability model
partially quantifies the uncertainty in the system output by prescribing the relative plausibility of
the possible values of this output conditional on the input u and the parameter vector θ. It can
be constructed from a parameterized deterministic I/O model of the system by stochastic
embedding, as described later in this section.

3.2. Bayesian updating within a model class


Suppose system data D = {û, x̂} is available, consisting of measured output x̂ of the system
and possibly the corresponding system input û. The updated relative plausibility of each I/O
probability model p(x | u, θ, M) in the set defined by a model class M can be quantified by the
posterior PDF p(θ | D, M) for the uncertain model parameters θ ∈ Θ ⊂ R^{N_p}, which specify a
particular I/O model within M. Using Bayes' Theorem:

$$p(\theta \mid D, M) = c^{-1}\, p(D \mid \theta, M)\, p(\theta \mid M) \qquad (1)$$

where $c = p(D \mid M) = \int_{\Theta} p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta$ is the normalizing constant; p(D | θ, M) as a
function of θ is the likelihood function, which expresses the probability of getting data D based
on the PDF p(x | u, θ, M) for the system output given by the model class M; and p(θ | M) is the
prior PDF specified by M, which is chosen to quantify the initial plausibility of each model
defined by the value of the parameter vector θ; for example, it can be chosen to provide
regularization of ill-conditioned inverse problems [17,18]. All probabilities are conditional on
the selected model class M. The constant c = p(D | M) is called the evidence for the model class
given by data D. Although it is a normalizing constant in (1), and so does not affect the shape
of the posterior distribution, it has great importance in model class assessment and averaging, as
described later.

Notice that Bayes' Theorem takes the initial quantification of the plausibility of each model
specified by θ in the model class M, which is expressed by the prior probability distribution, and
updates this plausibility by using the information in the data D expressed through the likelihood
function. Note also that the likelihood function should strictly be denoted by p(x̂ | û, θ, M), but
the notation used in (1) is convenient.
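For a scalar parameter, the update in (1) can be carried out on a grid. The following sketch assumes an illustrative model x = sin(θu) with Gaussian prediction errors; the model, prior, and synthetic data are stand-ins, not the paper's benchmark:

```python
import numpy as np

rng = np.random.default_rng(1)
u_hat = np.linspace(0.0, 1.0, 30)                             # measured input
x_hat = np.sin(2.0 * u_hat) + 0.05 * rng.standard_normal(30)  # measured output
sigma = 0.05                                                  # assumed error std. dev.

theta = np.linspace(1.0, 3.0, 1001)
dtheta = theta[1] - theta[0]

# Gaussian prior p(theta | M), normalized on the grid.
prior = np.exp(-0.5 * ((theta - 2.0) / 0.5) ** 2)
prior /= prior.sum() * dtheta

# Log-likelihood p(D | theta, M) from the Gaussian prediction-error model.
def log_likelihood(th):
    resid = x_hat - np.sin(th * u_hat)
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

log_like = np.array([log_likelihood(th) for th in theta])

# Posterior via Eq. (1); subtracting max(log_like) avoids underflow and
# cancels in the normalization by the (relative) evidence c.
unnorm = np.exp(log_like - log_like.max()) * prior
posterior = unnorm / (unnorm.sum() * dtheta)
```

The grid posterior integrates to one and concentrates near the value θ = 2.0 used to generate the synthetic data.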

3.3. Model class construction by stochastic embedding of deterministic models


Any deterministic model of a system (e.g. a finite-element structural model, a state-space model,
or an ARMAX (auto-regressive moving-average exogenous) model) that defines an implicit or
explicit mathematical relationship q(u,h) between the input u and model output q (where both
are discretized time histories) and that involves uncertain parameters h can be used to construct
a model class M by stochastic embedding. This can be done by introducing the uncertain
prediction-error time history e (e.g. [2,3]), which is the difference between the real system output x


and the model output q for the same input, i.e.

$$x = q + e$$
Note that e provides a bridge between the behaviors of the deterministic model and the real
system. Then a parameterized probability model for e can be introduced by using the Principle
of Maximum Information Entropy [13,14], which states that the probability model should be
selected to produce the most uncertainty (largest Shannon entropy) subject to parameterized
constraints that we wish to impose; the selection of any other probability model would lead to
an unjustified reduction in the amount of prediction uncertainty. The maximum-entropy
probability model is therefore conservative in the sense that it gives the greatest uncertainty in
the prediction-error time history, and hence in the system-output time history, conditional on
what one is willing to assert about the system.
A simple choice for the probability model for e is produced by choosing the following
constraints during entropy maximization: zero prediction-error mean at each time (any
uncertain bias can be added to q as another uncertain parameter), and a parameterized
prediction-error variance or covariance matrix at each time. The maximum entropy PDF for the
prediction error e over an unrestricted range is then discrete-time Gaussian white noise.
Therefore, the predictive PDF for the system output x_i ∈ R^{N_o} at discrete time t_i, conditional on
the parameter vector θ, is given by the following Gaussian PDF with mean equal to the
model output q_i(u, θ) ∈ R^{N_o} and a parameterized covariance matrix Σ(θ) ∈ R^{N_o × N_o}:

$$p(x_i \mid u, \theta, M) = \frac{1}{(2\pi)^{N_o/2}\, |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x_i - q_i)^T \Sigma^{-1} (x_i - q_i)\right] \qquad (2)$$

The predictive PDF for the system output history x over N discrete times is then given by

$$p(x \mid u, \theta, M) = \prod_{i=1}^{N} p(x_i \mid u, \theta, M)$$

The stochastic independence exhibited here, which is inherited from the maximum entropy PDF
for the prediction-error time history e, comes from the fact that no joint moments in time are
imposed during the entropy maximization. It is emphasized that in probability logic this
stochastic independence is part of the probability model that is chosen to represent one’s state of
knowledge of the system and is not thought of as an inherent property of the system. It refers to
information independence and should not be confused with causal independence. It is
equivalent to asserting that if the prediction errors at certain discrete times are given, this does
not influence the plausibility of the prediction-error values at other times.
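In log form, the product structure above means the joint log-likelihood is a sum over time steps. A minimal check, assuming a diagonal covariance Σ = σ²I (illustrative numbers throughout):

```python
import numpy as np

No, N = 2, 100        # output dimension and number of discrete times
sigma = 0.2
rng = np.random.default_rng(2)
q = rng.standard_normal((N, No))               # model output time history
x = q + sigma * rng.standard_normal((N, No))   # stand-in measured output

def log_gaussian(xi, qi):
    # log of Eq. (2) with Sigma = sigma^2 * I, so |Sigma|^(1/2) = sigma^No.
    resid = xi - qi
    return (-0.5 * No * np.log(2.0 * np.pi) - No * np.log(sigma)
            - 0.5 * resid @ resid / sigma ** 2)

# Stochastic independence in time: the joint log-likelihood is a sum.
log_like = sum(log_gaussian(x[i], q[i]) for i in range(N))
```

Summing per-step terms agrees exactly with evaluating the joint Gaussian over the whole stacked time history, which is the practical payoff of the independence inherited from entropy maximization.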
Another choice for stochastic embedding is to introduce prediction errors into the state
vector equation, as well as in the output equation, in a state-space model, again modeled with a
Gaussian PDF based on the Principle of Maximum Information Entropy. This alternative
allows updating of the prediction-error uncertainty at unobserved points in the system, not just
at the measurement points. It is described in more detail later in a state-space model example.
Either choice for the stochastic modeling of the prediction error produces a set of
parameterized I/O probability models {p(x | u, θ, M) : θ ∈ Θ}, where the uncertain parameters θ
now also include those involved in specifying the probability model for e, such as the
prediction-error variance. To complete the specification of the model class M, a prior
distribution p(θ | M) is chosen to express the relative plausibility of each I/O probability model
p(x | u, θ, M) specified by the parameter θ.


3.4. Posterior robust predictive analysis using model classes


A useful application of Bayesian model updating is to make robust predictions about
future events based on past observations. Here, as in modern system theory, the robustness is
with respect to modeling uncertainty within the context of a specified model class. Let D
denote data from available measurements on a system. Based on a selected model class M, all
the probabilistic information for the prediction of a vector of future system responses x is
contained in the posterior robust predictive PDF implied by M and given by the Total
Probability Theorem [6]:

$$p(x \mid D, M) = \int_{\Theta} p(x \mid \theta, D, M)\, p(\theta \mid D, M)\, d\theta \qquad (3)$$

When the discrete response time history x is to be predicted for a specified discrete input time
history u, the first two PDFs in (3) are also conditional on u (it is irrelevant in the third PDF).
The interpretation of (3) is that it is a weighted average of the probabilistic prediction
p(x | θ, D, M) for each model specified by θ ∈ Θ in model class M, where the weight is given by its
posterior probability p(θ | D, M) dθ. If the conditioning on the data D in (3) is dropped so that the
prior p(θ | M) is used instead of the posterior p(θ | D, M), then the integral in (3) gives p(x | M), the
prior robust predictive PDF, which is useful for the robust design of systems. The prior and
posterior robust predictions therefore correspond to a type of integrated global sensitivity
analysis where the probabilistic prediction of each model specified by θ ∈ Θ is considered but
weighed by the relative plausibility of the corresponding model according to the prior or
posterior PDF, respectively.
Many system performance measures can be expressed as the expectation of some performance
function g(x) with respect to the posterior robust predictive PDF as follows:
$$E[g(x) \mid D, M] = \int g(x)\, p(x \mid D, M)\, dx \qquad (4)$$

Some important special cases are:


(1) g(x) = I_F(x), which is equal to 1 if x ∈ F and 0 otherwise, where F is a region in the
response space that corresponds to unsatisfactory system performance; then (4) gives the
posterior robust failure probability P(F | D, M);
(2) g(x) = x; then (4) gives the posterior robust mean response; and
(3) g(x) = (x − E[x | D, M])(x − E[x | D, M])^T; then (4) gives the posterior robust covariance
matrix of x.
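Given posterior parameter samples, all three special cases of (4) reduce to sample averages over predicted responses. A sketch with stand-in Gaussian samples (illustrative numbers, not the paper's example):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in posterior samples of a scalar parameter and a simple
# predictive model x ~ N(theta, sigma^2) for each posterior sample.
theta_samples = 2.0 + 0.1 * rng.standard_normal(5000)
sigma = 0.3
x = theta_samples + sigma * rng.standard_normal(5000)

threshold = 2.5                     # failure region F = {x > threshold}
p_fail = np.mean(x > threshold)     # case (1): g(x) = I_F(x)
mean_x = np.mean(x)                 # case (2): g(x) = x
var_x = np.var(x)                   # case (3): robust (co)variance, scalar here
```

Note that the robust variance combines the parameter uncertainty (0.1²) and the prediction-error uncertainty (0.3²), which is exactly the extra spread a single estimated model would miss.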

Robust predictive analyses require the evaluation of multi-dimensional integrals as in (3) and
(4) that cannot usually be evaluated analytically, nor evaluated numerically in a straightforward
way if the number of parameters is not very small. Useful methods to approximate these
integrals are Laplace’s method of asymptotic approximation and stochastic simulation methods
based on various Markov Chain Monte Carlo algorithms.

3.4.1. Approximation based on stochastic simulation. Stochastic simulation methods (e.g. [19,20])
have become popular for evaluating multi-dimensional integrals of the type occurring in (3).
These methods require samples to be generated from the posterior PDF p(θ | D, M), but
there are several difficulties: (i) the normalizing constant c in Bayes' Theorem (1) is usually
unknown a priori and its evaluation requires a challenging high-dimensional integration
over the model parameter space; and (ii) the high-probability content region of p(θ | D, M)
occupies a much smaller volume than that of the prior PDF, so samples in this region
cannot be generated efficiently by sampling from the prior PDF using direct Monte Carlo
simulation.
Markov Chain Monte Carlo methods can treat these difficulties, such as multi-level
Metropolis-Hastings algorithms with tempering (e.g. [8,21,22]), the Gibbs sampler (e.g. [23,24]), and
Hybrid Monte Carlo (or Hamiltonian Markov Chain) simulation (e.g. [24,25]). In these
methods, the probabilistic information encapsulated in p(θ | D, M) is characterized by the
posterior samples θ^(k), k = 1, 2, ..., K, and the integral in (3) is approximated by:

$$p(x \mid D, M) \approx \frac{1}{K} \sum_{k=1}^{K} p(x \mid \theta^{(k)}, D, M) \qquad (5)$$

Samples of x can then be drawn from each of the p(x | θ^(k), D, M) with equal probability.
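A minimal random-walk Metropolis-Hastings sketch of how the posterior samples θ^(k) in (5) can be generated; the target density and tuning constants below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def log_post(theta):
    # Unnormalized log-posterior; a Gaussian stand-in for p(theta | D, M).
    return -0.5 * ((theta - 2.0) / 0.2) ** 2

samples = []
theta, lp = 0.0, log_post(0.0)
for _ in range(20000):
    prop = theta + 0.3 * rng.standard_normal()  # random-walk proposal
    lp_prop = log_post(prop)
    # Accept with probability min(1, posterior ratio); the unknown
    # normalizing constant c in (1) cancels in the ratio.
    if np.log(rng.random()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta)

posterior_samples = np.asarray(samples[5000:])  # discard burn-in
```

Each prediction p(x | θ^(k), D, M) evaluated at these samples is then averaged with equal weights 1/K as in (5).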

3.4.2. Justification for parameter estimation based on Laplace's asymptotic approximation. Laplace's
method of asymptotic approximation has been used in the past to approximate the robust
predictive integrals in (3) (e.g. [2,3,5,6]). This method requires a non-convex optimization in what is
usually a high-dimensional parameter space, which is computationally challenging, especially when
the model class is not globally identifiable and so there may be multiple global maximizing points.
The importance of Laplace's asymptotic approximation is that it provides a justification for
the common practice of parameter estimation where just one predictive model in the model class
is selected, provided the model class is globally identifiable based on the data and the amount of
data is not too small, because applied to the integral in (3), it gives [2,3]:

    p(x|D,M) ≈ p(x|θ̂, D, M)    (6)

where θ̂ is the MLE or the MAP estimate for the model class based on data D. If the stated
conditions do not apply, then it is better to make system predictions by using stochastic
simulation, as described in the previous subsection, to approximate the full posterior robust
predictive PDF in (3), even though Laplace's asymptotic approximation can be extended to the
locally identifiable case [2,3] and even to the unidentifiable case [5,6], although only for a very
small number of parameters.

4. MODEL CLASS ASSESSMENT AND AVERAGING USING POSTERIOR PROBABILITIES

4.1. Bayesian updating over a discrete set of candidate model classes
The plausibility of proposed candidate model classes for a system can be assessed based on their
posterior probability from Bayes’ Theorem (their probability conditional on the system data D)
[8,26–28]. This gives an extremely powerful method for model class assessment. It automatically
implements a quantitative Ockham’s razor in terms of a trade-off between a data-fit measure
and a complexity measure for each model class, as shown later.


Given a discrete set of candidate model classes M = {M_j : j = 1, 2, ..., N_M}, the
posterior probability P(M_j|D,M) of each model class is given by Bayes' Theorem at the
model class level:

    P(M_j|D,M) = p(D|M_j) P(M_j|M) / p(D|M)    (7)

where P(M_j|M) is the prior probability of each M_j, taken to be 1/N_M if the model classes
are considered equally plausible a priori, and p(D|M_j) is the evidence (sometimes called
marginal likelihood) for M_j provided by the data D; it is given by the Total Probability
Theorem as:

    p(D|M_j) = ∫ p(D|θ, M_j) p(θ|M_j) dθ    (8)

Although θ corresponds to different sets of parameters and can be of different dimension for
different M_j, for simpler presentation a subscript j on θ is not used since explicit conditioning on
M_j indicates which parameter vector θ is involved. Notice that the evidence for M_j given by D is
the normalizing constant in Bayes' Theorem (1). Also, (8) can be interpreted as follows: the
evidence gives the probability of the data according to M_j (if (8) is multiplied by an elemental
volume in the data space) and it is equal to a weighted average of the probability of the data
according to each model specified by M_j, where the weights are given by the prior probability
p(θ|M_j) dθ of each model.
It is important to note that the posterior probability of the model class Mj in (7) is a measure
of its plausibility relative to M, the chosen set of candidate model classes for making system
response predictions; there is no claim, despite statements to the contrary in the model selection
literature, that one of the model classes is assumed to be the ‘correct’ or ‘true’ one, an
implausible assumption to make.
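A minimal sketch of (7) for a discrete set of model classes, assuming the log evidences ln p(D|M_j) have already been computed; the function name and the uniform-prior default are illustrative assumptions. Working in log space with a max-shift avoids underflow for the very large negative log evidences typical of long data records:

```python
import numpy as np

def model_class_posteriors(log_evidences, log_priors=None):
    """Equation (7): P(M_j|D,M) from log evidences ln p(D|M_j).

    The normalizing constant p(D|M) is handled implicitly by shifting
    all log posteriors by their maximum before exponentiating.
    """
    log_ev = np.asarray(log_evidences, dtype=float)
    if log_priors is None:
        # Uniform prior P(M_j|M) = 1/N_M over the candidate model classes.
        log_priors = np.full(log_ev.shape, -np.log(len(log_ev)))
    log_post = log_ev + np.asarray(log_priors, dtype=float)
    log_post -= log_post.max()      # guard against exp underflow/overflow
    post = np.exp(log_post)
    return post / post.sum()

# Log evidences of the scale reported in Table II of the example:
print(model_class_posteriors([-1.5838e4, -2.0315e4]))
```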

4.1.1. Calculation of the evidence for a model class. The computation of the multi-dimensional
integral in (8) for the evidence is nontrivial. Laplace's method of asymptotic approximation can
be used when the model class is globally identifiable based on the available data D (e.g. [26,29]),
which, in effect, utilizes a Gaussian PDF centered on the MLE (or MAP estimate) θ̂ as an
approximation to the posterior PDF p(θ|D,M_j):

    p(D|M_j) ≈ p(D|θ̂, M_j) p(θ̂|M_j) (2π)^{N_j/2} det(H(θ̂))^{−1/2}    (9)

where N_j is the number of model parameters (the dimension of θ) for the model class M_j and
H(θ) is the Hessian matrix of either −ln[p(D|θ, M_j) p(θ|M_j)] or −ln p(D|θ, M_j), depending on
whether the MAP estimate or the MLE is used for θ̂.
When the chosen class of models is unidentifiable based on the available data D, only
stochastic simulation methods are practical, especially Markov Chain Monte Carlo methods
for calculating the model-class evidence such as TMCMC [8,22,27] or the stationarity method
in [28].
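The approximation (9) can be sketched in log form as follows; the function and the Gaussian sanity check are illustrative assumptions (Laplace's method is exact for a Gaussian integrand, which provides a convenient test):

```python
import numpy as np

def laplace_log_evidence(log_joint_fn, theta_hat, hessian):
    """Log form of equation (9):
    ln p(D|M_j) ≈ ln[p(D|theta_hat,M_j) p(theta_hat|M_j)]
                  + (N_j/2) ln(2*pi) - (1/2) ln det H(theta_hat),
    where H is the Hessian of -ln[p(D|theta,M_j) p(theta|M_j)] at the MAP estimate.
    """
    n = len(theta_hat)
    sign, logdet = np.linalg.slogdet(hessian)
    if sign <= 0:
        raise ValueError("Hessian must be positive definite at the MAP estimate")
    return log_joint_fn(theta_hat) + 0.5 * n * np.log(2 * np.pi) - 0.5 * logdet

# Sanity check on a case where Laplace is exact (Gaussian integrand):
# ln[p(D|theta) p(theta)] = -1 - 0.5 theta^T A theta has its MAP at theta = 0.
A = np.diag([2.0, 3.0])
log_joint = lambda th: -1.0 - 0.5 * th @ A @ th
approx = laplace_log_evidence(log_joint, np.zeros(2), A)
exact = -1.0 + np.log(2 * np.pi) - 0.5 * np.log(6.0)
print(approx, exact)
```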


4.1.2. Quantitative Ockham's razor and its information-theoretic interpretation. By using (1), the
log evidence can be expressed as [8,30]:

    ln p(D|M_j) = ∫ ln[ p(D|θ, M_j) p(θ|M_j) / p(θ|D, M_j) ] p(θ|D, M_j) dθ

                = ∫ ln[p(D|θ, M_j)] p(θ|D, M_j) dθ − ∫ ln[ p(θ|D, M_j) / p(θ|M_j) ] p(θ|D, M_j) dθ

                = E[ln p(D|θ, M_j)] − E[ ln( p(θ|D, M_j) / p(θ|M_j) ) ]    (10)
where the expectation is with respect to the posterior p(θ|D, M_j). The first term is the posterior
mean of the log likelihood function, which is a measure of the average data-fit of the model class
M_j, and the second term is the Kullback–Leibler information, or relative entropy [31], which is a
measure of the information gained about the parameters of M_j from the data D and is always
non-negative. This information-theoretic result was first shown by Beck and Yuen [26] for the case of globally
identifiable models where the expectations on the right-hand side of (10) are approximated by
the PDFs evaluated at θ̂. Ching et al. [30] extended it to the general case where the model may be
unidentifiable.
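The decomposition (10) can be checked numerically on a conjugate Gaussian toy problem, where the posterior, and hence both expectations, are available in closed form; the model and all names below are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Conjugate Gaussian toy model: prior theta ~ N(0, tau2);
# data y_i | theta ~ N(theta, sig2), i = 1..N.
tau2, sig2, N = 4.0, 1.0, 50
y = rng.normal(1.5, np.sqrt(sig2), size=N)

# Closed-form posterior theta | D ~ N(m, v):
v = 1.0 / (1.0 / tau2 + N / sig2)
m = v * y.sum() / sig2

def log_like(t):    # ln p(D|theta)
    return -0.5 * N * np.log(2 * np.pi * sig2) - 0.5 * np.sum((y - t) ** 2) / sig2

def log_prior(t):   # ln p(theta)
    return -0.5 * np.log(2 * np.pi * tau2) - 0.5 * t ** 2 / tau2

def log_post(t):    # ln p(theta|D)
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (t - m) ** 2 / v

theta_s = rng.normal(m, np.sqrt(v), size=20000)  # exact posterior samples

data_fit = np.mean([log_like(t) for t in theta_s])                  # E[ln p(D|theta)]
info_gain = np.mean([log_post(t) - log_prior(t) for t in theta_s])  # KL(posterior || prior)
log_ev_from_10 = data_fit - info_gain                               # equation (10)

# Exact log evidence from Bayes' Theorem rearranged at any theta, e.g. theta = m:
log_ev_exact = log_like(m) + log_prior(m) - log_post(m)
print(log_ev_from_10, log_ev_exact)
```

The information-gain term is positive, and the two log-evidence values agree up to Monte Carlo error.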
If the selection of a model class is based purely on the data-fit term in (10), which for
Gaussian prediction errors is often well approximated by the least-squares error between the
model output and corresponding system data, then more complex models will be preferred over
simpler models. This, however, often leads to over-fitting of the data and the subsequent
response predictions may then be unreliable since they depend too much on the details of the
specific data used for the model updating. The importance of (10) is that it shows rigorously,
without introducing ad hoc concepts, that the log evidence for Mj , which controls the posterior
probability of this model class according to (7), explicitly builds in a trade-off between the data-
fit of the model class and its information-theoretic complexity (the amount of information the
model class takes from the data D). Bayesian updating at the model class level therefore has a
built-in penalty against models that are more complex in this sense.
Comparing the posterior probability of each model class therefore provides a quantitative
Principle of Model Parsimony or Ockham’s razor [29,32], the essence of which has long been
advocated qualitatively, that is, simpler models that are reasonably consistent with the data
should be preferred over more complex models that lead to only slightly better agreement with
the data. Jeffreys [33] discussed a simplicity postulate that gives lower prior probabilities to
more complex model classes and showed how to use posterior probabilities to choose between
two model classes that differed by only one additional uncertain parameter. Also, Box and
Jenkins [34] emphasized the selection of parsimonious models in time-series analysis, while
Akaike’s information criterion AIC [35] and the Bayesian information criterion BIC [36]
provide a simple quantitative expression for this purpose (re-scaled here):
    AIC(M_j|D) = ln p(D|θ̂, M_j) − N_j    (11)

    BIC(M_j|D) = ln p(D|θ̂, M_j) − (1/2) N_j ln N    (12)


where N is the number of data-points. However, Muto and Beck [8] and Oh et al. [18] have given
examples that show one must be very cautious in using these simplified criteria for model
selection because they approximate the log evidence in such a way that for fixed data D, the
penalty term for model class complexity depends only on the number of uncertain parameters
Nj, while the correct penalty term in (10) can differ greatly for two model classes with the same
number of uncertain parameters. Rather than using AIC and BIC to assess globally identifiable
model classes, a much better approximation of the log evidence can be calculated from (9).
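A sketch of the re-scaled criteria (11) and (12); the limitation discussed above is visible directly in the code: for a fixed data-fit, the penalty depends only on N_j (and N for BIC), not on how much information the model class actually extracts from the data:

```python
import numpy as np

def aic(log_like_at_mle, n_params):
    """Equation (11), re-scaled: AIC(M_j|D) = ln p(D|theta_hat, M_j) - N_j."""
    return log_like_at_mle - n_params

def bic(log_like_at_mle, n_params, n_data):
    """Equation (12): BIC(M_j|D) = ln p(D|theta_hat, M_j) - (N_j/2) ln N."""
    return log_like_at_mle - 0.5 * n_params * np.log(n_data)

# Two hypothetical model classes with identical data-fit but different N_j:
# the penalty difference is fixed, regardless of how the parameters are used.
print(aic(-100.0, 3), aic(-100.0, 10))
print(bic(-100.0, 3, 2500), bic(-100.0, 10, 2500))
```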
The evidence for a model class M_j may be sensitive to the choice of its prior p(θ|M_j). The use
of an excessively diffuse prior for the parameters should be avoided since it will enforce a strong
preference towards simpler models. In fact, if we consider the case where all of the candidate
model classes have the same predictive I/O probability models but differ in their prior PDF with
respect to the model parameters, then Bayesian model class assessment will give lower posterior
probabilities to those model classes with a more diffuse prior, which can be deduced from (7)
and (10). This provides a mechanism for data-based assessment of the choice of the prior PDF
for the selected set of predictive I/O probability models; for example, the method of automatic
relevance determination from machine learning can be viewed from this perspective because it
chooses an optimal prior over a continuous set of candidate model classes that is parameterized
by prior variances (e.g. [17,18]).

4.2. Hyper-robust predictions using model class averaging

Using M = {M_j : j = 1, 2, ..., N_M} to denote again a discrete set of candidate model classes that
is being considered for a system, all of the probabilistic information for the prediction of a
vector of future system responses x is contained in the posterior hyper-robust predictive PDF
based on M, which is given by the Total Probability Theorem as:

    p(x|D,M) = Σ_{j=1}^{N_M} p(x|D, M_j) P(M_j|D,M)    (13)

where the posterior robust predictive PDF p(x|D, M_j) for each model class M_j, which is given
by (3), is weighted by its posterior probability P(M_j|D,M) from (7). Equation (13) is also called
posterior model averaging in the Bayesian statistics literature (e.g. [37]). If any of the
contributions to the sum in (13) are negligible compared with the others, the corresponding
model classes can be dropped from the list of candidates, thereby giving a rigorous method for
model class selection (e.g. [8,26–28]).
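The averaging in (13) is a posterior-probability-weighted sum; the sketch below applies it to robust failure probabilities, with hypothetical numbers chosen only for illustration:

```python
import numpy as np

def hyper_robust_average(per_class_values, class_posteriors):
    """Equation (13): weight each model class's robust prediction
    by its posterior probability P(M_j|D,M)."""
    w = np.asarray(class_posteriors, dtype=float)
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("class posterior probabilities must sum to 1")
    return float(np.dot(per_class_values, w))

# Hypothetical robust failure probabilities P(F|D,M_j) and class posteriors:
pF = [1.0e-3, 5.0e-2]
post = [0.97, 0.03]
print(hyper_robust_average(pF, post))  # 0.97e-3 + 1.5e-3 = 2.47e-3
```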
Posterior (or prior) robust and hyper-robust predictions are especially important when
calculating the failure probability for the system (that is, the probability of unacceptable
behavior) because for reliable systems these probabilities tend to be very sensitive to the
particular choice of model (e.g. [27]); this sensitivity is alleviated by considering the integrated
robust or hyper-robust failure probabilities.

5. ILLUSTRATIVE EXAMPLE
The theory presented in the previous sections is illustrated using an example taken from the
recent PhD thesis of Cheung [24] which also appeared in [38].


5.1. Linear state-space formulation for model classes

Consider a deterministic state-space model of a linear dynamic system:

    ẋ(t) = A_c(θ_s) x(t) + B_c(θ_s) u(t)
    y(t) = C(θ_s) x(t) + D(θ_s) u(t)    (14)
    x(0) = x_0

For a given system model, A_c, B_c, C and D are specified functions of parameters θ_s. The
corresponding discrete-time state-space model with a time interval Δt is:

    x_n = A(θ_s) x_{n−1} + B(θ_s) u_{n−1},  n ∈ Z^+
    y_n = C(θ_s) x_n + D(θ_s) u_n,  n ∈ {0} ∪ Z^+    (15)

where x_n = x(nΔt) ∈ R^{N_S}, u_n = u(nΔt) ∈ R^{N_I} and y_n = y(nΔt) ∈ R^{N_O} denote the model state, the
observed system input and the model output at time nΔt, respectively. The coefficient matrices
A(θ_s) and B(θ_s) are related to A_c(θ_s) and B_c(θ_s) by choosing:

    A(θ_s) = exp(Δt A_c(θ_s))
    B(θ_s) = A_c^{−1}(θ_s) [A(θ_s) − I] B_c(θ_s)    (16)
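The conversion (16) can be sketched as follows (assuming A_c is invertible, as the formula requires); SciPy's matrix exponential is used for exp(Δt A_c), and the scalar check is an illustrative example:

```python
import numpy as np
from scipy.linalg import expm

def discretize(Ac, Bc, dt):
    """Equation (16): A = exp(dt*Ac), B = Ac^{-1} (A - I) Bc.
    Ac is assumed invertible; solve() is used instead of forming Ac^{-1}."""
    A = expm(dt * Ac)
    B = np.linalg.solve(Ac, (A - np.eye(Ac.shape[0])) @ Bc)
    return A, B

# Check on a scalar system xdot = -2x + u with dt = 0.1:
# A = e^{-0.2} and B = (1 - e^{-0.2})/2.
A, B = discretize(np.array([[-2.0]]), np.array([[1.0]]), 0.1)
print(A, B)
```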

A stochastic system model class may be constructed from the deterministic state-space model by
stochastic embedding, as described in Section 3.3, where the parameters θ_s for the coefficient
matrices in the discrete-time state-space model are treated as uncertain. For the general case of
stochastic embedding, uncertain prediction-error terms w_n and v_n are added to the right-hand
side of both the state and output equations in (15):

    x_n = A(θ_s) x_{n−1} + B(θ_s) u_{n−1} + w_n,  n ∈ Z^+
    y_n = C(θ_s) x_n + D(θ_s) u_n + v_n,  n ∈ {0} ∪ Z^+    (17)

where the w_n and v_n at different times are modeled as independent Gaussian PDFs based on the
Principle of Maximum Information Entropy [13]: w_n ~ N(0, Q(θ_w)), v_n ~ N(0, R(θ_v)), and w_n and
v_n are independent of each other at all times. The covariance matrices Q and R are specified
functions of the uncertain parameters θ_w and θ_v, respectively. If the initial conditions are uncertain,
x_0 can be treated as an uncertain parameter vector. The special case of stochastic embedding
without the state prediction-error w_n is also considered [2,3]:

    x_n = A(θ_s) x_{n−1} + B(θ_s) u_{n−1},  n ∈ Z^+
    y_n = C(θ_s) x_n + D(θ_s) u_n + v_n,  n ∈ {0} ∪ Z^+    (18)

Notice that (17) implies two fundamental system probability models, p(x_n|x_{n−1}, u_{n−1}, θ_s, θ_w)
and p(y_n|x_n, u_n, θ_s, θ_v), which completely define the stochastic model of the system dynamics.
The specification of the prior distribution of the uncertain parameters, θ_s, x_0, θ_w and θ_v, then
completes the definition of the desired model class M.
Let X_n = [x_0^T x_1^T ... x_n^T]^T, U_n = [u_0^T u_1^T ... u_n^T]^T, Y_n = [y_0^T y_1^T ... y_n^T]^T and θ = [θ_s^T x_0^T θ_w^T θ_v^T]^T. Given θ
and the measured system input time history U_N, the predictive PDF for the system output time
history Y_N can be written as follows:

    p(Y_N|θ) = p(y_0|θ) Π_{n=1}^{N} p(y_n|Y_{n−1}, θ)    (19)


Here, for convenience, the conditioning of the PDF on UN and the model class M is left
implicit, although later when there is conditioning on different model classes, it will be
made explicit. The conditional PDF p(y_n|Y_{n−1}, θ) in (19) is a Gaussian PDF with mean
E[y_n|Y_{n−1}, θ] = y_{n|n−1} and covariance matrix Cov[y_n|Y_{n−1}, θ] = S_{n|n−1}, which are given later,
while p(y_0|θ) is a Gaussian PDF with mean E[y_0|θ] = y_{0|−1} and covariance matrix
Cov[y_0|θ] = S_{0|−1}, where:

    y_{0|−1} = C(θ_s) x_0 + D(θ_s) u_0    (20)

    S_{0|−1} = R(θ_v)    (21)

Thus, p(y_0|θ) and p(y_n|Y_{n−1}, θ) are given by:

    p(y_0|θ) = (2π)^{−N_o/2} |S_{0|−1}|^{−1/2} exp[ −(1/2) (y_0 − y_{0|−1})^T S_{0|−1}^{−1} (y_0 − y_{0|−1}) ]    (22)

    p(y_n|Y_{n−1}, θ) = (2π)^{−N_o/2} |S_{n|n−1}|^{−1/2} exp[ −(1/2) (y_n − y_{n|n−1})^T S_{n|n−1}^{−1} (y_n − y_{n|n−1}) ]    (23)

and so p(Y_N|θ) in (19) is given by:

    p(Y_N|θ) = (2π)^{−N_o(N+1)/2} [ Π_{n=0}^{N} |S_{n|n−1}| ]^{−1/2} exp[ −(1/2) Σ_{n=0}^{N} (y_n − y_{n|n−1})^T S_{n|n−1}^{−1} (y_n − y_{n|n−1}) ]    (24)

For a given θ, y_{n|n−1} and S_{n|n−1} can be calculated sequentially from (27) and (28) using the
Kalman filter equations below, which come from Bayesian sequential state updating with
x_{0|0} = x_0 and P_{0|0} equal to the zero matrix:

    x_{n|n−1} = A(θ_s) x_{n−1|n−1} + B(θ_s) u_{n−1}    (25)

    P_{n|n−1} = A(θ_s) P_{n−1|n−1} A(θ_s)^T + Q(θ_w)    (26)

    y_{n|n−1} = C(θ_s) x_{n|n−1} + D(θ_s) u_n    (27)

    S_{n|n−1} = C(θ_s) P_{n|n−1} C(θ_s)^T + R(θ_v)    (28)

    x_{n|n} = x_{n|n−1} + P_{n|n−1} C(θ_s)^T S_{n|n−1}^{−1} (y_n − y_{n|n−1})    (29)

    P_{n|n} = P_{n|n−1} − P_{n|n−1} C(θ_s)^T S_{n|n−1}^{−1} C(θ_s) P_{n|n−1}    (30)

The posterior PDF of θ is then given by (1) where D = Ŷ_N, the measurements for the
system output time history Y_N. The model class resulting from (18) can be viewed as a special
case of (17) where Q(θ_w) is the zero matrix and thus y_{n|n−1} = C(θ_s) x_n + D(θ_s) u_n and S_{n|n−1} = R(θ_v),
with x_n = x_{n|n−1} and x_{n−1} = x_{n−1|n−1} related by (25), and no Kalman filtering needs to be
performed.
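The recursion (25)–(30) together with (24) can be sketched as a single log-likelihood routine; this is an illustrative implementation under the stated initial conditions x_{0|0} = x_0 and P_{0|0} = 0, not the author's code, and the tiny scalar check at the end is a made-up example:

```python
import numpy as np

def kalman_log_likelihood(A, B, C, D, Q, R, u, y, x0):
    """ln p(Y_N|theta) from (24), computed with the one-step-ahead predictors
    (25)-(28) and the updates (29)-(30), with x_{0|0} = x0 and P_{0|0} = 0."""
    n_o = C.shape[0]
    x = np.array(x0, dtype=float)
    P = np.zeros((A.shape[0], A.shape[0]))
    log_like = 0.0
    for n in range(y.shape[0]):           # y holds rows y_0, ..., y_N
        if n == 0:
            y_pred = C @ x + D @ u[0]     # (20): y_{0|-1}
            S = np.array(R, dtype=float)  # (21): S_{0|-1} = R
        else:
            x = A @ x + B @ u[n - 1]      # (25)
            P = A @ P @ A.T + Q           # (26)
            y_pred = C @ x + D @ u[n]     # (27)
            S = C @ P @ C.T + R           # (28)
        e = y[n] - y_pred
        sign, logdet = np.linalg.slogdet(S)
        log_like -= 0.5 * (n_o * np.log(2 * np.pi) + logdet + e @ np.linalg.solve(S, e))
        K = P @ C.T @ np.linalg.inv(S)    # Kalman gain
        x = x + K @ e                     # (29)
        P = P - K @ C @ P                 # (30)
    return log_like

# Tiny check: scalar system, zero input/output, no process noise, unit R,
# so every residual is zero and each of the 3 terms contributes -0.5 ln(2 pi).
A1 = np.array([[0.9]]); B1 = np.array([[1.0]])
C1 = np.array([[1.0]]); D1 = np.array([[0.0]])
Q1 = np.zeros((1, 1)); R1 = np.eye(1)
u1 = np.zeros((3, 1)); y1 = np.zeros((3, 1)); x01 = np.zeros(1)
print(kalman_log_likelihood(A1, B1, C1, D1, Q1, R1, u1, y1, x01))
```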


From (17), it can be shown that:

    y_n = C(θ_s) [ A^n(θ_s) x_0 + Σ_{i=0}^{n−1} A^{n−i−1}(θ_s) B(θ_s) u_i ]
          + D(θ_s) u_n + C(θ_s) Σ_{j=1}^{n} A^{n−j}(θ_s) w_j + v_n    (31)

This equation shows that the model classes resulting from (17) and (18) both have the same mean
predicted output, given θ_s, but for (17), the prediction errors for the system output are accounted
for by the prediction errors in both the state and output vector equations (the last two terms
in (31)). Therefore, measurements of the system output also provide information about the state
prediction errors, which allows more flexibility in treating modeling uncertainties in the response
predictions than when (18) is used to construct the model class; this flexibility is especially useful
for predictions at unobserved degrees of freedom of quantities that are physically different from
the measured quantities. Notice that for the case of (17), given θ, the stochastic system output at
any time is stochastically dependent on the outputs at the other times, as seen in (19), in contrast
to the stochastic independence of the stochastic system output for the case of (18).
It is noted that a regular Kalman filter considers the stochastic state-space model in (17) with
fixed θ_s, θ_w and θ_v, whereas the proposed model class produces a posterior robust Kalman filter
which explicitly treats modeling uncertainties to give more robust predictions of future
responses based on the procedure described in Section 3.4.

5.2. Application to IASC-ASCE benchmark structure model


The previous theory is illustrated by using the benchmark structure model from the IASC-
ASCE Structural Health Monitoring Task Group. It is a 4-story, 2-bay by 2-bay steel frame
structure shown schematically in Figure 1 (from [39]). Synthetic dynamic data are used based on
the published structural model [39], consisting of 10 s (with a sample interval Δt of 0.004 s) of
the horizontal acceleration in the weak (y) direction, ÿ_{a,j}, ÿ_{b,j}, j = 1,...,4, of each floor on the
east and west frames, respectively (Figure 1); the data correspond to input dynamic excitations
w_j at each floor in the y direction and are contaminated by Gaussian white noise at a noise
level of 10% of the maximum over floors of the RMS acceleration responses. The data-generation
model is a 120-DOF (degree-of-freedom) three-dimensional finite element model of a real
benchmark laboratory structure subject to simulated wind excitations generated by Gaussian
white noise processes passed through a 6th-order low-pass Butterworth filter with a 100 Hz
cutoff [39]. The number of observed degrees of freedom is N_o = 4 and N = 2500 is the length of
the discrete time history data.
Here we consider a set M = {M_i : i = 1, 2} of two candidate model classes, with M_1
corresponding to the model class derived from (17) and M_2 corresponding to the one derived
from (18), so both model classes come from stochastic embedding of the same deterministic
state-space model, thereby allowing an investigation of the effect of adding the prediction errors
in the state vector equation.

5.2.1. Model class M1 . The deterministic dynamic model consists of a 4-DOF linear shear
building model for motion in the y direction with classical damping for the four modes. This
simple model will produce significant errors in the prediction of the system response since the


Figure 1. Schematic diagram showing the directions of system output measurements and input excitations [39].

data are generated from a more complicated model. The system is assumed to start at rest, so
x_0 = 0. The covariance matrix for the prediction errors w_n for the state vector equation in (17) is
modeled as a diagonal matrix:

    Q(θ_w) = [ σ²_dis I_{4×4}    0_{4×4}
               0_{4×4}         σ²_vel I_{4×4} ],  so θ_w = [σ²_dis  σ²_vel]^T    (32)

The covariance matrix for the prediction and measurement errors v_n for the output vector
equation is also modeled as a diagonal matrix:

    R(θ_v) = σ²_acc I_{4×4},  so θ_v = [σ²_acc]    (33)
The output prediction error is due to the model not being an exact representation of the system;
in practice, this error would be the dominant contribution to vn since the measurement noise for
modern sensors is usually negligible in comparison with the model prediction error.
There is a total of 15 uncertain model parameters: the lumped mass m_j and stiffness k_j of each
story and the damping ratio ξ_j of each mode, j = 1,...,4, and the variances σ²_dis, σ²_vel and σ²_acc
for the displacement, velocity and acceleration prediction errors. The likelihood function
p(Y_N|θ, M_1) for θ can be obtained using the equations in (24)–(30). The prior PDF for θ is
chosen to be a product of independent distributions. The 12 uncertain parameters θ_s are taken as
dimensionless ratios by dividing the physical parameters m_j, k_j, ξ_j (mass, stiffness and damping
ratio parameters) by their nominal values; these values for k_j and ξ_j are 67.9 MN/m [39] and 1%,
respectively, for j = 1,...,4, and for m_1, m_2, m_3 and m_4 they are 3246, 2652, 2652 and 1809 kg,
respectively. A lognormal distribution is chosen for each of these 12 dimensionless parameters
with medians equal to 1.0 and with corresponding COV (coefficient of variation) for the
dimensionless mass, stiffness and damping parameters of 10, 30 and 50%. For the mass
parameters, smaller values of COV are warranted since these parameters can usually be
determined more precisely from the structural drawings than the other model parameters.
Independent uniform prior distributions are selected for the prediction-error variances σ²_dis,
σ²_vel and σ²_acc over the intervals [0, (σ²_dis)_max], [0, (σ²_vel)_max] and [0, (σ²_acc)_max], respectively, where
(σ²_acc)_max is equal to the square of the maximum over floors of the RMS of the acceleration data;
(σ²_vel)_max is equal to the square of the maximum over floors of the RMS of the 'velocity
data' obtained by numerically integrating the acceleration data using the trapezoidal rule; and
(σ²_dis)_max is equal to the square of the maximum over floors of the RMS of the 'displacement
data' obtained by numerically integrating the acceleration data twice using the trapezoidal rule.
It is well known that the 'velocity data' and 'displacement data' obtained by an integration of
the acceleration data can give a very poor estimate of the system velocity and displacement.
Here, these pseudo 'velocity data' and 'displacement data' are only used to set upper bounds
on σ²_dis and σ²_vel. The selected upper bounds for the three prediction-error variances are
conservative without being excessively so. The data RMS are used here to set the uniform priors
because if very large upper bounds are chosen, it will lead to a very large information gain term
in (10) and so give a strong bias towards the simpler model class.

5.2.2. Model class M_2. This model class is the same as M_1 except that, because Q(θ_w) is zero, the
uncertain variances σ²_dis and σ²_vel are not involved and so θ contains only 13 uncertain model
parameters. Also, the likelihood function is simpler and does not require Kalman filtering.
Let y_n(θ_s) denote the output at time t_n at the four observed degrees of freedom that is predicted
by the 4-DOF shear building model and ŷ_n denote the corresponding measured output.
The combined prediction and measurement errors for the system output equation are given by
v_n = ŷ_n − y_n(θ_s) for n = 0, 1,...,N = 2500, and are modeled as in model class M_1; in particular,
the 4×4 covariance matrix is diagonal with unknown output prediction-error variance σ²_acc.
The likelihood function for this model class is therefore:

    p(Y_N|θ, M_2) = (2π σ²_acc)^{−N_o(N+1)/2} exp[ −(1/(2σ²_acc)) Σ_{n=0}^{N} (ŷ_n − y_n(θ_s))^T (ŷ_n − y_n(θ_s)) ]    (34)

5.2.3. Results for the two model classes. Samples are generated from the posterior PDF
p(θ|D, M_i) by using an MCMC (Markov chain Monte Carlo) method which is a hybrid
algorithm based on TMCMC [22] and Hybrid MC [24,25]. Table I shows the sample posterior
means (outside the parentheses) and COV in % (inside the parentheses) for the normalized
structural parameters θ_s of the underlying deterministic model (four each corresponding to m_j, k_j
and ξ_j, j = 1,...,4, the mass, stiffness and damping ratio parameters, respectively), and the
variance parameters for the covariance matrices of the state and output prediction errors.

Table I. Posterior means and COV of the uncertain parameters.

                                                      M_1                       M_2
Mean normalized mass, stiffness and damping     0.97 (0.5), 0.98 (0.5)    1.12 (0.9), 1.13 (1.0)
ratios (COV in % in parentheses)                0.99 (0.5), 1.07 (0.5)    1.04 (0.9), 1.21 (1.0)
                                                0.76 (0.7), 0.94 (0.6)    0.81 (0.9), 1.10 (1.0)
                                                0.90 (0.7), 0.92 (0.5)    1.03 (0.9), 0.95 (0.9)
                                                1.11 (14.8), 1.42 (6.9)   0.88 (2.7), 0.86 (1.6)
                                                1.89 (4.9), 1.23 (7.2)    0.86 (1.4), 1.40 (2.1)
σ²_dis (m)                                      5.80 × 10⁻¹¹ (3.7)        Not applicable
σ²_vel (m/s)                                    2.26 × 10⁻⁶ (10.1)        Not applicable
σ²_acc (m/s²)                                   0.103 (2.4)               3.26 (1.4)


For both model classes, the posterior COV for the parameters related to the damping ratio are
larger than those related to the mass and stiffness parameters, showing that there is a larger
uncertainty in the damping parameters, as expected.
The exact measurement noise variance for the output is 0.1972 (m/s²)². It can be seen that the
posterior mean of the output prediction-error variance σ²_acc for M_2 is about 16 times the exact
measurement noise variance, or 4 times if we look at the prediction-error standard deviation, in
order to account for modeling errors. This prediction-error standard deviation is about 40% of
the maximum over floors of the RMS of the acceleration data, showing that the models in M_2 have
significant modeling error. It can be seen that the posterior mean of σ²_acc for M_1 is about 52% of
the exact measurement noise variance (about 72% if we look at the prediction-error standard
deviation) and is significantly smaller than that for M_2. The output prediction-error term for
M_1 mostly accounts for the measurement noise while the state prediction-error term mostly
accounts for the modeling errors. The output prediction error for M_2 has to account for both
the measurement noise and the modeling errors and thus its variance is larger than that for M_1.
For both model classes, modeling uncertainties are also accounted for by allowing for
uncertainty in the value of the structural parameters.
The posterior robust failure probability P(F|D, M_i) of the benchmark structure subjected to
future uncertain earthquakes is calculated for M_1 and M_2 for different threshold levels based on
(3) and (4). The non-stationary stochastic model for the future earthquake is taken from [40].
Defining failure as the maximum interstory drift of any story of the structure exceeding some
threshold value, the posterior robust failure probability is calculated using a stochastic
simulation algorithm similar to the extremely efficient ISEE algorithm for linear dynamics [41].
These failure probabilities are readily calculated despite the very high number of uncertain
parameters in the stochastic simulation: 32 528 for M_1 and 12 518 for M_2, corresponding to the
model parameters and the discrete time-histories of the prediction errors and the excitation. The
results are shown in Figure 2 with P(F|D, M_i) on the y-axis for different threshold levels (x-axis)

Figure 2. Posterior robust failure probability for the maximum interstory drift of any story exceeding the
threshold, calculated for model classes M_1 and M_2.


Table II. Results for model class assessment.

                         M_1              M_2
E[ln p(D|θ, M_i)]    −1.5762 × 10⁴    −2.0251 × 10⁴
EIG                   76.12            63.52
ln p(D|M_i)          −1.5838 × 10⁴    −2.0315 × 10⁴
P(M_i|D, M)           1.0000           0.0000

for model classes M1 and M2 . The posterior robust failure probability for M1 is substantially
lower than that calculated for M2 and the difference becomes even more pronounced as the
threshold level increases. These results show that the posterior failure probability is sensitive to
the choice of the stochastic model class and hence to the way that model uncertainties are
treated. Therefore, the posterior hyper-robust failure probability should be calculated using
Bayesian model averaging as in (13), which requires calculating the posterior probabilities of the
two candidate model classes.
The results for each model class using the stationarity method presented in [28] for calculating
the log evidence from posterior samples are shown in Table II, which gives the posterior mean
of the log likelihood function E[ln p(D|θ, M_i)] (data-fit measure), the EIG (expected information gain,
a model class complexity measure given by the last term in (10)), the log evidence ln p(D|M_i), and
the posterior probability P(M_i|D, M). The model class M_1 is much more probable than M_2 based
on the data, implying that it gives a much better balance between the data-fit and the information
gain from the data. Although M_1 has a larger expected information gain (third row), showing that
it extracts more information from the data than M_2, its mean data-fit is so much larger (second
row) that its log evidence is dominant over that of M_2, making M_1 a much more plausible model
class based on the data. From the results in Table II and Figure 2, it can be seen that the term
P(F|D, M_2) P(M_2|D, M) that appears in the posterior hyper-robust failure probability P(F|D, M),
as calculated based on Equation (13), is negligible and so the contribution of M_2 can be dropped
when calculating P(F|D, M); in fact, since the posterior probability P(M_2|D, M) is so much
smaller than P(M_1|D, M), M_2 is highly improbable relative to M_1 conditioned on the data D and so it
can be dropped when making robust predictions of any response of the benchmark structure.
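The entries of Table II can be cross-checked against (10): the log evidence should equal the posterior-mean data-fit minus the expected information gain (both log quantities are large negative numbers for this amount of data). A quick arithmetic check:

```python
# Consistency check of Table II with equation (10):
# ln p(D|M_i) = E[ln p(D|theta, M_i)] - EIG
data_fit = {"M1": -1.5762e4, "M2": -2.0251e4}
eig      = {"M1": 76.12,     "M2": 63.52}
log_ev   = {"M1": -1.5838e4, "M2": -2.0315e4}

for m in ("M1", "M2"):
    # Differences are within the rounding of the tabulated values.
    print(m, data_fit[m] - eig[m], log_ev[m])
```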

6. CONCLUDING REMARKS
Probability logic provides a rigorous unifying framework for treating modeling uncertainty,
along with input uncertainty, when using models to predict the response of a system. This
framework is based solely on the axioms of probability logic which can be derived by extending
binary Boolean logic; they lead to the standard Kolmogorov axioms for a probability measure
as a special case where the propositions refer to uncertain membership of an object in a set.
Within this framework, the probability of a model is a measure of its relative plausibility within
a proposed set of models, conditional on specified information. This allows the probability of
models representing a system to be assessed based on data from the system. A key concept is a
stochastic system (or Bayesian) model class, which consists of a chosen set (usually
parameterized) of probabilistic input–output models for a system together with a chosen prior
probability distribution over this set to quantify the initial relative plausibility of each
input–output model. Any deterministic model of the input–output behavior of a system that
involves uncertain parameters can be used to construct a model class by stochastic embedding.


A model class for a system can be used to produce prior and posterior robust predictions of
the system response which not only incorporate parametric uncertainty (uncertainty about which
model in a proposed set should be used to represent the input–output behavior of the system)
but also non-parametric uncertainty due to the existence of prediction errors because of the
approximate nature of any system model (e.g. from unmodeled dynamics). Prior robust analyses
are of importance in the robust design of systems [42,43] whereas posterior robust analyses can
be used to improve predictive modeling of already operating systems.
A set of candidate model classes can also be considered, leading to prior and posterior hyper-
robust predictions of the system response. In the latter case, the posterior probability of each
model class from Bayes’ Theorem weights its probabilistic contribution to the hyper-robust
predictions and if this weighted contribution is negligible, the model class may be dropped from
the set of candidate model classes; this provides a rigorous method for model class selection. For
the illustrative example, which involves a benchmark structure from the IASC-ASCE Structural
Health Monitoring Task Group, it is shown that the posterior probability of the model class
which includes both state and output prediction errors is overwhelmingly larger than the model
class with only the latter. The calculated posterior robust failure probabilities for the benchmark
structure subjected to a future uncertain earthquake are quite different for these two model
classes even though they have the same underlying deterministic state-space model. Thus, the
posterior robust failure probability is sensitive to how the model uncertainties are introduced,
which emphasizes the importance of model class assessment and averaging when
predicting system response, especially when calculating robust failure probabilities.
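The evidence-based weighting described above can be sketched as follows; the log-evidence values, prior probabilities, and per-class robust predictions are assumed numbers for three hypothetical candidate model classes, purely to show the mechanics of the posterior weighting and of dropping a class whose contribution is negligible.

```python
import numpy as np

# Assumed log evidences p(D|M_j) for three candidate model classes and
# equal prior plausibilities; all values are illustrative.
log_evidence = np.array([-120.3, -118.1, -135.7])
prior_M = np.array([1.0, 1.0, 1.0]) / 3.0

# Bayes' Theorem over model classes (computed in log space for stability)
log_post = np.log(prior_M) + log_evidence
post_M = np.exp(log_post - log_post.max())
post_M /= post_M.sum()

# Hyper-robust prediction: posterior-weighted average of each class's
# robust prediction (assumed values for a failure probability, say).
robust_pred = np.array([0.031, 0.012, 0.055])
hyper_robust = float(np.dot(post_M, robust_pred))

# A class with negligible posterior probability may be dropped from the set
keep = post_M > 1e-3
print(post_M, hyper_robust, keep)
```

The exponential dependence on the log evidence is what makes one class "overwhelmingly" more probable in the example from the paper: a log-evidence gap of a few tens renders the weaker class's weight negligible.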
Robust predictive analyses involve integrals over high-dimensional parameter spaces that
cannot be evaluated in a straightforward way. Useful computational tools for this purpose are
Laplace’s method of asymptotic approximation, which came to prominence around twenty years
ago, and Markov Chain Monte Carlo methods, which have become increasingly popular over
the last decade or so. Laplace's asymptotic approximation of the posterior robust predictive
integral provides a justification for the common practice of parameter estimation. This
approximation corresponds to making response predictions using the predictive model in the
model class given by a parameter estimate (maximum a posteriori, maximum likelihood or least-
squares), but this is only valid if the model class is globally identifiable based on the data and the
amount of data is not too small; otherwise the full posterior robust predictive PDF based on the
entire model class should be approximated using Markov Chain Monte Carlo methods in order
to make more accurate estimates of the robust system predictions.
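As a minimal sketch of the Markov Chain Monte Carlo alternative (a textbook random-walk Metropolis algorithm applied to an assumed one-dimensional problem, not one of the adaptive samplers cited below), the fragment draws samples from the posterior and then estimates robust quantities by averaging over the samples rather than relying on a single parameter estimate.

```python
import numpy as np

# Assumed one-parameter problem: synthetic data, Gaussian prediction errors,
# uniform prior on [0, 3]; values are illustrative only.
rng = np.random.default_rng(2)
data = 1.2 + 0.1 * rng.standard_normal(20)
sigma = 0.1

def log_post(theta):
    """Unnormalized log posterior: uniform prior times Gaussian likelihood."""
    if not (0.0 <= theta <= 3.0):
        return -np.inf
    return -0.5 * np.sum((data - theta) ** 2) / sigma**2

# Random-walk Metropolis sampling of the posterior
samples, theta = [], 1.0
for _ in range(20000):
    prop = theta + 0.05 * rng.standard_normal()       # propose a move
    if np.log(rng.random()) < log_post(prop) - log_post(theta):
        theta = prop                                  # accept
    samples.append(theta)
samples = np.array(samples[5000:])                    # discard burn-in

# Robust quantities are then posterior averages over the samples
print(samples.mean(), samples.std())
```

For a globally identifiable class with ample data, the sample mean and spread here simply recover what Laplace's approximation would give; the value of MCMC is that the same machinery still works when the posterior is multimodal or spread over a manifold.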
Various applications of Markov Chain Monte Carlo methods have been published recently
for robust analysis, identification and health monitoring of structural dynamic systems; for
example, calculating robust reliability [21,27], Bayesian updating of linear structural models for
structural health monitoring using changes in modal parameter estimates [7,23], Bayesian
updating and model class assessment of unidentifiable hysteretic structural models [8] and of
dynamic structural models with a large number of uncertain parameters [25].

ACKNOWLEDGEMENTS

The author dedicates this paper to the memory of Professor George W. Housner. He was a scholar, a leader,
a gentleman, a mentor and a valued colleague at the California Institute of Technology. The author also
wishes to thank Dr Sai-Hung Cheung for permission to include the state-space example from his PhD thesis.

REFERENCES

1. Yuen KV. Bayesian Methods for Structural Dynamics and Civil Engineering. Wiley: New York, NY, 2010.
2. Beck JL, Katafygiotis LS. Updating of a model and its uncertainties utilizing dynamic test data. Proceedings First
International Conference on Computational Stochastic Mechanics. Computational Mechanics Publications: Boston,
1991; 125–136.
3. Beck JL, Katafygiotis LS. Updating models and their uncertainties. I: Bayesian statistical framework. Journal of
Engineering Mechanics 1998; 124:455–461.
4. Katafygiotis LS, Beck JL. Updating models and their uncertainties. II: Model identifiability. Journal of Engineering
Mechanics 1998; 124:463–467.
5. Katafygiotis LS, Lam HF. Tangential-projection algorithm for manifold representation in unidentifiable model
updating problems. Earthquake Engineering and Structural Dynamics 2002; 31:791–812.
6. Papadimitriou C, Beck JL, Katafygiotis LS. Updating robust reliability using structural test data. Probabilistic
Engineering Mechanics 2001; 16:103–113.
7. Yuen KV, Au SK, Beck JL. Structural damage detection and assessment using adaptive Markov Chain Monte
Carlo simulation. Journal of Structural Control and Health Monitoring 2004; 11:327–347.
8. Muto M, Beck JL. Bayesian updating and model class selection using stochastic simulation. Journal of Vibration
and Control 2008; 14:7–34.
9. Beck JL. Probability logic, information quantification and robust predictive system analysis. Technical Report
EERL 2008-05, Earthquake Engineering Research Laboratory, California Institute of Technology, Pasadena,
California, 2008.
10. Cox RT. Probability, frequency and reasonable expectation. American Journal of Physics 1946; 14:1–13.
11. Cox RT. The Algebra of Probable Inference. Johns Hopkins Press: Baltimore, MD, 1961.
12. Jaynes ET. In Papers on Probability, Statistics and Statistical Physics, Rosenkrantz RD (ed.). D. Reidel Publishing:
Dordrecht, Holland, 1983.
13. Jaynes ET. Probability Theory: The Logic of Science. Cambridge University Press: Cambridge, U.K., 2003.
14. Jaynes ET. Information theory and statistical mechanics. Physical Review 1957; 106:620–630.
15. Kolmogorov AN. Foundations of the Theory of Probability. Chelsea Publishing: New York, 1950 (Translation of
1933 original in German).
16. Loredo T. From Laplace to Supernova SN 1987A: Bayesian inference in astrophysics. In Maximum Entropy and
Bayesian Methods, Fougere PF (ed.). Kluwer Academic Publishers: Dordrecht, Holland, 1990; 81–142.
17. Bishop CM. Pattern Recognition and Machine Learning. Springer: New York, 2006.
18. Oh CK, Beck JL, Yamada M. Bayesian learning using automatic relevance determination prior with an application
to earthquake early warning. Journal of Engineering Mechanics 2008; 134:1013–1020.
19. Neal RM. Probabilistic inference using Markov Chain Monte Carlo methods. Technical Report CRG-TR-93-1,
Department of Computer Science, University of Toronto, Toronto, Canada, 1993.
20. Robert CP, Casella G. Monte Carlo Statistical Methods. Springer: New York, 2004.
21. Beck JL, Au SK. Bayesian updating of structural models and reliability using Markov Chain Monte Carlo
simulation. Journal of Engineering Mechanics 2002; 128:380–391.
22. Ching J, Chen YC. Transitional Markov Chain Monte Carlo method for Bayesian updating, model class selection,
and model averaging. Journal of Engineering Mechanics 2007; 133:816–832.
23. Ching J, Muto M, Beck JL. Structural model updating and health monitoring with incomplete modal data using
Gibbs Sampler. Computer-Aided Civil and Infrastructure Engineering 2006; 21:242–257.
24. Cheung SH. Stochastic analysis, model and reliability updating of complex systems with applications to structural
dynamics. Ph.D. Thesis in Civil Engineering, California Institute of Technology, Pasadena, California, 2009.
25. Cheung SH, Beck JL. Bayesian model updating using Hybrid Monte Carlo simulation with application to
structural dynamic models with many uncertain parameters. Journal of Engineering Mechanics 2009; 135:243–255.
26. Beck JL, Yuen KV. Model selection using response measurements: a Bayesian probabilistic approach. Journal of
Engineering Mechanics 2004; 130:192–203.
27. Cheung SH, Beck JL. New Bayesian updating methodology for model validation and robust predictions based on
data from hierarchical subsystem tests. Technical Report EERL 2008-04, Earthquake Engineering Research
Laboratory, California Institute of Technology, Pasadena, California, 2008.
28. Cheung SH, Beck JL. Calculation of the posterior probability for Bayesian model class assessment and averaging
from posterior samples based on dynamic system data. Computer-Aided Civil and Infrastructure Engineering 2010;
25:304–321.
29. MacKay DJC. Bayesian methods for adaptive models. Ph.D. Thesis in Computation and Neural Systems, California
Institute of Technology, Pasadena, California, 1992.
30. Cover TM, Thomas JA. Elements of Information Theory. Wiley-Interscience: Hoboken, NJ, 2006.
31. Ching J, Muto M, Beck JL. Bayesian linear structural model updating using Gibbs sampler with modal data.
Proceedings of the 9th International Conference on Structural Safety and Reliability, Rome, Italy, 2005.
32. Gull SF. Bayesian inductive inference and maximum entropy. In Maximum Entropy and Bayesian Methods,
Skilling J (ed.). Kluwer Academic Publishers: Dordrecht, Holland, 1989; 53–74.
33. Jeffreys H. Theory of Probability (3rd edn). Oxford University Press: Oxford, U.K., 1961.
34. Box GEP, Jenkins GM. Time Series Analysis: Forecasting and Control. Holden-Day: San Francisco, CA, 1970.
35. Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control 1974;
19:716–723.
36. Schwarz G. Estimating the dimension of a model. The Annals of Statistics 1978; 6:461–464.
37. Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial (with Discussion).
Statistical Science 1999; 14:382–417.
38. Cheung SH, Beck JL. Comparison of different model classes for Bayesian updating and robust predictions using
stochastic state-space system models. Proceedings of the 10th International Conference on Structural Safety and
Reliability, Osaka, Japan, 2009.
39. Johnson EA, Lam HF, Katafygiotis LS, Beck JL. Phase I IASC-ASCE structural health monitoring benchmark
problem using simulated data. Journal of Engineering Mechanics 2004; 130:3–15.
40. Schueller GI, Pradlwarter HJ. Benchmark study on reliability estimation in higher dimensions of structural
systems-an overview. Structural Safety 2007; 29:167–182.
41. Au SK, Beck JL. First excursion probabilities for linear systems by very efficient importance sampling. Probabilistic
Engineering Mechanics 2001; 16:193–207.
42. Taflanidis AA, Beck JL. An efficient framework for optimal robust stochastic system design using stochastic
simulation. Computer Methods in Applied Mechanics and Engineering 2008; 198:88–101.
43. Schueller GI, Jensen HA. Computational methods in optimization considering uncertainties—An overview.
Computer Methods in Applied Mechanics and Engineering 2008; 198:2–13.
