Differential Geometry in Statistical Inference
S.-i. Amari, O. E. Barndorff-Nielsen, R. E. Kass, S. L. Lauritzen, and C. R. Rao
Volume 10
TABLE OF CONTENTS
CHAPTER 1. Introduction
Robert E. Kass
CHAPTER 1.
INTRODUCTION
Robert E. Kass*
recovered by conditioning on the second and higher derivatives of the loglikelihood function, evaluated at the MLE. Concerning information loss, recall
that according to the Koopman-Darmois theorem, under regularity conditions, the
families of continuous distributions with fixed support that admit finite-dimensional sufficient reductions of i.i.d. sequences are precisely the exponential families. It is thus intuitive that (for such regular families) departures
from sufficiency, that is, information loss, should correspond to deviations
from exponentiality. The remarkable reality is that the correspondence takes a
beautifully simple form. The most transparent case, especially for the untrained eye, occurs for a one-parameter subfamily of a two-dimensional exponential
family. There, the relative information loss, in Fisher's sense, from using a
statistic T in place of the whole sample is
\[
\lim_{n\to\infty}\, i(\theta)^{-1}\bigl[\,n\,i(\theta) - i_T(\theta)\,\bigr] \;=\; \gamma^2 + \tfrac12\,\gamma_3^2 \tag{1}
\]
where n i(θ) is the Fisher information in the whole sample, i_T(θ) is the Fisher information calculated from the distribution of T, γ is the statistical curvature of the family, and γ₃ is the mixture curvature of the "ancillary family" associated with the estimator T. When the estimator T is the MLE, γ₃ vanishes; this substantiates Fisher's first claim.
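To make (1) concrete, here is a small numerical sketch (mine, not Efron's or Amari's). It estimates γ² for an illustrative one-parameter family x ~ N(θ, 1 + θ²) by Monte Carlo, using Efron's moment formula γ² = (ν₂₀ν₀₂ − ν₁₁²)/ν₂₀³ with ν₂₀ = Var ℓ̇, ν₁₁ = Cov(ℓ̇, ℓ̈), ν₀₂ = Var ℓ̈; the model and all identifiers are assumptions of the sketch, not anything from the papers discussed.

```python
import numpy as np

# Illustrative one-parameter curved family: x ~ N(theta, 1 + theta^2).
# (A hypothetical example; any smooth non-exponential subfamily would do.)
def loglik(x, theta):
    s2 = 1.0 + theta**2
    return -0.5 * np.log(2 * np.pi * s2) - 0.5 * (x - theta)**2 / s2

def curvature_sq(theta, n=400_000, h=1e-4, seed=0):
    """Monte Carlo estimate of Efron's statistical curvature gamma^2,
    gamma^2 = (nu20*nu02 - nu11^2) / nu20^3, with
    nu20 = Var(l'), nu11 = Cov(l', l''), nu02 = Var(l'')."""
    rng = np.random.default_rng(seed)
    x = theta + np.sqrt(1 + theta**2) * rng.standard_normal(n)
    # central finite differences for l' and l'' in theta
    l0, lp, lm = (loglik(x, t) for t in (theta, theta + h, theta - h))
    ld = (lp - lm) / (2 * h)          # l-dot
    ldd = (lp - 2 * l0 + lm) / h**2   # l-double-dot
    nu20 = np.var(ld)                 # Fisher information i(theta)
    nu11 = np.cov(ld, ldd)[0, 1]
    nu02 = np.var(ldd)
    return (nu20 * nu02 - nu11**2) / nu20**3

print(curvature_sq(1.0))  # gamma^2 > 0: the family is genuinely curved
```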
In his 1975 paper, Efron derived the two-term expression for information loss (in his equation (10.25)), discussed the geometrical interpretation of the first term, and noted that the second term is zero for the MLE. He defined γ to be the curvature of the curve in the natural parameter space that describes the subfamily, with the inner product defined by Fisher information replacing the usual Euclidean inner product. The definition of γ₃ is exactly analogous to that of γ, with the mean value parameter space used instead of the natural parameter space, but Efron did not recognize this and so did not identify the mixture curvature. He did stress the role of the ancillary family associated with the estimator T (see his Remark 3 of Section 9 and his reply to discussants, p. 1240), and he also noticed a special case of (1) (in his reply, p. 1241). The final simplicity of the complete geometrical version of (1)
appeared in Amari's 1982 Annals paper. There it was derived in the multiparameter case; see equation (4.8) of Amari's paper in this volume.
Prior to Efron's paper, Rao (1961) had introduced definitions of efficiency and second-order efficiency that were intended to classify estimators just as Fisher's definitions did, but using more tractable expressions. This led to the same measure of minimum information loss used by Fisher (corresponding to γ² in equation (1)). Rao (1962) computed the information loss in the case of the multinomial distribution for several different methods of estimation. Rao (1963) then went on to provide a decision-theoretic definition of second-order efficiency of an estimator T, measuring it according to the magnitude of the second-order term in the asymptotic expansion of the bias-corrected version of T. Efron's analysis clarified the relationship between Fisher's definition and Rao's first definition. Efron then provided a decomposition of the second-order variance term in which the right-hand side of (1) appeared, together with a parameterization-dependent third term. The extension to the multiparameter case was derived by Madsen (1979) following the outline of Reeds (1975). It appears here in Amari's paper as Theorem 3.4.
An analytically and conceptually important first step of Efron's
analysis was to begin by considering smooth subfamilies of regular exponential
families, which he called curved exponential families. Analytically, this made
possible rigorous derivations of results, and for this reason such families
were analyzed concurrently by Ghosh and Subramaniam (1974). Conceptually, it
allowed specification of the ancillary families associated with an estimator:
the ancillary family associated with T at t is the set of points y in the sample
space of the full exponential family - equivalently, the mean value parameter
space for the family - for which T(y) = t. The terminology and subsequent
detailed analysis is due to Amari but, as noted above, the importance of the
ancillary family, at once emphasized and obscured by Fisher, was apparent from
Efron's presentation.
The ancillary family is also important in understanding information recovery, which is the reason Amari has chosen to use the modifier "ancillary."
In the discussion of Efron's paper, Pierce (1975) noted another interpretation
of statistical curvature: it furnishes the asymptotic standard deviation of
observed information. More precisely, it is the asymptotic standard deviation of the ratio of observed to expected information. (The result is more general than the rough description here would imply, since it allows for the use of efficient estimators other than the MLE.)
As far as the basic issue of observed versus expected information is concerned, Amari (1982b) used an Edgeworth expansion involving geometrically interpretable terms (as in Amari and Kumon, 1983) to provide a general motivation for using the inverse of observed information as the estimate of the conditional variance of the MLE. See Section 4.4 of the paper here. (In truth, the result is not as strong as it may appear. When we have an approximation v_n(θ) to a variance v(θ) satisfying v(θ) = v_n(θ){1 + O(n^{-1})}, and we use it to estimate v(θ), we substitute v_n(θ̂), where θ̂ is some estimator of θ, and then all we can guarantee in general is a relative error of order n^{-1/2}.) Thus observed information does not in general provide an approximation to the conditional variance of the MLE based on the underlying true value θ having relative error O(n^{-1}), although it does do so whenever expected information is constant, as it is for a location parameter. Similarly, as Skovgaard (1985) points out in his careful consideration of the role of observed information in inference, when estimated cumulants are used in an Edgeworth expansion, the expansion loses its higher-order accuracy as an approximation to the underlying density at the true value. This practical limitation of asymptotics does not affect Bayesian inference, in which observed information furnishes a better approximation to the posterior variance.
Amari also treated the geometry of testing, as had Efron, providing comparisons of several commonly used test procedures with the locally most powerful test. Curved exponential families were introduced, however, for their mathematical and conceptual simplicity rather than their applicability. To extend his one-parameter results, Efron, in his 1975 paper, did two things: he noted that any smooth family could be locally approximated by a curved exponential family, and he provided an explicit formula for statistical curvature in the general case. In Section 5 of his paper, Amari shows how results established for curved exponential families may be extended by constructing an appropriate Hilbert bundle, about which I will say a bit more below. With the Hilbert bundle, Amari provides a geometrical foundation, and generalization, for Efron's suggestion. From it, necessary formulas can be derived.
One reason that the role of the mixture curvature in (1) and in the variance decomposition went unnoticed in Efron's paper was that he had not made the underlying geometrical structure explicit: to calculate statistical curvature at a given value θ₀ of a single parameter in a curved exponential family, Efron used the natural parameter space with the inner product defined by Fisher information at the natural parameter point corresponding to θ₀. In order to calculate the curvature at a new point θ₁, another copy of the natural parameter space with a different inner product (namely, that defined by Fisher information at the natural parameter point corresponding to θ₁) would have to be used. The appropriate gluing together of these spaces into a single structure involves three basic elements: a manifold, a Riemannian metric, and an affine connection. Riemannian geometry involves the study of geometry determined by the metric and its uniquely associated Riemannian connection.
In his discussion of Efron's paper, Dawid (1975) pointed out that Efron had used the Riemannian metric defined by Fisher information, but that he had effectively used a non-Riemannian affine connection, now called the exponential connection, in calculating statistical curvature. Although Dawid did not identify the role of the mixture curvature in (1), he did draw attention to the mixture connection as an alternative to the exponential connection.
Amari endows a general regular family with the metric and connections inherited from the Hilbert bundle. With this he obtains a satisfactory version of the local approximation by a curved exponential family that Efron had suggested.
This pretty construction allows results derived for curved exponential families to be extended to more general regular families, yet it is not
quite the all-encompassing structure one might hope for: the underlying
manifold is still a particular parametric family of densities rather than the
collection of all possible densities on the given sample space. Constructions
for the latter have so far proved too difficult.
In his Annals paper, Amari also noted an interesting relationship between the exponential and mixture connections: they are, in a sense he defined, mutually dual. Furthermore, a one-parameter family of connections, which Amari called the α-connections, may be defined in such a way that for each α the α-connection and the (−α)-connection are mutually dual, while α = 1 and α = −1 correspond to the exponential and mixture connections. See Amari's Theorem 2.1. This family coincides with that introduced by Centsov (1971) for multinomial distributions. When the family of densities on which these connections are defined is an exponential family, the space is flat with respect to the exponential and mixture connections, and the natural parameterization and mean-value parameterization play special roles: they become affine coordinate systems for the two respective connections and are related by a Legendre transformation. The duality in this case can incorporate the convex duality theory of exponential families (see Barndorff-Nielsen, 1978, and also Section 2 of his paper in this volume).
Amari further showed that the point of a curved exponential family that minimizes the α-divergence from a given point in the exponential family parameter space may be found by following the α-geodesic that contains the given point and is perpendicular to the curved family. This generates a new class of minimum α-divergence estimators, the MLE being the minimum (−1)-divergence estimator, an interpretation also discussed by Efron (1978).
As applications of his general methods based on α-connections on Hilbert bundles, Amari treats the problems of combining independent samples (at the end of Section 5), making inferences when the number of nuisance parameters increases with the sample size (in Section 6), and performing spectral estimation in Gaussian time series (in Section 7).
As soon as the α-connections are constructed, a mathematical question arises. On one hand, the α-connections may be considered objects of differential geometry without special reference to their statistical origin. On the other hand, they are not at all arbitrary. They are the simplest one-parameter family of connections based on the first three moments of the score function. What is it about their special form that leads to the many special properties of α-connections (outlined by Amari in Section 2)?
Lauritzen has posed this question and has provided a substantial part of the answer. Given any Riemannian manifold M with metric g there is a unique Riemannian connection ∇. Given a covariant 3-tensor D that is symmetric in its first two arguments and a nonzero number c, a new (symmetric) connection is defined by
\[
\tilde\nabla_X Y = \nabla_X Y + c\,\bar D(X,Y), \tag{2}
\]
where
\[
g(\bar D(X,Y),\, Z) = D(X,Y,Z)
\]
for all vector fields Z. Now, when M is a family of densities and g and D are defined, in terms of an arbitrary parameterization, as
\[
g_{ij} = E[\partial_i \ell\,\partial_j \ell], \qquad D_{ijk} = E[\partial_i \ell\,\partial_j \ell\,\partial_k \ell],
\]
where ℓ is the loglikelihood function, and if c = −α/2, then (2) defines the α-connection.
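In the exponential-family case the ingredients of (2) take a fully explicit form; the following display is a standard computation consistent with the definitions just given (the display itself is mine, not Lauritzen's):

```latex
% For p(x,\theta) = \exp\{\theta^i x_i - \psi(\theta)\} in its natural
% parameterization:
\[
g_{ij} = \partial_i\partial_j\psi(\theta), \qquad
D_{ijk} = \partial_i\partial_j\partial_k\psi(\theta),
\]
% so the Riemannian connection has \Gamma^{(0)}_{ijk} = \tfrac12 D_{ijk},
% and (2) with c = -\alpha/2 gives
\[
\Gamma^{(\alpha)}_{ijk} = \frac{1-\alpha}{2}\,\partial_i\partial_j\partial_k\psi(\theta),
\]
% which vanishes at \alpha = 1 (the exponential connection), confirming
% that the natural parameters are affine coordinates for it.
```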
In this statistical case, D is not only symmetric in its first two arguments, as it must be in (2), it is symmetric in all three. Lauritzen therefore defines an abstract statistical manifold to be a triple (M,g,D) in which M is a smooth m-dimensional manifold, g is a Riemannian metric, and D is a completely symmetric covariant 3-tensor. With this additional symmetry constraint alone, he then proceeds to establish a large number of basic properties, especially those relating to the duality structure Amari described. The treatment is "fully geometrical" or "coordinate-free." This is aesthetically appealing, especially to those who learned linear models in the coordinate-free setting. Lauritzen's primary purpose is to show that the appropriate mathematical object of study is one that is not part of the standard differential geometry, but does have many special features arising from an apparently simple structure. He not only presents the abstract generalities about α-connections on statistical manifolds, he also examines five examples in full detail. The first is the univariate Gaussian model, the second is the inverse Gaussian model, the third is the two-parameter gamma model, and the last two are specially constructed models that display interesting possibilities of the nonstandard geometries of α-connections. In particular, the latter two statistical manifolds are not what Lauritzen calls "conjugate symmetric" and so the sectional curvatures do not determine the Riemann tensor (as they do in Riemannian geometry). He also discusses the construction of geodesic foliations, which are decompositions of the manifold and are important because they generate potentially interesting decompositions of the sample space. At the end of his paper, Lauritzen calls attention to several outstanding problems.
Amari's α-connections, based on the first three moments of the score function, do not furnish the only examples of statistical manifolds. In his contribution to this volume, Barndorff-Nielsen presents another class of examples based instead on certain "observed" rather than expected derivatives of the loglikelihood.
Although the idea of using observed derivatives might occur to any casual listener on being told of Amari's use of expectations, it is not obvious how to implement it. First of all, in order to define an observed information Riemannian metric, one needs a definition of observed information at each point of the parameter space. Apparently one would want to treat each θ as if it were an MLE and then use I(θ). However, I(θ) depends on the whole sample y rather than on θ alone, so this scheme does not yet provide an explicit definition. Barndorff-Nielsen's solution is natural in the context of his research on conditionality: he replaces the sample y with a sufficient pair (θ̂, a) where a is the observed value of an asymptotically ancillary statistic A. This is always possible for curved exponential families, and in more general models A could at least be taken so that (θ̂, A) is asymptotically sufficient. With this replacement, the second component may be held fixed at A = a while θ̂ varies. Writing I(θ̂) = I(θ̂; θ̂, a) thus allows the definition I(θ) = I(θ; θ, a) to be made at each point θ in the parameter space. Using this definition of the Riemannian metric, Barndorff-Nielsen derives the coefficients that determine the Riemannian connection. From the transformation properties of tensors, he then succeeds in finding an analogue of the exponential connection based on a certain mixed third derivative of the loglikelihood function (two derivatives being taken with respect to θ as parameter, one with respect to θ̂ as MLE). In so doing, he defines the tensor D in the statistical manifold and thus arrives at his "observed conditional geometry."
Barndorff-Nielsen's interest in this geometry lies not with analogues of statistical curvature and other expected-geometry constructs, but rather with an alternative derivation, interpretation, and extension of an approximation to the conditional density of the MLE, which had been obtained earlier (in Barndorff-Nielsen and Cox, 1979),
\[
p(\hat\theta \mid a;\, \theta) = c\,|\hat\jmath|^{1/2}\,\frac{L(\theta)}{L(\hat\theta)}. \tag{3}
\]
(See Hinkley, 1978; Fisher actually treated the location-scale case.) The formula
is of great importance both practically, since it provides a way of computing
the conditional density, and philosophically, since it entails the formal
agreement of conditional inference and Bayesian inference using an invariant
prior. Inspection of the derivation indicates that the formula results from
the transformational nature of the location problem, and Barndorff-Nielsen has shown that a version of it (with an additional factor for the volume element) holds for very general transformation models. He has also shown that for non-transformation models a version of the right-hand side of (3), while not exactly equal to the left-hand side, remains a good asymptotic approximation to it. (See also Hinkley, 1980, and McCullagh, 1984a.) In his paper in this
volume, Barndorff-Nielsen reviews these results, shows how the various observed
conditional geometrical quantities are calculated, and then derives his desired
expansion (of a version of the right-hand side of (3)) in terms of the geometrical quantities that correspond to those used by Amari in his expected
geometry expansions. Barndorff-Nielsen devotes substantial attention to transformation models, which may be treated within his framework of observed
conditional geometry.
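In the simplest case the approximation is exact. For the normal location model x₁,...,x_n i.i.d. N(θ,1) (a worked example of mine, not taken from the paper), θ̂ = x̄, there is no ancillary to condition on, and a version of the right-hand side of (3) reproduces the exact N(θ, 1/n) density of θ̂:

```latex
% Normal location model: L(\theta) \propto \exp\{-\tfrac{n}{2}(\bar x - \theta)^2\}
% (up to a factor free of \theta), observed information \hat\jmath = n.
\[
c\,|\hat\jmath|^{1/2}\,\frac{L(\theta)}{L(\hat\theta)}
  = c\,n^{1/2}\exp\Bigl\{-\frac{n}{2}(\hat\theta-\theta)^2\Bigr\}
  = \sqrt{\frac{n}{2\pi}}\;e^{-n(\hat\theta-\theta)^2/2},
\]
% the exact N(\theta, 1/n) density of \hat\theta, with c = (2\pi)^{-1/2}.
```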
** Since Rao's work on second-order efficiency arose in an attempt to understand
Fisher's computation of information loss in estimation, it might appear that
Efron's investigation also began as an attempt to understand Fisher. He has
informed me, however, that he set out to define the curvature of a statistical
model and came later to its use in information loss and second-order efficiency.
Together with sufficiency and efficiency, information loss and recovery form the core of
Fisher's theory of estimation. On the basis of the geometrical results, it is
fair to say that we now know what Fisher was talking about, and that what he
said was true. Here, as in other problems (such as inference with nuisance
parameters, discussed in Amari's section 5, or in nonlinear regression, e.g.,
Bates and Watts, 1980, Cook and Tsai, 1985, Kass, 1984, McCullagh and Cox, 1986),
the geometrical formulation tends to shift the burden of derivation of results
away from proofs, toward definitions. Thus, once the statement of a proposition
is understood, its truth is easier to see and in this there is great simplification. One could make this argument about much abstract mathematical development, but it is particularly appropriate here.
Furthermore, there are reasons to think that future work in this
area could lead to useful results that would otherwise be difficult to obtain.
One important problem that structural research might solve is that of determining useful conditions under which a particular root of the likelihood equation
will actually maximize the likelihood. Global results on foliations might be
very helpful, as might be formulas relating computable characteristics of
statistical manifolds to the behavior of geodesics. The results in these papers
could turn out to play a central role in the solution of this or some other
practical problem of statistical theory. We will have to wait and see. Until
then, readers may enjoy the papers as informative excursions into an intriguing
realm of mathematical statistics.
Acknowledgements
I thank O. E. Barndorff-Nielsen, D. R. Cox, and C. R. Rao for their
comments on an earlier draft. This paper was prepared with support from the
National Science Foundation under Grant No. NSF/DMS - 8503019.
REFERENCES
Amari, S. (1982a). Differential geometry of curved exponential families: curvatures and information loss. Ann. Statist. 10, 357-387.
Amari, S. (1982b). Geometrical theory of asymptotic ancillarity and conditional
inference. Biometrika 69, 1-17.
Amari, S. and Kumon, M. (1983). Differential geometry of Edgeworth expansions
in curved exponential family. Ann. Inst. Statist. Math. 35A, 1-24.
Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York.
Fisher, R. A. (1925). Theory of statistical estimation. Proc. Camb. Phil. Soc. 22, 700-725.
Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proc. Roy. Soc. London Ser. A 144, 285-307.
CHAPTER 2.
DIFFERENTIAL GEOMETRICAL THEORY OF STATISTICS
Shun-ichi Amari
1.
INTRODUCTION
It should, however, be noted that the tangent space of a statistical model at each point carries a natural Riemannian metric given by the Fisher information matrix in the regular case. It represents only a local property of the model, because the tangent space is nothing but a local linearization of the model manifold.
In order to
obtain larger-scale properties, one needs to define mutual relations of the two
different tangent spaces at two neighboring points in the model. This can be
done by defining a one-to-one affine correspondence between two tangent spaces,
which is called an affine connection in differential geometry. By an affine
connection, one can consider local properties around each point beyond the
linear approximation. The curvature of a model can be obtained by the use of
this connection.
The new developments were also shown in the NATO Research Workshop on Differential Geometry in Statistical Inference (see Barndorff-Nielsen (1985) and Lauritzen (1985)). (See also Pfanzagl (1982), Beale (1960), Bates and Watts (1980), etc., for other geometrical work.)
The present article gives not only a compact review of various achievements up to now by the differential geometrical method, most of which have already been published in various journals and in Amari (1985), but also a preview of new results and half-baked ideas in new directions, most of which have not yet been published. Chapter 2 explains the geometrical method and elucidates fundamental geometrical properties of statistical manifolds. Chapter 3 is devoted to the higher-order asymptotic theory of statistical inference, summarizing higher-order characteristics of various estimators and tests in geometrical terms. Chapter 4 discusses a higher-order theory of asymptotic sufficiency and ancillarity from the Fisher information point of view. Refer to Amari (1985) for more detailed explanations in these chapters; Lauritzen (1985) gives a good introduction to modern differential geometry. The remaining Chapters 5, 6, and 7 treat new ideas and developments which are just under construction.
2.
Since the set of all density functions on X is a subset of the L₁ space of functions in x, S is considered to be a subset of the L₁ space. A statistical model S is said to be geometrically regular when it satisfies the regularity conditions A₁ - A₆, and S is regarded as an n-dimensional manifold with a coordinate system θ.
[Figure 1]

It is possible to weaken Condition 1. However, only local properties are treated here, so we assume it for the sake of simplicity. In a later section, we assume one more condition which guarantees the validity of Edgeworth expansions.
Let us denote by ∂_i = ∂/∂θ^i the tangent vector e_i of the i-th coordinate curve θ^i (Fig. 1) at point θ. Then, the n tangent vectors e_i = ∂_i, i = 1,..., n, span the tangent space T_θ at θ, and a vector A ∈ T_θ is written as A = A^i ∂_i, where the A^i are its components. The tangent space is made into a Riemannian space by introducing an inner product. It can be done by defining the inner product g_ij(θ) = <∂_i, ∂_j> of two basis vectors ∂_i and ∂_j at θ. To this end, we represent a vector ∂_i ∈ T_θ by the function ∂_i ℓ(x,θ) in x, where ℓ(x,θ) = log p(x,θ), and define
\[
g_{ij}(\theta) = \langle \partial_i, \partial_j \rangle = E_\theta\bigl[\partial_i \ell(x,\theta)\,\partial_j \ell(x,\theta)\bigr], \tag{2.1}
\]
where E_θ denotes the expectation with respect to p(x,θ). This g_ij is the Fisher information matrix.
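As an illustration of (2.1) (a sketch added here, not part of the original text), the Fisher metric of a two-parameter model can be estimated as the Monte Carlo mean of the outer product of the score; the Gaussian model and all names below are assumptions of the sketch.

```python
import numpy as np

# p(x; theta) with theta = (mu, log sigma); l(x, theta) = log density
def score(x, mu, logsig):
    """Score vector (dl/d mu, dl/d logsig) for N(mu, sigma^2)."""
    sig = np.exp(logsig)
    z = (x - mu) / sig
    return np.stack([z / sig,          # d/d mu
                     z**2 - 1.0])      # d/d logsig

rng = np.random.default_rng(1)
mu, logsig = 0.3, 0.5
x = mu + np.exp(logsig) * rng.standard_normal(200_000)

s = score(x, mu, logsig)              # shape (2, N)
g = s @ s.T / x.size                  # g_ij = E[d_i l d_j l], eq. (2.1)
print(g)   # close to diag(1/sigma^2, 2): the Fisher metric at theta
```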
We next wish to compare the tangent space T_θ at θ with the tangent space T_θ′ at another point θ′. This can be done by comparing the basis vectors ∂_i of T_θ with the basis vectors ∂′_i of T_θ′. Since T_θ and T_θ′ are two different vector spaces, the two vectors ∂_i and ∂′_i are not directly comparable, and we need some way of identifying T_θ with T_θ′ in order to compare the vectors in them. This can be accomplished by introducing an affine connection, which maps the tangent space T_{θ+dθ} at θ + dθ to the tangent space T_θ at θ. The mapping should reduce to the identity map as dθ → 0. Let m(∂′_j) be the image of ∂′_j ∈ T_{θ+dθ} mapped to T_θ. Then
\[
\nabla_{\partial_i}\partial_j = \lim_{d\theta^i \to 0}\,\frac{1}{d\theta^i}\bigl[m(\partial'_j) - \partial_j\bigr] \tag{2.2}
\]
represents the rate at which the j-th basis vector ∂_j "intrinsically" changes as the point moves in the direction of the i-th coordinate.

[Figure 2]
The components Γ_{ijk}(θ) = <∇_{∂_i}∂_j, ∂_k> determine the connection, with Γ^k_{ij} = Γ_{ijl} g^{lk}. Let A(θ) = A^j(θ)∂_j be a vector field, which assigns a vector A(θ) to each point θ. The intrinsic change of the vector A(θ) as the position moves is now given by the covariant derivative in the direction ∂_i, defined by
\[
\nabla_{\partial_i} A = \bigl(\partial_i A^k + A^j\,\Gamma^{\,k}_{ij}\bigr)\,\partial_k,
\]
in which the change in the basis vectors as well as that in the components A^j(θ) is taken into account. The covariant derivative in the direction B = B^i ∂_i is given by
\[
\nabla_B A = B^i\,\nabla_{\partial_i} A.
\]
We have defined the covariant derivative by the use of the basis vectors ∂_i which are associated with the coordinate system or the parameterization θ. However, the covariant derivative ∇_B A is invariant under any parameterization, giving the same result in any coordinate system. This yields the transformation law for the components Γ_{ijk} of a connection. When another coordinate system (parameterization) θ′ = θ′(θ) is used, the basis vectors change from {∂_i} to {∂′_{i′}}, where ∂′_{i′} = B^i_{i′}∂_i and B^i_{i′} = ∂θ^i/∂θ′^{i′} is the Jacobian matrix of the coordinate transformation. Since the components Γ′_{i′j′k′} of the connection are written as Γ′_{i′j′k′} = <∇_{∂′_{i′}}∂′_{j′}, ∂′_{k′}>, they transform as
\[
\Gamma'_{i'j'k'} = B^i_{i'} B^j_{j'} B^k_{k'}\,\Gamma_{ijk} + \bigl(\partial_{i'} B^j_{j'}\bigr) B^k_{k'}\, g_{jk}.
\]
The α-connection on a statistical manifold is defined by
\[
\Gamma^{(\alpha)}_{ijk}(\theta) = E_\theta\Bigl[\Bigl(\partial_i\partial_j \ell + \frac{1-\alpha}{2}\,\partial_i \ell\,\partial_j \ell\Bigr)\,\partial_k \ell\Bigr]. \tag{2.3}
\]
It is easily checked that the connection defined by (2.3) satisfies the transformation law. In particular, the 1-connection is called the exponential connection, and the −1-connection is called the mixture connection.
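Definition (2.3) can be checked symbolically in a simple case (a sketch added here, not part of the original text; the Bernoulli family is an assumed example): in the natural parameter of an exponential family the 1-connection vanishes identically, while the (−1)-connection reduces to the skewness tensor.

```python
import sympy as sp

t = sp.symbols('theta')
psi = sp.log(1 + sp.exp(t))               # Bernoulli cumulant function
logp = {x: x * t - psi for x in (0, 1)}   # l(x, theta) = theta*x - psi

def E(expr_of_x):
    # expectation over x in {0,1} under p(x,theta) = exp(l(x,theta))
    return sp.simplify(sum(sp.exp(logp[x]) * expr_of_x(x) for x in (0, 1)))

def Gamma(alpha):
    # eq. (2.3): E[(l_ij + (1-alpha)/2 * l_i l_j) l_k], one-dimensional here
    li = lambda x: sp.diff(logp[x], t)
    lij = lambda x: sp.diff(logp[x], t, 2)
    return sp.simplify(E(lambda x: (lij(x) + sp.Rational(1, 2) * (1 - alpha)
                                    * li(x)**2) * li(x)))

print(Gamma(1))    # 0: the natural parameter is 1-affine
print(Gamma(-1))   # nonzero: equals E[l' l' l'], the skewness tensor
```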
2.2
Let M = {q(x,u)} be an m-dimensional submanifold of S, where u = (u^a), a = 1,..., m, is a coordinate system of M. The tangent vectors of the coordinate curves of M are
\[
\partial_a = B^i_a(u)\,\partial_i, \qquad B^i_a = \partial\theta^i/\partial u^a,
\]
a = 1,..., m, or B_a = (B^i_a) in the component form with respect to the basis ∂_i of T_{θ(u)}(S). It is convenient to define n − m vectors ∂_κ, κ = m + 1,..., n, in T_{θ(u)}(S) such that the n vectors {∂_a, ∂_κ}, a = 1,..., m; κ = m + 1,..., n, together form a basis of T_{θ(u)}(S) and moreover the ∂_κ's are orthogonal to the ∂_a's (Fig. 3).
The vectors ∂_κ span the orthogonal complement of T_u(M) in T_{θ(u)}(S).

[Figure 3]

The inner products of the basis vectors are
\[
g_{ab}(u) = \langle \partial_a, \partial_b \rangle = B^i_a B^j_b\, g_{ij}, \tag{2.4}
\]
and similarly for g_{aκ} and g_{κλ}.
The basis vector ∂_a may change in its direction as point u moves in M. The change is measured by the α-covariant derivative ∇^{(α)}_{∂_b}∂_a of ∂_a in the direction ∂_b, where the notion of a connection is necessary, because we need to compare two vectors ∂_a and ∂′_a belonging to the different tangent spaces T_{θ(u)}(S) and T_{θ(u′)}(S). The α-covariant derivative ∇^{(α)}_{∂_b}∂_a is calculated in S. When the directions of the tangent space T_u(M) of M do not change as point u moves in M, the manifold M is said to be α-flat in S, where the tangent directions are compared by the α-connection. Otherwise, M is curved in the sense of the α-connection. The α-covariant derivative ∇^{(α)}_{∂_b}∂_a is decomposed into a component lying in T_u(M) and a component orthogonal to it; the orthogonal part defines
\[
H^{(\alpha)}_{ab\kappa}(u) = \langle \nabla^{(\alpha)}_{\partial_a}\partial_b,\; \partial_\kappa \rangle, \tag{2.5}
\]
which is a tensor called the α-curvature of the submanifold M in S. It is usually called the imbedding curvature or Euler-Schouten curvature. This tensor represents how M is curved in S. A tensor is a multi-linear mapping from a number of tangent vectors to the reals. In the present case, for A = A^a ∂_a ∈ T_u(M), B = B^b ∂_b ∈ T_u(M) and C = C^κ ∂_κ belonging to the orthogonal complement of T_u(M), we have the multi-linear mapping
\[
H^{(\alpha)}(A,B,C) = H^{(\alpha)}_{ab\kappa}\, A^a B^b C^\kappa.
\]
This H^{(α)} is the α-curvature tensor, and the H^{(α)}_{abκ} are its components. The submanifold M is α-flat in S when H^{(α)}_{abκ} = 0 holds.
The m × m matrix
\[
\bigl(H^{(\alpha)2}\bigr)_{ab} = H^{(\alpha)}_{ac\kappa}\, H^{(\alpha)}_{bd\lambda}\, g^{cd} g^{\kappa\lambda}
\]
gives the square of the α-curvature, where g^{cd} and g^{κλ} are the inverses of g_{cd} and g_{κλ}, respectively.
Let θ(t) be a curve in S with tangent vector θ̇ = (dθ^i/dt)∂_i, or shortly θ̇ = dθ/dt. When the tangent vector θ̇ does not change along the curve in the sense of the α-connection, the curve is called an α-geodesic. By choosing an appropriate parameter t, an α-geodesic θ(t) satisfies the geodesic equation
\[
\nabla^{(\alpha)}_{\dot\theta}\,\dot\theta = 0,
\]
or in the component form
\[
\ddot\theta^{\,k} + \Gamma^{(\alpha)k}_{ij}\,\dot\theta^i\dot\theta^j = 0. \tag{2.6}
\]
The parallel displacement of a vector in general depends on the path c: θ(t) connecting θ and θ′. When this does not depend on paths, the manifold is said to be flat. It is known that a manifold is flat when, and only when, the Riemann-Christoffel curvature vanishes identically (see textbooks of differential geometry). A statistical manifold S is said to be α-flat when it is flat under the α-connection.
The parallel displacement Π does not in general preserve the inner product, i.e., <ΠA, ΠB> = <A,B> does not necessarily hold. When a manifold has two affine connections with corresponding parallel displacement operators Π and Π* such that
\[
\langle \Pi A,\; \Pi^{*} B \rangle = \langle A, B \rangle \tag{2.7}
\]
holds, the two connections are said to be mutually dual. The two operators Π and Π* are considered to be mutually adjoint. We have the following theorem in this regard (Nagaoka and Amari (1982)).

Theorem 2.1. The α-connection and the (−α)-connection are mutually dual. When S is α-flat, it is also (−α)-flat.
When a manifold S is α-flat, there exists a coordinate system (θ^i) such that
\[
\nabla^{(\alpha)}_{\partial_i}\,\partial_j = 0 \quad\text{or}\quad \Gamma^{(\alpha)k}_{ij}(\theta) = 0
\]
identically holds. In this case, a basis vector ∂_i is the same at any point in the sense that ∂_i ∈ T_θ is mapped to ∂_i ∈ T_{θ′} by the α-parallel displacement irrespective of the path connecting θ and θ′. Since all the coordinate curves are α-geodesics in this case, θ is called an α-affine coordinate system. A linear transformation of an α-affine coordinate system is also α-affine.
We give an example of a 1-flat (i.e., α = 1) manifold S. The density functions of an exponential family S = {p(x,θ)} can be written as
\[
p(x,\theta) = \exp\{\theta^i x_i - \psi(\theta)\}
\]
with respect to an appropriate measure, where θ = (θ^i) is called the natural or canonical parameter. From
\[
\partial_i \ell(x,\theta) = x_i - \partial_i\psi(\theta), \qquad
\partial_i\partial_j \ell(x,\theta) = -\partial_i\partial_j\psi(\theta),
\]
we easily have
\[
g_{ij}(\theta) = \partial_i\partial_j\psi(\theta), \qquad
\Gamma^{(1)}_{ijk}(\theta) = 0, \qquad
T_{ijk}(\theta) = \partial_i\partial_j\partial_k\psi(\theta),
\]
where T_{ijk} = E[∂_iℓ ∂_jℓ ∂_kℓ]. (The 1-connection vanishes because ∂_i∂_jℓ does not depend on x and E[∂_kℓ] = 0.) Hence an exponential family is 1-flat, and the natural parameter θ is a 1-affine coordinate system.
Let η = (η_i), η_i = ∂_iψ(θ), be the expectation parameters, and let ∂^i = ∂/∂η_i be the tangent vector of the coordinate curve η_i in the new coordinate system η. The vectors {∂^i} form a basis of the tangent space T_θ (i.e., at the point where η = η(θ)) of S. When the two bases {∂_i} and {∂^j} of the tangent space T_θ satisfy
\[
\langle \partial_i,\; \partial^j \rangle = \delta_i^{\,j}
\]
at every point θ (or η), where δ_i^j is the Kronecker delta (denoting the unit matrix), the two coordinate systems θ and η are said to be mutually dual (Nagaoka and Amari (1982)).
Theorem 2.2. When S is α-flat, there exists a pair of coordinate systems θ = (θ^i) and η = (η_i) such that i) θ is α-affine and η is (−α)-affine, ii) θ and η are mutually dual, iii) there exist potential functions ψ(θ) and φ(η) such that the metric tensors are derived by differentiation as
\[
g_{ij}(\theta) = \partial_i\partial_j\psi(\theta), \qquad
g^{ij}(\eta) = \partial^i\partial^j\varphi(\eta), \tag{2.8}
\]
where g_{ij} and g^{ij} are mutually inverse matrices, so that
\[
\eta_i = \partial_i\psi(\theta), \qquad \theta^i = \partial^i\varphi(\eta), \tag{2.9}
\]
the two potentials being related by the Legendre transformation ψ(θ) + φ(η) − θ^i η_i = 0.

In an exponential family, θ is 1-affine and η, where η_i = E_θ[x_i] = ∂_iψ(θ), is (−1)-affine; θ and η are mutually dual, and the dual potential φ(η) is given by the negative entropy,
\[
\varphi(\eta) = E[\log p],
\]
where the expectation is taken with respect to the distribution specified by η.
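Theorem 2.2 and the negative-entropy form of the dual potential can be verified symbolically for the Bernoulli family (a sketch added here, not part of the original text; the model is an assumed example):

```python
import sympy as sp

t, e = sp.symbols('theta eta')
psi = sp.log(1 + sp.exp(t))              # potential psi(theta)
eta = sp.diff(psi, t)                    # eta = d psi / d theta  (eq. 2.9)

# dual potential via the Legendre transform phi(eta) = theta*eta - psi(theta)
theta_of_eta = sp.solve(sp.Eq(eta, e), t)[0]          # invert eta(theta)
phi = sp.simplify(theta_of_eta * e - psi.subs(t, theta_of_eta))

print(sp.simplify(phi))                  # e*log(e) + (1-e)*log(1-e):
                                         # the negative entropy, as in the text
g_theta = sp.diff(psi, t, 2)             # g(theta) = psi''  (eq. 2.8)
g_eta = sp.diff(phi, e, 2)               # g(eta)   = phi''
# mutually inverse metrics: psi''(theta) * phi''(eta(theta)) == 1
print(sp.simplify(g_theta * g_eta.subs(e, eta)))      # 1
```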
2.4
When S is α-flat, a canonical divergence measure D_α(θ,θ′) between two points θ and θ′ can be introduced. It is defined by
\[
D_\alpha(\theta,\theta') = \psi(\theta) + \varphi(\eta') - \theta^i \eta'_i, \tag{2.10}
\]
where η′ = η(θ′) are the η-coordinates of the point θ′, i.e., the (−α)-coordinates of the distribution p(x,θ′). In an exponential family (α = 1) the divergence reduces to the Kullback-Leibler information,
\[
D(\theta,\theta') = E_{\theta'}\Bigl[\log\frac{p(x,\theta')}{p(x,\theta)}\Bigr], \tag{2.11}
\]
since E_{θ′}[log p(x,θ)] = θ^iη′_i − ψ(θ) and φ(η′) = E_{θ′}[log p(x,θ′)]. Let c be an α-geodesic connecting θ and θ′, and let c′ be a (−α)-geodesic connecting two points θ′ and θ″ in an α-flat S. When the two curves c and c′ intersect at θ′ with a right angle such that θ, θ′, and θ″ form a right triangle, the following Pythagorean relation holds,
\[
D_\alpha(\theta,\theta') + D_\alpha(\theta',\theta'') = D_\alpha(\theta,\theta''). \tag{2.12}
\]
This relation was first given by Amari (1982a) and by Nagaoka and Amari (1982) in more general form.

[Figure 4]
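The following numerical check of (2.12) (added here, not part of the original text) uses its best-known instance, the I-projection of a distribution onto an m-flat (linear) family on a finite sample space: the e-geodesic from p to its projection p′ meets the family orthogonally, and the divergences add exactly.

```python
import numpy as np
from scipy.optimize import brentq

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

f = np.array([1.0, 2.0, 3.0, 4.0])        # constraint statistic
c = 2.5                                    # linear family: E_q[f] = c
p = np.array([0.5, 0.25, 0.15, 0.10])      # point outside the family

def tilt(base, lam):
    w = base * np.exp(lam * f)
    return w / w.sum()

def project(base):
    # exponential tilt of `base` meeting E[f] = c: follows the e-geodesic
    lam = brentq(lambda l: tilt(base, l) @ f - c, -20, 20)
    return tilt(base, lam)

p_proj = project(p)                        # I-projection p' of p onto the family
q = project(np.ones(4) / 4)                # any other member q of the family

lhs = kl(q, p)
rhs = kl(q, p_proj) + kl(p_proj, p)
print(lhs, rhs)                            # equal: the Pythagorean relation (2.12)
```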
Ancillary family
Let S be an n-dimensional exponential family parametrized by the natural parameter θ = (θ^i) and let M = {q(x,u)} be an m-dimensional family parametrized by u = (u^a), a = 1,..., m. M is said to be an (n,m)-curved exponential family imbedded in S = {p(x,θ)} by θ = θ(u), when q(x,u) is written as
\[
q(x,u) = \exp\bigl[\theta^i(u)\,x_i - \psi\{\theta(u)\}\bigr].
\]
The geometrical structures of S and M can easily be calculated as follows. The quantities in S in the θ-coordinate system are g_{ij} = ∂_i∂_jψ and T_{ijk} = ∂_i∂_j∂_kψ, and the tangent vectors of M are ∂_a = B^i_a ∂_i with B^i_a = ∂_aθ^i(u).
39
IT q(x/,xu)
= exp[N{ (u)x.
- (u)}}],
[3)
j=l
the geometrical structure of M based on N observations is the same as that
based on one observation except for a constant factor N. We treat statistical
inference based on x. Since a point x in the sample space X can be identified
with a point = x in S by using the expectation parameter , the observed sufficient statistic x defines a point in S whose -coordinates are given by x,
= x. In other words, we regard x as the point (distribution) in S whose
expectation parameter is just equal to x. Indeed, this is the maximum likelihood estimator in the exponential family S.
Let us attach an (n−m)-dimensional submanifold A(u) of S to each point u ∈ M, such that all the A(u)'s are disjoint (at least in some neighborhood of M, which is called a tubular neighborhood) and the union of the A(u)'s covers S (at least the tubular neighborhood of M). This is called a (local) foliation of S. Let v = (v^κ), κ = m + 1,..., n, be a coordinate system in A(u). We assume that the pair (u,v) can be used as a coordinate system of the entire S (at least in a neighborhood of M). Indeed, a pair (u,v) specifies a point in S such that it is included in the A(u) attached to u and its position in A(u) is given by v (see Fig. 5). Let θ = θ(u,v) be the θ-coordinates of the point specified by (u,v). This is the coordinate transformation of S from w = (u,v) to θ, where w = (u,v) = (w^β) is an n-dimensional variable, β = 1,..., n, such that its first m components are u = (u^a) and the last n − m components are v = (v^κ).

[Figure 5]
Any point (in some neighborhood of M) in S can be represented uniquely by w = (u,v). We assume that A(u) includes the point θ = θ(u) on M and that the origin v = 0 of A(u) is put at the point u ∈ M. This implies that θ(u,0) is the point θ(u) ∈ M. We call A = {A(u)} an ancillary family of the model M. In order to analyze the properties of a statistical inference method, it is helpful to use the ancillary family which is naturally determined by the inference method. For example, an estimator û can be regarded as a mapping from S to M such that it maps the observed point η = x̄ in S determined by the sufficient statistic x̄ to a point û(x̄) ∈ M.
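As a small illustration of this mapping (added here, not part of the original text; the (2,1)-family below is an assumed example), the MLE in a curved exponential family maps the observed point η = x̄ of S to the point of M maximizing u ↦ θ(u)·x̄ − ψ(θ(u)):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# S: x ~ N(theta, I_2), natural parameter theta, psi(theta) = |theta|^2 / 2.
# M: (2,1)-curved family theta(u) = (u, u^2)  (an illustrative choice).
psi = lambda th: 0.5 * (th @ th)
theta = lambda u: np.array([u, u**2])

def mle(xbar):
    """u-hat maximizes the curved-family loglikelihood theta(u).xbar - psi."""
    nll = lambda u: -(theta(u) @ xbar - psi(theta(u)))
    return minimize_scalar(nll, bounds=(-5, 5), method='bounded').x

rng = np.random.default_rng(2)
u_true = 1.2
x = rng.normal(theta(u_true), 1.0, size=(2000, 2))
xbar = x.mean(axis=0)        # the observed point eta = xbar in S
print(mle(xbar))             # close to u_true: the map xbar -> u-hat in M
```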
Similarly, a test T can be regarded as a mapping from S to the set of decisions; the inverse image T^{-1}(r) ⊂ S of the rejection value is called the critical region, and the hypothesis is rejected when and only when the observed point η = x̄ ∈ S is in T^{-1}(r). In order to analyze the characteristics of a test, it is convenient to use an ancillary family A = {A(u)} such that the critical region is composed of some of the A(u)'s and the acceptance region is composed of the other A(u)'s. Such an ancillary family is said to be associated with the test T.
In order to analyze the geometrical features of ancillary submanifolds, let us use the new coordinate system w = (u,v). The tangent vectors of the w-coordinate curves are
\[
\partial_\beta = B^i_\beta\,\partial_i, \qquad B^i_\beta = \partial\theta^i/\partial w^\beta, \tag{3.2}
\]
and the metric in the w-coordinate system is
\[
g_{\beta\gamma}(w) = \langle \partial_\beta, \partial_\gamma \rangle = B^i_\beta B^j_\gamma\, g_{ij}; \tag{3.4}
\]
in particular, g_{aκ}(u,0) represents the angles between the tangent spaces of M and A(u).
The family A = {A(u)} is said to be orthogonal when g_{aκ}(u) = 0, where f(u) is the abbreviation of f(u,0) when a quantity f(u,v) is evaluated on M, i.e., at v = 0. We may treat an ancillary family A_N which depends on the number N of observations. In this case g_{aκ} also depends on N. When g_{aκ} = <∂_a, ∂_κ> is a quantity of order N^{-1/2}, converging to 0 as N tends to infinity, the ancillary family is said to be asymptotically orthogonal.
The α-connection in the w-coordinate system is given by
\[
\Gamma^{(\alpha)}_{\beta\gamma\delta} = \bigl(\partial_\beta B^i_\gamma\bigr) B^j_\delta\, g_{ij}
 + B^i_\beta B^j_\gamma B^k_\delta\,\Gamma^{(\alpha)}_{ijk},
\]
where the M-part Γ^{(α)}_{abc} gives the components of the α-connection of M and the A-part Γ^{(α)}_{κλμ} gives those of the α-connection of A(u). When A is orthogonal, the α-curvatures of M and A(u) are given respectively by
\[
H^{(\alpha)}_{ab\kappa} = \Gamma^{(\alpha)}_{ab\kappa}, \tag{3.6}
\]
\[
H^{(\alpha)}_{\kappa\lambda a} = \Gamma^{(\alpha)}_{\kappa\lambda a}. \tag{3.7}
\]
The sufficient statistic x̄ is thus decomposed into two parts (u,v), which together are also sufficient. When the ancillary family A is associated with an estimator or a test, u gives the estimated value or the test statistic, and v plays the role of an asymptotically ancillary statistic. Let us put
\[
u^* = \sqrt{N}\,(u - u_0), \qquad v^* = \sqrt{N}\,v, \tag{3.8}
\]
where u_0 denotes the true parameter value, and write w^* = (u^*, v^*). The joint distribution of w^* admits the Edgeworth expansion
\[
p(w^*) = n(w^*; g_{\beta\gamma})\Bigl\{1 + \tfrac16 N^{-1/2} K_{\beta\gamma\delta}\,h^{\beta\gamma\delta}
 + N^{-1}\bigl(\tfrac{1}{24} K_{\beta\gamma\delta\epsilon}\,h^{\beta\gamma\delta\epsilon}
 + \tfrac{1}{72} K_{\beta\gamma\delta} K_{\epsilon\zeta\eta}\,h^{\beta\gamma\delta\epsilon\zeta\eta}\bigr)\Bigr\} + O(N^{-3/2}), \tag{3.9}
\]
where n(w^*; g_{βγ}) is the multivariate normal density with mean 0 and covariance determined by g_{βγ}, and the K's are cumulant tensors of w^*.
44
Shun-ichi Amari
where D = g (3/3w ), cf. Amari and Kumon (1983), McCullagh (1984). Hence,
h = w ,
h = 1,
i_ot
h = ww - g ,
cc
= w w w - g
w - g w
6 ot
w , etc.
of the
joint distribution of u* and v*, which together carry the full Fisher information. The marginal distribution can easily be obtained by integration.
Theorem 3.2. When the ancillary family is orthogonal, i.e., g_{aκ}(u) = 0, the distribution p(u^*, u_0) of u^* is given by
\[
p(u^*, u_0) = n(u^*; g_{ab})\Bigl\{1 + \tfrac16\,N^{-1/2} K_{abc}\,h^{abc}(u^*) + N^{-1}[\cdots]\Bigr\} + O(N^{-3/2}), \tag{3.10}
\]
where K_{abc} is the third-order cumulant tensor of u^* and the N^{-1} term is built from the squared quantities
\[
\bigl(\Gamma_M^{(m)2}\bigr)_{ab} = \Gamma^{(m)}_{cda}\,\Gamma^{(m)}_{efb}\,g^{ce} g^{df}, \qquad
\bigl(H_M^{(e)2}\bigr)_{ab} = H^{(e)}_{ac\kappa}\,H^{(e)}_{bd\lambda}\,g^{cd} g^{\kappa\lambda},
\]
\[
\bigl(H_A^{(m)2}\bigr)_{ab} = H^{(m)}_{\kappa\lambda a}\,H^{(m)}_{\mu\nu b}\,g^{\kappa\mu} g^{\lambda\nu},
\]
namely the square of the mixture connection of M (a parameterization-dependent quantity), the square of the exponential curvature of M, and the square of the mixture curvature of the ancillary family A(u).
The first-order term of the covariance of a consistent estimator is minimal, and equal to the inverse Fisher information g^{ab}, exactly when the associated ancillary family is orthogonal. From this and Theorem 3.2, we have the following results.

Theorem 3.3. A consistent estimator is first-order efficient iff the associated ancillary family is orthogonal. An efficient estimator is always second-order efficient, because of g_2 = 0.

There exist no third-order efficient estimators in the sense that g_{3ab}(u) is minimal at all u. This can be checked from the fact that g_3 includes a term linear in the derivative of the mixture curvature of A(u); see Amari (1985).
However, by correcting the bias, u^* = û − E_u[û], of an efficient estimator û, we see that there exists a third-order efficient estimator among the class of all the bias-corrected efficient estimators. To state the result, let g_{3ab} = g_3^{cd} g_{ca} g_{db} be the lower-index version of g_3^{ab}.

Theorem 3.4. The third-order term g_{3ab} of the covariance of a bias-corrected efficient estimator u^* is given by the sum of the three non-negative geometric quantities
\[
g_{3ab} = \bigl(\Gamma_M^{(m)2}\bigr)_{ab} + \bigl(H_M^{(e)2}\bigr)_{ab} + \tfrac12\,\bigl(H_A^{(m)2}\bigr)_{ab}. \tag{3.12}
\]
Next consider testing a hypothesis in M against nearby alternatives. Let P(u_t, N) be the power of a test T at the local alternative
\[
u_t = u_0 + t\,N^{-1/2}. \tag{3.13}
\]
Let us expand it as
\[
P(u_t, N) = P_1(t) + P_2(t)\,N^{-1/2} + P_3(t)\,N^{-1} + O(N^{-3/2}). \tag{3.14}
\]
Denoting by P̄_k(t) the corresponding terms of the envelope power function, it is clear that a test T is i-th order uniformly efficient iff
\[
P_k(t) = \bar P_k(t)
\]
holds at any t for k = 1,..., i.
An ancillary family A = {A(u)} in this case consists of (n−1)-dimensional submanifolds A(u) attached to each u or θ(u) ∈ M. The critical region R is bounded by one of the ancillary submanifolds, say A(u₊), in the one-sided case, and by two submanifolds A(u₊) and A(u₋) in the two-sided unbiased case. The asymptotic behavior of a test T is determined by the geometric features of the boundary ∂R, i.e., A(u₊) [and A(u₋)].
In particular, the angle between M and A(u) is important. The angle is given by the inner product g_{aκ}(u) = <∂_a, ∂_κ> of the tangent ∂_a of M and tangents ∂_κ of A(u). When g_{aκ}(u) = 0 for all u, A is orthogonal. In the case of a test, the critical region, and hence the associated ancillary A and g_{aκ}(u), depend on N. An ancillary family is said to be asymptotically orthogonal when g_{aκ}(u) is of order N^{-1/2}. We can assume g_{aκ}(u_0) = 0, and g_{aκ}(u_t) can then be expanded in powers of tN^{-1/2}.

A test is third-order t_0-efficient when P_3(t_0) = P̄_3(t_0), and an efficient test is third-order admissible when there exist no tests T′ satisfying P′_3(t) ≥ P_3(t) for all t with strict inequality for some t. An efficient test is third-order admissible when it is t_0-efficient at some t_0. We define the third-order power loss function (deficiency function) ΔP_3(t) of an efficient test T by
\[
\Delta P_3(t) = \lim_{N\to\infty} N\{\bar P(u_t,N) - P(u_t,N)\} = \bar P_3(t) - P_3(t). \tag{3.16}
\]
Theorem. An efficient test T is third-order admissible iff the mixture curvature of its associated ancillary family satisfies a certain proportionality condition, (3.17), for some constant c. The third-order power loss function is then given by
\[
\Delta P_3(t) = a_i(t,\alpha)\,\{c - J_i(t,\alpha)\}^2\,\gamma^2, \tag{3.18}
\]
where a_i(t,α) is some fixed function of t and α, α being the level of the test,
\[
\gamma^2 = H^{(e)}_{ab\kappa}\,H^{(e)}_{cd\lambda}\,g^{ac} g^{bd} g^{\kappa\lambda} \tag{3.19}
\]
is the square of the exponential curvature (Efron's statistical curvature) of M, and
\[
J_1(t,\alpha) = 1 - t/\{2u_1(\alpha)\}, \qquad
J_2(t,\alpha) = 1 - t/[\,2u_2(\alpha)\tanh\{t\,u_2(\alpha)\}\,],
\]
with i = 1 for the one-sided case and i = 2 for the two-sided case, u_1(α) and u_2(α) being the one-sided and two-sided 100α% points of the standard normal distribution, respectively.
The theorem shows that a third-order admissible test is characterized by its c value. It is interesting that the third-order power loss function (3.18) depends on the model M only through the statistical curvature γ², so that ΔP_3(t)/γ² gives a universal power loss curve common to all the statistical models. It depends only on the value of c. Various widely used tests will next be compared through their c values. The likelihood ratio test has fairly good performance throughout a wide range of t, while the locally most powerful test behaves badly when t > 2. The m.l.e. test is good at around t = 3-4.
We can generalize the present theory to the multi-parameter case with and without nuisance parameters. It is interesting that none of the above tests are third-order admissible in the multi-parameter case. However, it is easy to modify a test to get a third-order t_0-efficient test by the use of the asymptotic ancillary statistic a (Kumon and Amari, 1985). We can also design the third-order t_0-most-powerful confidence region estimators and the third-order minimal size confidence region estimators.

It is also possible to extend the present results of estimation and testing to a statistical model with nuisance parameter ξ. In this case, a set M(u_0) of distributions in which the parameter of interest takes a fixed value u_0, but ξ takes arbitrary values, forms a submanifold. The mixture curvature and the exponential twister curvature of M(u_0) are responsible for the higher-order characteristics of statistical inference. The third-order admissibility of the likelihood ratio test and others is again proved. See Amari (1985).
[Figure 6: third-order power loss NΔP_3(t)/γ², α = 0.05]

[Figure 7: one-sided tests, α = 0.05: score test (locally most powerful test), m.l.e. test, likelihood ratio test]
4.
Let t = t(x) be a statistic T. The amount of Fisher information carried by T is defined by
\[
g_{ab}(T) = E\bigl[\partial_a \ell(t,u)\,\partial_b \ell(t,u)\bigr], \tag{4.1}
\]
where ℓ(t,u) is the logarithm of the density function of t when the true parameter is u. The information g_{ab}(T) is a positive-semidefinite matrix depending on u. Obviously, for the statistic X, g_{ab}(X) is the Fisher information matrix.
Let T(X) and S(X) be two statistics. We similarly define, by using the joint distribution of T and S, the amount g_{ab}(T,S) of information which T and S together carry. The additivity
\[
g_{ab}(T,S) = g_{ab}(T) + g_{ab}(S)
\]
does not hold except when T and S are independent. We define the amount of conditional information carried by T when S is known by
\[
g_{ab}(T|S) = E_S\,E_{T|S}\bigl[\partial_a \ell(t|s,u)\,\partial_b \ell(t|s,u)\bigr], \tag{4.2}
\]
where ℓ(t|s,u) is the logarithm of the conditional density function of T conditioned on S. Then, the following relation holds,
\[
g_{ab}(T,S) = g_{ab}(T) + g_{ab}(S|T) = g_{ab}(S) + g_{ab}(T|S). \tag{4.3}
\]
From this we see that the conditional information is non-negative. The difference
\[
\Delta g_{ab}(T) = g_{ab}(X) - g_{ab}(T)
\]
is the amount of loss of information when we keep only t(x) instead of keeping the original x. The following relations are useful for calculation,
\[
\Delta g_{ab}(T) = E\,\mathrm{Cov}\bigl[\partial_a \ell(x,u),\,\partial_b \ell(x,u)\,\big|\,t\bigr], \tag{4.4}
\]
\[
g_{ab}(S|T) = g_{ab}(S,T) - g_{ab}(T). \tag{4.5}
\]
A statistic T is said to be asymptotically sufficient of order q when
\[
\Delta g_{ab}(T) = g_{ab}(X) - g_{ab}(T) = O(N^{-q+1}). \tag{4.7}
\]
When q = 1, i.e., when the loss of information remains bounded, the statistic is first-order
sufficient. In this case, û is (first-order) efficient. The loss of information of an efficient estimator û is calculated as
\[
\Delta g_{ab}(\hat u) = \bigl(H_M^{(e)2}\bigr)_{ab} + \tfrac12\,\bigl(H_A^{(m)2}\bigr)_{ab}, \tag{4.8}
\]
where (H_M^{(e)2}) is the square of the exponential curvature of the model M and (H_A^{(m)2}) is the square of the mixture curvature of the associated ancillary family A at v = 0. Hence, the loss of information is minimized uniformly in u iff the mixture curvature of the associated ancillary family A(u) vanishes at v = 0 for all u. In this case, the estimator û is third-order efficient in the sense of the covariance in Section 3. The m.l.e. is such a higher-order efficient estimator.
Among all third-order efficient estimators, does there exist one whose loss of information is minimal at all u up to the term of order N^{-1}? Is the m.l.e. such a one? This problem is related to the asymptotic efficiency of estimators of order higher than three. By using the Edgeworth expansion (3.9) and the stochastic expansion of ∂_a ℓ(x,u), we can calculate the terms, which depend on the estimator, of the information loss of order N^{-1} in geometrical terms of the related ancillary family. The loss of order N^{-1} includes a term related to the derivatives of the mixture curvature H^{(m)}_{κλa} of A in the direction of ∂_a (unpublished note). From this formula, one can conclude that there exist no estimators whose loss Δg_{ab}(û) of information is minimal up to the term of order N^{-1} at all u among all other estimators. Hence, the loss of information of the m.l.e. is not uniformly minimal at all u, when the loss is evaluated to this order.
Let us expand the deviation of the statistic v from exact ancillarity in powers of N^{-1/2}. (4.9) It is always possible to make v second-order ancillary by choosing an appropriate coordinate system of A(u). This coordinate system is indeed given by the (α = −1/3)-normal coordinate system at v = 0. The v^* is second-order ancillary in this coordinate system. By evaluating the term of order N^{-1} in (4.9), we can prove that there exists in general no third-order ancillary v.
However, Skovgaard (1985), by using the method of Chernoff (1949), showed that one can always construct an ancillary v of order q for any q by modifying v successively. The q-th order ancillary v is a function of x̄ depending on N. Hence, our previous result implies only that one cannot in general construct the third-order ancillary by using a function of x̄ not depending on N, or by relying on an ancillary family A = {A(u)} not depending on N. There is no reason to stick to an ancillary family not depending on N, as Skovgaard argued.
4.3 Decomposition of information
Since (u,v) together are sufficient, the information lost by summarizing x̄ into u is recovered by knowing the ancillary v. The amount of recovered information g_{ab}(V|U) is equal to Δg_{ab}(U). In order to recover the information of order 1 in Δg_{ab}(U), not all the components of v are necessary. Some functions of v can recover the full information
of order 1, while other components recover only information of smaller order. Let T_{u,p}(A), p ≥ 2, be the subspace of T_u(A) spanned by the p-th order curvature-direction vectors of the model at u. There exists a finite p_0 such that the p-th order curvature vanishes for p > p_0. Now let us consider the following sequence of statistics,
\[
T_1 = \{\hat u\}, \qquad T_2 = \{\hat u,\; H^{(e)}_{ab\kappa}(\hat u)\,\tilde v^{\kappa}\}, \qquad \ldots,
\]
each T_p adjoining the component of the ancillary in the next curvature direction.

The theorems imply the following. An efficient estimator û carries all the information of order N. The ancillary v, which together with û carries the remaining smaller-order information, is decomposed into the sum of p-th order curvature-direction components, and the p-th order component recovers the missing information of order N^{-p+2}. The proof is obtained by expanding ∂_a ℓ(x,u) in the curvature directions and by calculating g_{ab}(T_p); the information carried by the p-th derivative term of ∂_a ℓ(x,u) is equivalent, up to the necessary order, to that carried by the corresponding curvature-direction component of ṽ.
4.4. Conditional inference
When there exists an exact ancillary statistic a, the conditionality principle requires that statistical inference should be done by conditioning on a. However, there exist no non-trivial exact ancillary statistics in many problems. Instead, there exists an asymptotically ancillary statistic v, which can be refined to be higher-order ancillary. The asymptotic ancillary statistic carries information of order 1, and is very useful in improving higher-order characteristics of statistical inference. For example, the conditional covariance of an efficient estimator can be evaluated by conditioning on the asymptotic ancillary v.
5.
Let M = {q(x,u)} be a general statistical model. To each point u ∈ M we attach the Hilbert space
\[
H_u = \{\,r(x) : E_u[r] = 0,\; E_u[r^2] < \infty\,\}, \tag{5.1}
\]
the expectation being taken with respect to q(x,u); a direction of change of the density at u, as in the perturbed function q(x,u) + εr(x)q(x,u), is represented by an element r of H_u. The aggregate
\[
H(M) = \bigcup_{u \in M} H_u \tag{5.2}
\]
is called the fibre bundle with base space M and fibre space H. Since the fibre space is a Hilbert space, it is called a Hilbert bundle of M. It should be noted that H_u and H_{u′} are different Hilbert spaces when u ≠ u′. Hence, it is convenient to establish a one-to-one correspondence between H_u and H_{u′} when u and u′ are neighboring points in M. When the correspondence is affine, it is called an affine connection. Let us assume that a vector r(x) ∈ H_u at u corresponds to r(x) + dr(x) ∈ H_{u+du} at a neighboring point u + du, where d denotes an infinitesimally small change. From
\[
E_{u+du}[r(x) + dr(x)] = \int \{q(x,u) + dq(x,u)\}\{r(x) + dr(x)\}\,dP
 = E_u[r] + E_u\bigl[dr(x) + \partial_a \ell(x,u)\,r(x)\,du^a\bigr] = 0
\]
and E_u[r] = 0, we see that dr(x) must satisfy
\[
E_u[dr] = -E_u\bigl[(\partial_a \ell)\,r\bigr]\,du^a. \tag{5.3}
\]
Let r(x,u) be a vector field, which attaches a vector r(x,u) ∈ H_u to every point u ∈ M. Then, the rate of the intrinsic change of the vector r(x,u) as u changes in the direction ∂_a is given by the α-covariant derivative,
\[
\nabla^{(\alpha)}_a r = \partial_a r + \frac{1-\alpha}{2}\,(\partial_a \ell)\,r + \frac{1+\alpha}{2}\,E_u\bigl[(\partial_a \ell)\,r\bigr]. \tag{5.4}
\]
The tangent spaces form the aggregate
\[
T(M) = \bigcup_{u \in M} T_u(M),
\]
which is a subset of H(M) called the tangent bundle of M. We can define an affine connection in T(M) by introducing an affine correspondence between neighboring T_u and T_{u′}. When an affine connection is given in H(M) such that r ∈ H_u corresponds to r + dr ∈ H_{u+du}, it naturally induces an affine connection in T(M) such that r ∈ T_u(M) ⊂ H_u corresponds to the orthogonal projection of r + dr ∈ H_{u+du} to T_{u+du}(M). It can easily be shown that the geometry of M is indeed that of T(M), so that the α-connection of T(M) or M, which we defined in Chapter 2, is exactly the one which the present α-connection of H(M) naturally induces. Hence, the α-geometry of H(M) is a natural extension of that of M.
Let u = u(t) be a curve in M. A vector field r(t) ∈ H_{u(t)} defined along the curve is said to be α-parallel when
\[
\nabla^{(\alpha)}_t r = \dot r + \frac{1-\alpha}{2}\,\dot\ell\,r + \frac{1+\alpha}{2}\,E_u[\dot\ell\,r] = 0, \tag{5.5}
\]
where the dot denotes differentiation along the curve.
The parallel shift of a vector in general depends on the curve u(t) along which the parallel shift takes place. When and only when the curvature of the connection vanishes, the shift is defined independently of the curve connecting u and u′. We can prove that the curvature of H(M) always vanishes for the α = ±1 connections, so that the e-parallel shift (α = 1) and the m-parallel shift (α = −1) can be performed from a point u to another point u′ independently of the curve. Let Π^{(e)}_{u→u′} and Π^{(m)}_{u→u′} be the respective parallel shift operators from u to u′. Then, we can prove the following important theorem.

Theorem 5.1. The exponential and mixture connections of H(M) are curvature-free. Their parallel shift operators are given, respectively, by
\[
\Pi^{(e)}_{u\to u'}\,r(x) = r(x) - E_{u'}[r(x)], \tag{5.6}
\]
\[
\Pi^{(m)}_{u\to u'}\,r(x) = r(x)\,\frac{q(x,u)}{q(x,u')}, \tag{5.7}
\]
and they are mutually dual,
\[
\langle r, s \rangle_u = \langle \Pi^{(e)}_{u\to u'}\,r,\; \Pi^{(m)}_{u\to u'}\,s \rangle_{u'},
\]
where <·,·>_u denotes the inner product E_u[rs] in H_u.
Proof. Let c: u(t) be a curve connecting two points u = u(0) and u′ = u(1). Let r^{(α)}(x,t) be an α-parallel vector field defined along the curve c. Then, it satisfies (5.5). When α = 1, it reduces to
\[
\dot r^{(e)}(x,t) = -E_{u(t)}[\dot\ell\,r^{(e)}].
\]
Since the right-hand side does not depend on x, the solution of this equation with
\[
E_{u(t)}[r^{(e)}(x,t)] = 0
\]
is written as r^{(e)}(x,t) = r(x) + a(t) with
\[
a(t) = -E_{u(t)}[r(x)].
\]
This yields (5.6), where we put u(1) = u′. Since E_{u′}[r(x)] does not depend on the path connecting u and u′, the exponential connection is curvature-free. Similarly, when α = −1, (5.5) reduces to
\[
\dot r^{(m)}(x,t) + r^{(m)}(x,t)\,\dot\ell(x,u(t)) = 0.
\]
The solution is
\[
r^{(m)}(x,t)\,q(x,u(t)) = a(x),
\]
which yields (5.7). This shows that the mixture connection is also curvature-free. The duality relation is directly checked from (5.6) and (5.7).
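Theorem 5.1, including the duality of the two shifts, can be verified numerically on a finite sample space (a sketch added here, not part of the original text; the five-point space and all names are assumptions of the sketch): Π^{(e)} recenters, Π^{(m)} reweights by the density ratio, and the inner product is preserved between the two.

```python
import numpy as np

rng = np.random.default_rng(3)
# two densities q(., u) and q(., u') on a 5-point sample space
qu = rng.dirichlet(np.ones(5))
qv = rng.dirichlet(np.ones(5))

E = lambda p, r: float(p @ r)                 # E_p[r]
inner = lambda p, r, s: float(p @ (r * s))    # <r, s>_p = E_p[r s]

# random elements of H_u: zero-mean under q(., u)
r = rng.standard_normal(5); r -= E(qu, r)
s = rng.standard_normal(5); s -= E(qu, s)

pi_e = lambda x: x - E(qv, x)    # exponential shift (5.6): recenter under u'
pi_m = lambda x: x * qu / qv     # mixture shift (5.7): x q(.,u)/q(.,u')

print(E(qv, pi_e(r)), E(qv, pi_m(s)))                 # both 0: land in H_{u'}
print(inner(qu, r, s), inner(qv, pi_e(r), pi_m(s)))   # equal: the duality relation
```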
We have defined the imbedding α-curvature H^{(α)}_{abκ} of a curved exponential family. The concept of the imbedding curvature (which sometimes is called the relative or Euler-Schouten curvature) can be defined for a general M as follows. Let P_N be the projection operator of H_u to N_u, the orthogonal complement of T_u(M) in H_u. Then, the imbedding α-curvature of M is a function in x defined by
\[
H^{(\alpha)}_{ab}(x,u) = P_N\,\nabla^{(\alpha)}_a\,\partial_b \ell(x,u). \tag{5.8}
\]
Exponential bundle
Given a statistical model M = {q(x,u)}, we define the following elements in H_u,
\[
X_{1a}(x,u) = \partial_a \ell(x,u), \quad
X_{2ab}(x,u) = \nabla_a\partial_b \ell(x,u), \quad \ldots, \quad
X_{k\,a_1\ldots a_k}(x,u) = \nabla_{a_1}\cdots\nabla_{a_{k-1}}\partial_{a_k} \ell(x,u),
\]
and attach to each point u ∈ M the vector space T_u^{(k)} spanned by these vectors,
\[
T_u^{(k)} = \mathrm{span}\{X_1, X_2, \ldots, X_k\}. \tag{5.9}
\]
The space T_u^{(2)}, which we will also more briefly denote by T^{(2)}, is spanned by the vectors X_1 and X_2, where X_1 consists of the m vectors X_a(x,u) = ∂_aℓ(x,u), a = 1,..., m, and X_2 of the vectors X_{ab}(x,u). The metric tensor of T^{(2)} is then given by
\[
g_{ij} = \langle X_i, X_j \rangle,
\]
where the index i = 1 stands for a single index, b for example, and i = 2 stands for a pair of indices, for example (b,c). Here g_{11} denotes the m × m matrix
\[
g_{11} = \langle X_a, X_b \rangle = E[\partial_a \ell\,\partial_b \ell] = g_{ab},
\]
which is the metric tensor of the tangent space T_u(M) of M. The component g_{21} = g_{12} represents the inner products <X_{ab}, X_c>, and g_{22} = <X_{ab}, X_{cd}>.

[Figure 8]
As u moves by du, the origin of the affine space T_u^{(2)} shifts by
\[
X_1\,du = X_a(x,u)\,du^a \in T_u^{(2)}. \tag{5.10}
\]
Since the last term X_1 du corresponds to the change in the origin, we have the following equation for the α-parallel shift ξ(t) of a point,
\[
\dot\xi^{\,i} + \Gamma^{\,i}_{a\,j}\,\xi^{\,j}\,\dot u^{\,a} + \delta^{\,i}_{1a}\,\dot u^{\,a} = 0. \tag{5.11}
\]
Note that we are here talking about the parallel shift of a point in affine spaces, and not about the parallel shift of a vector in linear spaces, where the origin is always fixed in the latter case.
Let u′ be a point close to u. Let ξ(u′;u) be the point in T_u^{(2)} corresponding to the origin ξ(u′) = 0 of the affine space T_{u′}^{(2)}. The map depends in general on the curve connecting u and u′. However, when |u′ − u| is small, the point ξ(u′;u) is given by
\[
\xi(u';u) = (u' - u) + \tfrac12\,(u' - u)^2 + O(|u' - u|^3)
\]
in schematic form. Hence, if we neglect the term of order |u′ − u|³, the map does not depend on the route. In the component form,
\[
\xi_1^{\,a}(u';u) = u'^a - u^a, \qquad
\xi_2^{\,bc}(u';u) = \tfrac12\,(u'^b - u^b)(u'^c - u^c), \tag{5.12}
\]
where we neglected the term of order |u′ − u|³. Since the origin ξ(u′) = 0 of T_{u′}^{(2)} can be identified with the point u′ (the distribution q(x,u′)) in the model M, this shows that, in the neighborhood of u, the model M is approximately represented in T_u^{(2)} as a paraboloid given by (5.12).
Let us consider the exponential family E_u = {p(x,ξ;u)} depending on u, whose density function is given by
\[
p(x,\xi;u) = \exp\bigl\{\xi^i X_i(x,u) - \psi_u(\xi)\bigr\}\,q(x,u), \tag{5.13}
\]
where ξ is the natural parameter. We can identify the affine space T_u^{(2)} with the exponential family E_u by letting the point ξ ∈ T_u^{(2)} represent the distribution p(x,ξ;u) ∈ E_u specified by ξ. We call E_u the local exponential family approximating M at u. The aggregate E(M) = {E_u : u ∈ M} with suitable topology is called the fibre bundle of local exponential families of degree 2 of M. The metric and connection may be defined from the resulting identification of E(M) with T^{(2)}(M). The distribution q(x,u) exactly corresponds to the distribution p(x,0;u) in E_u, i.e., the origin ξ = 0 of E_u or T_u^{(2)}.

[Figure 9]
Hence, the point ξ = ξ(u′;u), which is the parallel shift of the origin ξ(u′) = 0 of E_{u′}, is the counterpart in E_u of the point q(x,u′) ∈ M, i.e., the distribution p{x, ξ(u′;u); u} ∈ E_u is an approximation in E_u of q(x,u′) ∈ M. For a fixed u, the distributions
\[
M_u = \{q(x,u';u)\}, \qquad q(x,u';u) = p\{x,\,\xi(u';u);\,u\},
\]
form an m-dimensional curved exponential family imbedded in E_u (Fig. 9). The point of this construction is that M is approximated by a curved exponential family M_u in the neighborhood of u. The tangent spaces T_u(M) of M and T_u(M_u) of M_u exactly correspond at u, so that their metric structures are the same at u. Moreover, the squares of the imbedding curvatures are the same for both M and M_u at u, because the curvature is obtained from the second covariant derivative.
Given N independent observations x_(1),..., x_(N), we can define the averaged statistics
\[
\bar X_i(u) = \frac{1}{N}\sum_{j=1}^{N} X_i(x_{(j)}, u). \tag{5.14}
\]
The pair (û, X̄(û)) is asymptotically sufficient, and the loss of information of an efficient estimator takes the same form as (4.8),
\[
\Delta g_{ab} = \bigl(H_M^{(e)2}\bigr)_{ab} + \tfrac12\,\bigl(H_A^{(m)2}\bigr)_{ab}.
\]
When the estimation is performed in E_{u_0}, we can easily prove that Theorems 5.2 and 5.3 hold, because the Edgeworth expansion of the distribution of û is determined from the expansion of ℓ(x,u) up to the second order if the bias correction is used. However, we do not know the true u_0, so that the estimation is performed in E_û. In order to evaluate the estimator û, we can map E_û (and M_û) to E_{u_0} by the exponential connection.
When an asymptotically sufficient statistic (û′, X̄(û′)) based on the N observations is given, we can construct the related estimator given by the solution of e_f(X̄;u) = u, where
\[
e_f(\bar X; u) = E_u\bigl[f(x_{(1)},\ldots,x_{(N)})\,\big|\,\bar X(u) = \bar X\bigr].
\]
Obviously, e_f(X̄;u) is the conditional expectation of û′ given X̄(u) = X̄. By virtue of the asymptotic version of the Rao-Blackwell theorem, the behavior of the resulting estimator can be evaluated within the local exponential family.

Consider, in particular, combining two independent samples with asymptotically sufficient statistics (u_1, X̄_(1)(u_1)) and (u_2, X̄_(2)(u_2)). Both statistics may be shifted to a common fibre E_{u′}, where u′ may be any point between u_1 and u_2, because the same estimator û is obtained up to the necessary order by using any E_{u′}. Here, we simply put u′ = (u_1 + u_2)/2, and let δ be
\[
\delta = (u_1 - u_2)/2.
\]
Then the points X̄_(i)(u_i) are shifted to X̄_(i)(u′) of E_{u′} by the parallel shift, and the shifted statistics are combined. The resulting estimator û indeed coincides with that obtained by the equation e(u) = u up to the third order. Therefore, the estimator û is third-order efficient, so that it coincides with the m.l.e. based on all the 2N observations up to the necessary order.
The above result can be generalized to the situation where k asymptotically sufficient statistics (u_i, X̄_(i)(u_i)) are given in E_{u_i}, i = 1,..., k, u_i being the m.l.e. from N_i independent observations. Let
\[
u' = \sum_i N_i\,u_i \Big/ \sum_i N_i.
\]
Moreover, we define weight matrices Ḡ_{iab} from the statistics shifted to E_{u′}, and put Ḡ_{ab} = Σ_i Ḡ_{iab} with inverse Ḡ^{ab}. Then the estimator
\[
\hat u^{\,a} = \bar G^{ab}\sum_i \bar G_{ibc}\,u_i^{\,c}
\]
is third-order efficient. This theorem shows that the best estimator is given by the weighted average of the estimators from the partial samples, where the weights are given by the Ḡ_i. It is interesting that Ḡ_i is different from the observed Fisher information matrix.
6.
Consider now a statistical model {q(x; θ, ξ)} with a parameter θ of interest and a nuisance parameter ξ. An estimator θ̂ in a class C is said to be best in C when its asymptotic variance is minimal for all allowable ξ and for any estimator θ′ ∈ C. Obviously, there does not necessarily exist a best estimator in a given class C.

Now we restrict our attention to some classes of estimators. An estimator is said to belong to class C_0 when it is given by the solution of the equation
\[
\sum_{i=1}^{N} y(x_i, \theta) = 0,
\]
where y(x,θ) is a function of x and θ only, i.e., it does not depend on ξ. The function y is called the estimating function.
E j [8 y(x,)] + 0 ,
i
Let H
(M) <= H
g = g - v
where g
2 /
= <w2>
'
2
2
= <u >, g ^ = <v >, g ^ = <u,v> are the components of Fisher informa-
tion matrix.
" 9
75
(Kumon and Amari, 1984) when its estimating function y(x,) can be decomposed as
y(x,) = w(x;,) + n(x;,) .
The class of the uniformly informative estimators is denoted by C..,. A uniformly informative estimator satisfies
<y,w> f = < w 2 > j = g (, ) .
Let Cjy be the class of the information unbiased estimators introduced by
Lindsay [1982], which satisfy a similar relation,
<VW>
= <V >
y
y
' w ,
,
= g"1 + g" 2 g .
We go further beyond this theory by the use of the Hubert bundle theory.
6.2.
*>
Shun-ichi Amari
<
= U , { ( m ^ , a ( x ) I a(x)T j l } ,
i.e., the subspace composed of all the m-parael shifts to (,) of the vectors
belonging to the tangent space T
H
where H.
subspace at (,). We next define the nuisance subspace H^r at (,) spanned
by the m-parallel shifts
decomposition
I is the orthogonal complement of H
where H>
inNT
HQ,_. It is called the
,
information subspace at (,). Hence,
, r H
N
r
(6.2)
parts of r.
We now define some important vectors. Let us first decompose the
-score vector u = 3 A &T n
(6.3)
, from (,*) to
,
. When
g l l U ^ x . ' M
77
for which
<wI,GI(x;,;')> = g (')
(6.4)
holds for any 1 . The vector w (w;,)HQ r is called the information vector.
When it exists, it is unique.
6.3. Existence theorems and optimality theorems
It is easy to show that a field r(x;,) is se-invariant if its
identically Hence, any estimating function
nuisance partt rN vanishes identically.
y(x,)C, is decomposed into the sum
y(x>) = y ^ x .) + y(x;,) .
We can prove the following existence theorems.
Theorem 6.3. The class C, of the consistent estimators is nonempty
if the information subspace H A _ includes a non-zero vector.
Theorem 6.4. The class Cy, of the uniformly informative estimators
in C, is nonempty, if (x;,;') are coplanar. All the uniformly informative
estimators have the identical I-part y (x;,), which is equal to the information vector w (x;,).
Outline of proof of Theorem 6.3. When the class C, is nonempty,
there exist an estimating function y(x,) in C-.. It is decomposed as
y(x,) = y ^ x .) + y(x;,) .
Since y is orthogonal to the tangent space H
we have
<y,u> = 0 .
By differentiating <y(x,)> = 0 with respect to , we have
0 = <ay> + <y,u>
Shun-ichi Amari
By differentiating <a>
. This proves
Hence, the above y(x .') does not depend on ' so that it is an estimating
function belonging to C j.
or
<y I (x;,), iiI(x;,;l)>= g Q (') .
This shows that are coplanar, and the information vector w is given by
projecting y to H
y(,e) =( e ) ^ V ,
which yields an estimating function belonging to Cyj.
The classes C, and Cyj are sometimes empty.
example later.
We will give an
Even when they are nonempty, the best estimators do not neces-
sarily exist in C-. and in Cx... The following are the main theorems concerning
best estimators.
The best
79
a H
_ is orthogonal to u
in H Q r .
,
as
A V [ ; ] = lim N { ( C2 A. + B . ) } / { ( C . A . ) 2 } ,
N^co
+ <(y)2>
From this, we can prove that, when and only when B. = 0, the estimator is
uniformly best for all sequences . The best estimating function is u
for = (,,, . . . ) . Hence it is required that u
is se-invariant.
(x;,)
This
proves Theorem 6.5. The proof of Theorem 6.6 is obtained in a similar manner
by using w
6.4.
instead of u .
(6.5)
When
u = afts + a r - a ,
v = s - 3 .
80
Shun-ichi Amari
( m
^.
, =
{f
(6.6)
is given by
(a^) = O ^ ) = 0 .
On the other hand, a consistent estimator does not necessarily exist
in general. We give a simple example: Let x = (x-^Xp) be a pair of random
variables taking on two values 0 and 1 with probabilities
P(x! = 0) = 1/(1 + exp + }) ,
P(x 2 = 0) = 1/(1 + exp{k()}) ,
where k is a known nonlinear function. The family M is of n-exponential type
only when k is a linear function. We can prove that H
= {0}, unless k is
linear. This proves that there are no consistent estimators in this problem.
Now we can obtain the best estimator when it exists for
n-exponential family. The I-part of the -score u is given by
81
this case.
According to Theorem 6.4, in order to guarantee the existence of
uniformly informative estimators, it is sufficient to show the coplanarity of
(x;,;'), which guarantees the existence of the information vector
w(x;,)H* r. By putting w = hsMs.s) 1 + f(s)(3 r ) 1 , this reduces to the
integral-differential equation in f,
<w,'(3 s) 1 + (3 r) I > I = g (') .
(6.7)
When the above equation has a solution f(s;,), are coplanar and the information vector w exists. Moreover, we can prove that when (3 Q r) = 0, the
information vector w is e-invariant.
Theorem 6.9. The best uniformly informative estimator exists when
(3 r) = 0. The best estimating function is given by solving
(6.8)
where h(s ) does not depend on and V[a s | s] is the conditional covariance.
We give another example to help understanding.
Let x = (x-^Xp) be
From 3s = x 0 , dr = 0, we have
= (x2 - x^/d + 2 ) ,
(Sg)1 = 0.
82
Shun-ichi Amari
Hence, from Theorems 6.7 and 6.8, the class C, is nonempty, but the best
estimator does not exist in C-.. Indeed, we have
u J (x;,) = (x 2 - X-, )/( + 2 ) ,
which depends on so that it is not e-invariant. Since any vector w in H
can be written as
w = h(s)(3 s) 1
for some h(s;,), the information vector w (x;,)H
can be obtained by
7.
-representation of spectrum
Let M be the set of all the power spectrum functions S() of
zero-mean discrete-time stationary regular Gaussian time series, S() satisfying the Paley-Wiener condition,
Nog S()d > - oo .
Stochastic properties of a stationary Gaussian time series x t h t = ..., -1, 0,
1, 2, ..., are indeed specified by its power spectrum S(), which is connected
with the autocovariance coefficients c. by
C. = 27 f S() COStd ,
(7.1)
(7.2)
where
c
t=
E[x
rVt]
r - {S()
{S() ,
(7.3)
log S() ,
83
= 0 .
84
Shun-ichi Amari
2 . i COSt ,
(7.4)
where
.
= \SL
()cStd ,
t = 0 1, 2, ....
S(; ) = J
(7.5)
= 0.
- c+
are hence common to both the parametric and non-parametric models, irrespective
of the dimension n of the parameter space.
We can introduce a geometrical structure in M or M in the same
manner as we introduced before in a family of probability distributions on
D i f f e r e n t i a l Geometrical Theory of S t a t i s t i c s
85
9ab (u)= l im TE [ 3 a W ]
-*
Ull- j E [ { W-WbVV ]
However, the limiting process is tedious, and we define the geometrical structure in terms of the spectral density S() in the following.
Let us consider the tangent space T at u of M or M , which is
spanned by a finite or infinite number of basis vectors 3 = a/3u associated
a
with the coordinate system u. The -representation of 9 is the following funca
tion in ,
Hence, in M, the basis d[a' associated with the -coordinates ^ is
1 ,
t= 0
2C0St ,
t =f 0 .
ye>
Shun-ichi Amari
d u e
aD
It is calculated as follows.
Theorem 7.2. The -divergence from S, to Sp is given by
/ 2 ) Jf {[SI9 ()/SI ()] - 1 - lg[S29 /S1 ]}d , j 0
(1/2) f [log S () - log S 9 ()] 2 d ,
= 0.
K"~ I
Jo Vt-k =t
where {.} is a white noise Gaussian process with unit variance and a = (a Q ,
a.,...,a ) is the (n+1)-dimensional parameter specifying the members of M n .
Differential
Geometrical Theory of S t a t i s t i c s
87
IV
i ki-2
S( a) = | k ^ 0 ake
cesses
n
x
k=0 b k t-k
S( b) = | b k e
|.
n
c. = . a.a. . ,
k = 0, l,...,n
c. = 0 ,
k > n
where
88
Shun-ichi Amari
nR}'{MnA}
and { M
nXP}
are nested
nMn
(7.8)
nMn
n ^ = Jn - A + A >
(7 9)
Figure
89
10
Hence,
= JoD-c
The theorem is proved by the Pythagorean relation for the right
triangle S S S Q composed of the -geodesic S S Q included in MJ and --geodesic
SS intersecting at S perpendicularly. The theorem shows that the approximation error E (S) is decomposed into the sum of the --divergences of the
successive approximations S. , k = n+ ,...,>, where S^ = S is assumed. Moreover, we can prove that the --approximation of S. in M (n < k) is S . In
K
n
n
other words, the sequence {S } of the approximations of S has the following
property that S is the best approximation of S.K (k > n) and that the approximation error E (S) is decomposed into the sum of the --divergences between the
further successive approximations. This is proved from the fact that the 1
geodesic in M connecting two points S and S belonging to M" is completely included in MJ for an -model M .
Let us consider the family {M } of the AR-models. It coincides
with M . Let S be the -1-approximation of S. Let c.(S) and c.(S) be, respectively, the autocovariances and inverse autocovariances. Since c. and c^
are the mutually dual -1-affine and 1-affine coordinate systems, the -1-approx-
90
S h u n - i c h i Amari
ct(Sn) = ct(S),
t = 0, 1 , . . . , n
2)
ct(Sn) = 0 ,
t = n+1, n+2, . . . .
REFERENCES
a geometrical foundation of
a differential geometrical
(1982b). Geometrical theory of asymptotic ancillarity and conditional inference. Biometrika 69, 1-17.
Amari, S.
Amari, S.
92
Shun-ichi Amari
Informa-
93
94
Shun-ichi Amari
a higher order asymptotic theory in multiparameter curved exponential family, METR 85-2, Univ. Tokyo.
Lauritzen, S. L. (1987). Some differential geometrical notions and their use
in statistical theory.
a generalization
1.
Introduction
97
2.
99
3.
Transformation Models
118
4.
Transformation Submodels
127
5.
130
6.
Observed Geometries
135
7.
Expansion of c | j |^L
8.
152
9.
Appendix 1
154
10. Appendix 2
156
11. Appendix 3
157
12. References
159
147
95
1.
INTRODUCTION
98
0. E. Barndorff-Nielsen
parameter of the model, is minimal sufficient. The observed geometries and the
closely related expansion of c|j|^L form a parallel to the "expected geometries"
and the associated conditional Edgeworth expansions for curved exponential
families studied primarily by Amari (cf., in particular, Amari 1985, 1986), but
with some essential differences. In particular, the developments in sections 6
and 7 are, in a sense, closer to the actual data and they do not require integrations over the sample space; instead they employ "mixed derivatives of the
log model function." Furthermore, whereas the studies of expected geometries
have been largely concerned with curved exponential families the approach taken
here makes it equally natural to consider other parametric models, and in particular transformation models. The viewpoint of conditional inference has been
instrumental for the constructions in question. However, the observed geometrical calculus, as discussed in section 6, does not require the employment of
exact or approximate ancillaries.
The observed geometries provide examples of the concept of
statistical manifolds discussed by Lauritzen (1986).
Throughout the paper examples are given to illustrate the general
results.
on both the observation x and the parameter and we shall call any such function a combinant.
Jacobians. Our vectors are row vectors and we denote transposition of a matrix by an asterix *.
If f is a differentiate transformation of
a space _Y then the Jacobian af/ay* of f at yY_ is also denoted by Jr (y), while
we write J f (y) for the Jacobian determinant, i.e. J f = |JJ . When appropriate
we interpret J f (y) as an absolute value, without explicitly stating this. We
shall repeatedly use the fact that for differentiate transformations f and g
we have
Jf
o g
( y ) = Jg(y)J f (g(y))
(2.1)
Jf
0 g
(2.2)
and hence
99
100
0. E. Barndorff-Nielsen
Foliations. A partition of a manifold of dimension k into submanifolds all of dimension m<k is called a foliation and the submanifolds are said
to be the leaves of the foliation.
A dimension-reducing statistical hypothesis may often, in a natural
way, be viewed as a leaf of an associated foliation of the parameter space .
Likelihood. We let L = L() = L( x) denote an arbitrary version
of the likelihood function for and we set 1 = log L. Furthermore, we write
3 r = 3/3, and 1 = a 1, 1
matrix
j() = -[l r s ]
(2.3)
i() = E j().
(2.4)
The inverse matrices of j and i are referred to as observed and expected formation, respectively.
Suppose the minimal sufficient statistic t for M^ is of dimension k.
We then speak of M as a (k,d)-model (d being the dimension of the parameter ).
Let (,a) be a one-to-one transformation of t, where is the maximum likelihood estimator of and a, of dimension k-d, is an auxiliary statistic.
In most applications it will be essential to choose a so as to be
distribution constant either exactly or to the relevant asymptotic order. Then
a is ancillary and according to the conditionality principle the conditional
model for given a is considered the appropriate basis for inference on .
However, unless explicitly stated, distribution constancy of a is
not assumed in the following.
There will be no loss of generality in viewing the log likelihood
1 = l() in its dependence on the observation x as being a function of the
minimal sufficient (,a) only. Henceforth we shall think of 1 in this manner
and we will indicate this by writing
1=1(,,a).
101
(2.5)
(2.6)
(2.7)
(2.8)
uniformly in for /(-) bounded, where n denotes sample size. This holds,
in particular, for (k,d) exponential models. For more details and further
102
0. E. Barndorff-Nielsen
(2.9)
3/2
l = 2?{-3U4
12U
3,1 " 5 U 3
+ 24U
(2
10
>
and
^
A 2 (u) = P 3 (u)U 2 > 2
1 2 j l
+ P 2 (u)U 3
P 4 (u)| f l + P 5 (u)U 4 + P 6 ( u ) U 3 J
P 7 (u)U 2
where P (u), i = 1,...,8, are polynomials, the explicit forms of which are
given in Barndorff-Nielsen (1985), and where U =
v
/ x
v
,U, /( )x _ 9 s {r v ; (;,a)}
v,s
.(v+s)/2
*
s
= 0,1,2...
v+s
), respectively.
(2.11)
Differential
and I n t e g r a l Geometry in S t a t i s t i c a l
Inference
103
/
|j|^2d
Iw , a
(2.12)
frame
inference
procedure
procedure
conclusion
> conclusion
reparametrization
104
0. E. Barndorff-Nielsen
Lv]
dr
[i]()
(D()/i()^
il/i) 3 5 ,
v = 2,3,...,
2 r, p
/per
etc. Furthermore, we write l() for the log likelihood under the parametriza-
105
tion by , though formally this is in conflict with the notation l(), and
correspondingly we let 1 = 9 1 = 9 l(), etc.; similarly for other parameter
dependent quantities. Finally, the symbol
the maximum likelihood estimate has been substituted for the parameter.
Using this notation and adopting the summation convention that if a
suffix occurs repeatedly in a single expression then summation over that suffix
is understood, we have
1 = 1 ,
P
r /p
1
= 1 J. S. + 1 J.
p
1
=1
p
rs /p /
(2.13)
v
r /p
'
(2.14)
'
etc., where [3] signifies a sum of three similar terms determined by permutation
of the indices p,,. On substituting for in (2.13) we obtain the wellknown relation
j
( a) = 3-rs(;a) ^ ^
9
Equation (2.15) shows that j is a metric tensor on M, for any given value of the
auxiliary statistic a. Moreover, in wide generality 3- will be positive definite
on M^, and we assume henceforth that this is the case. In fact, for any we
have 3- = j, i.e. observed information at the maximum likelihood point, which is
generally positive definite (though counterexamples do exist).
p
Let A() = [A
()] be an array, depending on and where
s
s
l ' q
each of the p + q indices runs from 1 to d. Then A is said to be a (p,q)
106
0. E. Barndorff-Nielsen
.p
s-,
p-j
r ,.. .r
F J
3i
K l
"
Low
/q
and B
S
1S2
tl
t 2 ...
This product is again a tensor, of rank (p1 + p", q' + q") if (p',qf) and
(p",q") are the ranks of A and B.
Lower rank tensors may be derived from higher rank tensors by contraction, i.e. by pairwise identification of upper and lower indices (which
implies a summation).
The parameter space as a manifold. The parameter space may be
viewed as a (pseudo-) Riemannian manifold with (pseudo-) metric determined by
a metric tensor , i.e. is a rank 2 covariant, regular and symmetric tensor.
o
The associated Riemannian connection v is determined by the Christoffel symbols
Of
where
tu
?* = ?
rs
rsu
and
then
107
s 3 t
On the other hand, any set of functions [r ] which satisfy the law (2.18)
constitute the connection symbols of an affine connection on . It follows that
all affine connections on are of the form
where the S
L = r + s
rs
rs rs
are characterized by the transformation law
(2 19)
\^-^)
S p T () = S s () / r p ^; t .
(2.20)
r st
rst = rs tu a n d S rst = S rs tu
rst
/p / /
tu
/p /
'
rst rst
\t-")
S^^./ # , .
rst /p / /
(2.23)
'
rst
and
=
S
p
ab
(a) =
rs
()
/a /b '
(2
'
24)
108
0. E.
Barndorff-Nielsen
(2
'25)
G.
(2.26)
gG.
(2.27)
(On the left hand side the tensor is expressed in coordinates, on the right
hand side in coordinates.) Similarly, a connection r is said to be invariant
1f
rj s () = rj $ (g),
gG.
(2.28)
VV
If r
)=
9 ) 9G>
VV '
1
?*rs = rrs
+ t ursu
S
is a G-invariant connection.
Now, let be the information tensor i on . Then (2.16) takes the
form
?
rst E { V s V -
Obviously,
= E<y$lt}
(2.29)
109
satisfies (2.23) and hence, for any real an affine connection is defined by
?
rst=ElrsV+E{1V
<2 3 0 >
These are the -connections introduced and studied by Chentsov (1972) and
Amari (1982a,b, 1985, 1986).
However, we shall be mainly concerned with another type of connection, determined from observed information, more specifically from the metric
tensor 3-, see sections 6-8. We refer to i and 3- as expected and observed information metric on i, respectively.
Suppose, as above, that :$ -* is an immersion of B in . The
submodel NL of ^ obtained by restricting to lie in = (B) has expected
information
i(3) = ^ H ) ^
(2.31)
Thus 1(3) equals the Riemannian metric induced from the metric i() on to
the imbedded submanifold Q . Furthermore, the -connection of the model ML
equals the connection on Q induced from the -connection on , by the general
construction (2.25).
The measures on defined by
li^d
(2.32)
\3\
(Z.33)
and
are both geometric measures, relative to expected and observed information
metric, respectively.
sures that they are parametrization invariant. This property follows from
the fact that i and a" are covariant tensors of rank 2. As a consequence we
have that c|j| L (of (2.7)) is parametrization invariant.
Invariant measures. A measure on x is said to be invariant with
respect to a group G acting on X^ if gy = y for all gG.
110
0. E. Barndorff-Nielsen
d(x) = J ^ J O J W ) .
(2.34)
Here J / % denotes the Jacobian determinant of the mapping (g) of X. onto itself
determined by gG and (z,u) constitutes an orbital decomposition of x, i.e.
(z,u) is a one-to-one transformation of x such that uX^ and u is maximal
invariant while zG and x=zu. For a more detailed discussion see section 3
and appendix 1.
Transformation models. Let G be a group acting on the sample space
)L
invariant under the induced action of G on the set of all probability measures
on X then the model is called a composite transformation model and if IP
111
(2.35)
Here k is the order of the model (2.35) and is equal to the common dimension
of the vectors () and t(x), while d denotes the dimension of the parameter .
The full exponential model generated by (2.35) has model function
p(x ) = exp t(x) - () - h(x)}
(2.36)
112
0. E. Barndorff-Nielsen
i.e. is the mean value parameter of (2.36), and we write T for (int)
where denotes the canonical parameter domain of the full model (2.36).
Let f be a real differentiate function defined on an open subset
k
of R .
where
y = (Df)(x) =fj(x) .
The Legendre transform is a useful tool in studying various, dualistic aspects
of exponential models (cf. Barndorff-Nielsen (1978a), Barndorff-Nielsen and
Blaesild (1983a)).
In particular, we may use the Legendre transform to define the
dual likelihood function 1 of (2.35) by
-1
1 () = e () - l(()).
(2.37)
Here, and elsewhere, ' as top index indicates maximum likelihood estimation
under the full model. Further, in this connection we take 1 as the sup-loglikelihood function of (2.36) and then 1 is, in fact, the Legendre transform of
K.
(2.38)
x log x,
f (x) = - ^ { - x
1-
-log X,
113
= 1
(1+)/2
},
-l<<l .
(2.40)
= -1
-1
l() = -I(,) = - f() - ()
(2.41)
(2.42)
-*l*) +2^() - ( ^ + ^ ) }
4
K) = -K [e 2
2
2
_>
-
are then, in the present terminology and for = 1, leaves of the maximum likelihood foliation.)
Exponential transformation models. A model M^which is both transformational and exponential is called an exponential transformation model. For
such models we have the following structure theorem (Barndorff-Nielsen,
Blaesild, Jensen and Jorgensen (1982), Eriksen (1984b)).
Theorem 2.1. Let M be an exponential transformation model with
114
0. E. Barndorff-Nielsen
acting group G. Suppose X^ is locally compact and that t is continuous. Furthermore, suppose that G is locally compact and acts continuously on X_.
Then there exists, uniquely, a k-dimensional representation A(g) of
G and k-dimensional vectors B(g) and B(g) such that
t(gx) = t(x)A(g) + B(g)
(2.43)
(2.44)
where eG denotes the identity element. Furthermore, the full exponential model
generated by M_ is invariant under G, and &* = {[A(g~ )*,&(g)]: gG} is a group of
affine transformations of R leaving and into invariant in such a way that
(gP) = (P)A(g'V + B(g),
gG, PlP .
(2.45)
a((gP)) = a((P))(g)"1exp(-(gP).B(g)).
(2.46)
We then have
115
(2.47)
is positive definite. We shall then say that we have a maximum estimation procedure. Maximum likelihood estimation and dual maximum likelihood estimation
-1
(where m() = l() = () - l(), cf. (2.37)) are examples of this. More
116
0. E. Barndorff-Nielsen
(2.48)
X
4
a i
CD
r-4 ^- O
0)
CO
H
H
O
-t
CQ
CD u
CO
0
>
id
CO
H
U
H
en
n
x
II
P -H
CO
CO
o
0 Q)
O
H
g -P 4J
o
en
O
in
O
CO
11 If
CD
ft
5
o
rH CO
H W H
M V4
+J TJ P
CQ U
<D
H
4J
CO
117
3. TRANSFORMATION MODELS
119
i.e.
u(gx) = u(x),
z(gx) = gz(x).
cosets of K.) Note that G p = hG p h" , and that the action of G on P is free if
120
0. E. Barndorff-Nielsen
and only if K consists of the identity element alone. The quantity h parametrizes _P.
Suppose G = HK is a factorization of this kind. For most transformation models of interest, if the action of G on X. is not free then there exists
an orbital decomposition (z,u) of x with zH and such that for every u the isotropy group G equals K and, furthermore, if z and z 1 are different elements of
H then zu f z'u.
Example 3.1. Hyperboloid model. This model (Barndorff-Nielsen
(1978b), Jensen (1981)) is analogous to the von Mises-Fisher model but pertains
to observations x on the unit hyperboloid H^"1 of R k , i.e.
The analogue of the orthogonal group 0(k) is the so called pseudoorthogonal group 0(1,k-1), which is the subgroup of GL(k) with matrix representation
0(1,k-1) = {U:ll* 1 U = 1}
where t denotes the k x k diagonal matrix
1 0
0 -1
. -1
Differential
121
SO (l,k-l)
(vector-matrix multiplication).
The points of H
bolic-spherical coordinates as
XQ = cosh u
x. = sinh u cos v,
x 2 = sinh u sin v, cos v ?
specified by
d = sinh k " 2 u sin k " 3 v 1 ... sin v k _ 3 dudv ] ... dv k _ 2
(3.1)
The hyperboloid model function, relative to the invariant measure
(3.1) on H k ~ \ is
p(x;,) = a k ()e" * x
(3.2)
where the parameters and , called the mean direction and the precision,
satisfy Hk-1 and >0, and where
a k () =
k/2
- /{(2)
k/2
' 2K k / 2 _ 1 ()}
(3.3)
122
0. E. Barndorff-Nielsen
Vl
X
1 + l+x
1X2
Vk-1
1+X
h=
1+
Vl
k-1 x 1
1+Xn
k-l x 2
1+x 0
2 X k-1
(3.4)
. 1 + k-1
1+Xn
for s = s(x) and for any x_X, and we speak of this as the action induced by s.
In the applications to be discussed later S is typically the parameter domain
under some parametrization of the model and s is the maximum likelihood estimator, which is automatically equivariant.
We are now ready to state the results which constitute the main
tools of the theory of transformation models.
Subject to mild topological regularity conditions (for details, see
Barndorff-Nielsen, Blaesild, Jensen and Jorgensen (1982)) we have
Lemma 3.1. Let u be an invariant statistic with range space U =
u U ) , let s be an equivariant statistic with range space S = s O O , and assume
that the induced action of G on S is transitive. Furthermore, let be
123
(3.5)
for some functions q and r and some invariant statistic w which is a function
of u.
Then the following conclusions are valid.
(i) The model function p(x g) is of the form
p(x g) = q(u)r(g" s,w),
(3.6)
<p>
(3.7)
(3.8)
1 2 4
0. E. Barndorff-Nielsen
(3.9)
125
yR
++
y2
1 +
z=
y1
y,
(3.10)
2
2
y-i + Yo then (u,z) constitutes an orbital decomposition of
measure
dy(y) =
p
Suppose theorem 3.1 applies with S = H and let L(h) = L(h x) be any
12 6
0. E.
Barndorff-Nielsen
<v> .
(3.11)
L Q (s s~ g)
L Q ( S Q )
<> ,
the invariant measure being denoted here by , as a standard notation for left
invariant measure on G. This formula, which generalizes a similar
expression for the location-scale model due to Fisher (1934), shows how the
"shape and position" of the conditional distribution of s is simply determined
by the observed likelihood function and the observed s Q , respectively.
Formula (3.11), however, besides being slightly more general, seems
more directly applicable in practice.
4.
TRANSFORMATIONAL SUBMODELS
If P~ is any
of TG may be represented as
f(" (x.-)).
]
i=l
(4.1)
Here G is the affine group with elements [,] which may be represented by
2 2 matrices
1
JJ
the group operation being then ordinary matrix multiplication. The Lie algebra
of G, or equivalently TG , is represented as the set of 2 x 2 matrices of the
127
128
0. E. Barndorff-Nielsen
form
A=
We have
e t A = I + tA + 2T t 2 A 2 +.
b/a(e t a -l) e t a
where the last expression is to be interpreted in the limiting sense if a = 0.
There are therefore four different types of submodels. Specifically, letting U Q ^ Q )
denote an
' e
(4.2)
c o s h
"sinh*
s i n h
129
may be represented as the subgroup of GL(3) whose elements are of the form
0
cos
-sin
where -><<->.
cosh
sinh
sin j I sinhx
cosh x
COS
, 2
(4.4)
instance, Barut and Raczka (1980) chapter 3) of S0*(l;2) into the product of
three subgroups, the three factors in (4.4) being the generic elements of the
respective subgroups.
It follows that TG
I 0
' 0
1 , E2 = j 1
!
0 - 1 0
1
1 I
-1
third factor in (4.4) yields, when applied to the distribution (4.3) with
X = = 0, the following one-parameter submodel of the hyperbolic model:
p(u,v;)
(2
~(
c o s h
^sinh u
"Snh
c o s
) ~ 2 ^s i n h
aa
b ~i
I
where a, b, c are fixed real numbers.
-c
s i n
= P( g)
is positive definite.
In these circumstances we have:
Proposition 5.1. The maximum estimator ft is an equivariant mapping
130
131
(5.1)
(g)
I (g)
|
G/K
(5.2)
H
P
commutes. Let be the mapping from G to H that sends a gG into the uniquely
determined hH such that g = hk for some kK. For any fr = (x) in H we have
that (g)fr = (gx) is determined by
f({V(g)}" gx) > f(h - 1 gx),
hH.
(5.3)
-1
and here (g" h) ranges over all of H when h ranges over H. Hence (5.3) may be
rewritten as
f(h" ),
hH,
132
0. E. Barndorff-Nielsen
i . e . , by ( i i ) ,
= n(rt(gx))
or,
equivalently,
() =
and this, precisely, expresses the commutativity of (5.2), since p~ (h) = hK.
When the mapping x -> (f,u) is proper the subgroup K is compact
because K = G . Hence there exists an invariant measure on H, cf. appendix 1.
That |tfpdh is such a measure follows from (3.9) and formula (5.10) below.
In particular, then, there is only one action of G on H at play,
namely , and
(g)h = (gh).
(5.4)
,2
*() =*(;tl) = - J L j L ( ; h u ) .
(5.5)
(5.6)
Mjp(h) _ M i
((rh))
^ h ) * (h)
(5.7)
and
2
(5.8)
133
ne',u)\h
(5.10)
and this, by (3.9) and the tensorial nature of *, implies that j-RfJl^d is an
invariant measure on . In connection with formula (5.10) it may be noted that
J
'(h) ( e ) = J ( h ) ( e )
where 6 denotes left action of the group G on itself. A proof of this latter
formula is given in appendix 2.
Secondly, the tensor -K() is found to be G-invariant, whatever the
value of the ancillary.
Consequently
- (g)
and this together with (5.6) and (2.26) establishes the invariance.
In particular, observed information ^determines a G-invariant
Riemannian metric on the parameter space. The expected information metric i
can also be shown to be G-invariant.
From proposition 5.1 and corollary 3.1 we find
Corollary 5.1. The model function p*(*,|u) = c|<|\/t' is exactly
equal to p(;|u).
By taking m of (ii) equal to the log likelihood function 1 this
corollary specializes to theorem 4.1 of Barndorff-Nielsen (1983).
Suppose, in particular, that the model is an exponential transform-
134
0. E. Barndorff-Nielsen
ation model. Then the above theory applies with m() = l(). The essential
-j
property to check is that l(;t(x)) is of the form f(h x). This follows simply
6.
OBSERVED GEOMETRIES
In applications to
r s
= 9
rr..rp,s..sq
\ h
(6J)
r r . . rp s r . . sq
and refer to these quantities as mixed derivatives of the log model function.
The function of and a obtained from (6.1) by substituting for will be
. Thus, for instance,
denoted by \
r..rp;s..s
=
*rs;t
(;a)
136
0. E. Barndorff-Nielsen
-? = -f(;a) = g(;,a).
\
r s
s
r..rpSs..sq
So are the terms of an asymptotic expansion of (2.7), cf. section 7.
Employing the notation established above we have 9.6- = -*c + -Jc. +9 etc.
u rs
rsu rs,u
so that
= _(}
+ > . t [3])
(6.4)
(6
' 5)
/p / /
rS /P /
[3]
'
= +
rs;t /p / /
we find
V,t /P /'
(6
'
137
*rs
or
rs ~ * *Vsu
with
In particular, we have
1
-1
V,rs
(6J1)
- 1 1
rts str
- 1
str rts
and
a
l.l
"1
analogy between r and -F becomes more apparent by rewriting the skewness tensor
(2.29) as
T
rst = "
E{1
rst
rs + V s
}=
'
(6J2)
138
0. E. Barndorff-Nielsen
has that, in broad generality, the observed geometries converge to the corresponding expected geometries as the sample size tends to infinity.
For (k,k) exponential models
p(x ) = a()b(x)e # t ( x )
(6.13)
j- = i and = r, R.
Let i,j,k,... be indices for the coordinates of , t and , using
upper indices for and lower indices for t and .
In the case of a curved exponential model (2.35), we have
\ = (t-).}
(6.14)
and, letting denote the maximum likelihood estimator of under the full model
generated by (2 35), the relation +
= ? takes the form
r, s rs
V , s ( ) = ij ( ) jr*/s
" 1J ( > /r /s - (*-^i
Furthermore,
-<1jk()/r/s/t
rst
i j
;rs^t=irst
(6.17)
and
( 61 8 )
^ rs-j^/t^/rs-'rsf
and
Wx
} a
'
(6 20)
139
(6
21)
(6.22)
where, for a r = a/ah and a r = a/ahr,
A^ = a s n r (h" 1 l
so that
S
" ~(h)
while
B
st = 3 s V
s;t
st
140
0. E. Barndorff-Nielsen
(6.24)
"~Dt , C
141
^^x.
x; ,
a is a location model.
n { 4 b
"2
a b
"1(
4 a b
"])
.- = n{4b"^ + ab"'(u
j
+ 4ab"')} = -p= -^
ot
142
0. E. Barndorff-Nielsen
by a known coefficient of variation /, has properties similar to those exhibited by example 6.1.
Example 6.2. Location-scale model. Let data x consist of a sample
x,,...,x from a location-scale model, i.e. the model function is
n
p(x;,) = "
x.-
X -
X -
-2
V(a
a g"(a
a g"(a ) n+a2g"(a
and, in an obvious notation,
f '(a,)
-3
yy
143
3{4n
Kao "
Furthermore,
*Wo
Furthermore,
the maximum likelihood estimate (,) of (,) exists uniquely, with probability 1, (a,,) is minimal sufficient and the conditional distribution of (,)
given the ancillary a is again hyperboloidic, as in (4.3) but with u, v and
replaced by , and a. It follows that the log likelihood function is
l(x) = K 5 ;x, 9 a) = -a{cosh cosh - sinh x sinh cos(-)}
and hence
= -F
XXX
XX
=f
= -F
XX
=0
A = a cosh x sinh
x
-F A = -a cosh x sinh ,
x
whatever the value of . Thus, in this case, the -geometries are identical.
We note again that whereas the auxiliary statistic a is taken so
as to be ancillary in the various examples discussed here - exactly distribu-
144
0. E. Barndorff-Nielsen
tion constant in the three examples above and asymptotically distribution constant in the one to follow - ancillarity is no prerequisite for the general
theory of observed geometries.
Furthermore, let a be any statistic which depends on the minimal
sufficient statistic t, say, only and suppose that the mapping from t to (,a)
is defined and one-to-one on some subset T~ of the full range X of values of t
though not, perhaps, on all of ]_. We can then endow the model M^ with observed
geometries, in the manner described above, for values of t in T~. The
next example illustrates this point.
The above considerations allow us to deal with questions of nonuniqueness and nonexistence of maximum likelihood estimates and nonexistence of
exact ancillaries, especially in asymptotic considerations.
Example 6.4. Inverse Gaussian - Gaussian model. Let x( ) and y( )
2
be independent Brownian motions with a common diffusion coefficient = 1 and
drift coefficients >0 and , respectively. We observe the process x( ) till it
first hits a level x>0 and at the time u when this happens we record the value
v = y(u) of the second process. The joint distribution of u and v is then
given by
p(u,v;,)
(6.26)
Now, assume equal to . The model (6.26) is then a (2,1) exponential model,
still with t as minimal sufficient statistic. The maximum likelihood estimate
of is undefined if t^T^ where
145
IQ = it = (,v):x0 + v > 0}
(6.27)
he event t^T^ happens with a probability that decreases exponentially fast with
he sample size n and may therefore be ignored for most statistical purposes.
Defining, formally, to be given by (6.27) even for t^T^ and leting
a = "(;2nxQ,2 n 2 ) ,
here ( ;x) denotes the distribution function of the inverse Gaussian disribution with density function
-(x .) = ( 2 ) " ^ e ^ x " 3 / 2 e - ^ x " 1 + * x >
(6.28)
e have that the mapping t -> (,a) is one-to-one from X = {t = (,v):>0} onto
-oo,+) x (0,oo) and that a is asymptotically ancillary and has the property
hat p*(;|a) =c|j p L approximates the actual conditional density of given
to order 0 ( n ~ 3 / 2 ) , cf. Barndorff-Nielsen (1984).
Letting ( ;x) denote the inverse function of "( ;,) we may
rite the log likelihood function for as
{ ( X Q+
-2
V) - U }
+
yy
nd
= 0
(6.29)
14 6
0. E.
^ =
8n
Barndorff-Nielsen
U ;2nx2,2n2)
^(" W O
1
=s
-1
= -h $
where ~ denotes the derivative of "(x;,) with respect to . By the wellknown result (Shuster (1968))
"(x;,) = ( V - hx'h) +
where is the distribution function of the standard normal distribution, "
7.
EXPANSION OF c l j l ^ L
We shall derive an asymptotic expansion of (2.7), by Taylor expansion of cIjI L in around , for fixed value of the auxiliary a. The various
terms of this expansion are given by mixed derivatives (cf. (6.2)) of the log
model function. It should be noted that for arbitrary choice of the auxiliary
statistic a the quantity c|j|E constitutes a probability (density) function on
the domain of variation of and the expansions below are valid. However,
c|j|[ furnishes an approximation to the actual conditional distribution of
given a, as discussed in section 2, only for suitable ancillary specification
of a.
To expand c|j| L in around we first write E as exp{l-} and
expand 1 in around . By Taylor's formula,
1-1=
VX
(-) ...(-) v (8 ...3
V>>
v=2
ll vv
l)()
1-1
oo
v=2
.
X
O
vv
(-) Sl ...(-) Sp 3. . . . 3 . \
O
..._ .
Consequently, writing for - and 6 "' for (-) (-) ..., we have
147
(7.1)
148
0. E. Barndorff-Nielsen
depend on then
atlog|A| = |A ] 3 t |A|
where a
log|A| = -aVraSV\avvars
asl\Vrs.
It follows that
-" t U { * S + r s t u + + r s t ; u + * r s u ; t + + r s ; t u )
(7.3)
By means of (7.2) and (7.3) we therefore find
/ 2
) dd/2
= (2)
c d (-;al + A ] + A 2 + ...}
(7.4)
where .( a-) denotes the density function of the d-dimensional normal distribution with mean 0 and precision (i.e. inverse variance-covariance matrix) a- and
where
A
l - " V ^ W
\^
^St(+rs;t+ I *rst)
and
A2 = [- 3
1
rs t
r s t M vw u
(7
"5)
s;t
149
*rst;u
|^st)(+uv;w+|+uvw)],
(7.6)
A^ and A 2 being of order On""15) and 0(n ), respectively, under ordinary repeated sampling.
By integration of (7.4) with respect to we obtain
(2) d / 2 c = 1 + C 1 + ... ,
(7.7)
where C-. is obtained from A by changing the sign of A and making the substitutions
rstu
the 3 and 15 terms in the two latter expressions being obtained by appropriate
permutations of the indices (thus, for example, <s r s t u -> j - r s ^ t u + > r t ^ s u +
Combination o f ( 7 . 4 ) and ( 7 . 7 ) f i n a l l y
yields
c | j | ^ L = ( - ; ) { l + A1 + ( A g + C ^ + . . . }
(7.8)
with an error term which in wide generality is of order 0(n-3/2 ) under repeated
sampling.
150
0. E. Barndorff-Nielsen
(7.9)
Cr
rs... = V s ( )
+ . + y^ +
. .r.St,
,.r.St"
/-,1 N
(7J0)
^str
(7.)
T"rn
( ;3") denotes the
(7.12)
151
where
Since
hSt(S';j) = ' W * - / V ^ ]
(7.14)
we find
h r S t ( ' ; ^ r $ t =0
and hence (7.11) reduces to
-1/3
,
c|j|
L = .(
- - y i;a ){l - hht (' j ) - rsL
P . + ...},
u
(7.15)
P
a
^ab ^ h + H xa^a'iK ' ^^us ^-^^ offers some simplification over the corresponding expression provided by the Amari and Kumon paper.
Note that, again by the symmetry of (7.14), if
-1/3
*rst[3] = 0
(7.16)
for all r,s,t then the first order correction term in (7.15) is 0. Furtherex
more, for any one-parameter model M^ the quantity % with = -1/3, can be made
to vanish by choosing that parametrization for which is the geodesic coordinate for the -1/3 observed conditional connection.
8.
(8.1)
(8.2)
(2.35) is an exponential representation of M_ relative to an invariant dominating measure on X^ then b(x) is a modulator.
(ii) The norming constant a((g)) does not depend on g. If in
addition B(g) does not depend on g, which implies that B( ) = 0, then the conditional distribution of h given w is, on account of the exactness of (2.7),
152
p(h;h|w) = c ( w ) | j | * e ( h " l h ) w
153
(8.3)
(8.4)
(8.5)
geometries (^( ;,a),*( ;,a)) are all "proportional" for fixed , with a as
the proportionality factor. The geometric leaves of the foliation of M^, determined as the partition of M_ generated by the index parameter , are thus highly
similar.
APPENDIX 1
(Al.l)
(A.2)
154
155
(AT.3)
To see the validity of (A1.3) one needs only note that for fixed u the mapping
k -> J ,. x(u) is a multiplier on K and since K is compact this must be the
trivial multiplier 1. Actually, (A1.3) is a necessary and sufficient condition
for the existence of an invariant measure on _Y. This may be concluded from
Kurita (1959), cf. also Santal (1979), section 10.3.
APPENDIX 2
sections 3 and 5 ) , let denote the natural action of G on H and let denote
left action of G on itself.
Proof.
Then J '(h)(e) = J ( h ) ^
for
a11
hH#
Writing g
:g -> k
ah
D(h')(g) =
9(h'hk)
ah
3(h'hk)
9k
(h')
(e)
(h')(e)
156
APPENDIX 3
An inversion result
The validity of formula (6.24) is established by the following
Lemma. Let G = HK be a left factorization of the group G with the
associated mapping :g = hk -> h (as discussed in sections 3 and 5). Furthermore, let h1 denote an arbitrary element of H. Then
3n(h~ ] h')*
a(h'"h)*
h=h'
h=h'
(A3.1)
H
where i indicates the inversion g -> g-1'. This diagram of mappings between differentiate manifolds induces a corresponding diagram for the associated differential mappings between the tangent spaces of the manifolds, namely
157
158
0. E. Barndorff-Nielsen
"Hi
>
TG .
Di
D
TH
n(hI-1h)
Acknowledgements
I am much indebted to Poul Svante Eriksen, Peter Jupp, Steffen L.
Lauritzen, Hans Anton Salomonsen and Jorgen Tornehave for helpful discussions*
and to Lars Smedegaard Andersen for a careful checking of the manuscript.
REFERENCES
Amari, S.-I. (1982a). Differential geometry of curved exponential families curvatures and information loss. Ann. Statist. 10, 357-385.
Amari, S.-I. (1982b). Geometrical theory of asymptotic ancillarity and conditional inference. Biometrika 69, 1-17.
Amari, S.-I. (1985). Differential-Geometric Methods in Statistics. Lecture
Notes in Statistics 28, Springer, New York.
Amari, S.-I. (1986). Differential geometrical theory of statistics - towards
new developments. This volume.
Amari, S.-I. and Kumon, M. (1983). Differential geometry of Edgeworth expansion
in curved exponential family. Ann. Inst. Statist. Math. 35, 1-24.
Barndorff-Nielsen, 0. E. (1978a).
Wiley, Chichester.
Barndorff-Nielsen, 0. E. (1978b). Hyperbolic distributions and distributions on
hyperbolae. Scand. J. Statist. 5_, 151-157.
Barndorff-Nielsen, 0. E. (1980). Conditionality resolutions. Biometrika 67,
293-310.
Barndorff-Nielsen, 0. E. (1982). Contribution to the discussion of R. J.
Buehler: Some ancillary statistics and their properties. J. Amer.
Statist. Assoc. 77, 590-591.
Barndorff-Nielsen, 0. E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika 70, 343-365.
Barndorff-Nielsen, 0. E. (1984). On conditionality resolution and the likelihood ratio for curved exponential families. Scand. J. Statist. 11, 157159
160
0. E. Barndorff-Nielsen
Barut, A. 0. and Raczka, R. (1980). Theory of Group Representations and Applications. Polish Scientific Publishers, Warszawa.
Boothby, W. M. (1975). An Introduction to Differentiate Manifolds and
Riemannian Geometry. Academic Press, New York.
Burridge, J. (1981). A note on maximum likelihood estimation for regression
models using grouped data. J. R. Statist. Soc. B 43, 41-45.
Chentsov, N. N. (1972). Statistical Decision Rules and Optimal Inference.
(In Russian.) Moscow, Nauka. English translation (1982). Translation of
Mathematical Monographs Vol. 53. American Mathematical Society, Providence,
Rhode Island.
1 6 1
Encyclo-
STATISTICAL MANIFOLDS
Steffen L. Lauritzen
1.
Introduction
165
2.
167
3.
177
4.
Statistical Manifolds
179
5.
190
6.
198
7.
203
8.
206
9.
212
215
10. References
163
1.
INTRODUCTION
166
Steffen L. Lauritzen
statistical manifolds of which some (the Gaussian, the inverse Gaussian and
the Gamma) manifolds are of interest because of their leading role in statistical theory, whereas the examples in section 8 are mostly of interest because
they to a large extent produce counterexamples to many optimistic conjectures.
Through the examples we also try to indicate possibilities for discussing
geometric estimation procedures.
In section 9 we have tried to collect some of the questions that
naturally arise in connection with the developments here and in related pieces
of work.
- U * )A
-i
defined
(3) if V is open, : V -> IR m is a homeomorphism, and o " , o
are
being compatible.
2
168
Steffen L. Lauritzen
0 ) ; (U ) = ]-,[m
i)
(p) = (0
11)
(u nn) = {(x\...,x n ,o
o ) , |xi|<e}
, e IR
f g c (P)
TO
Statistical Manifolds
169
For each particular choice of a coordinate system, there corresponds a canonical basis for T (M), with basis vectors being
X(f+g) = X(f)+X(g)
x(fg) = x(f)g+fx(g)
,IR
f,gC(M)
and now we w r i t e
X p (f) = X ( f ) (P)
The vector fields on IA_ are denoted as X_(MJ. _X(M_) is a module over C(Mj: if
f,gC(M), X,YX(M) then
(fX+gY) (h) = fX(h) + gY(h)
is also in X_(M_). X_(M_) is a Lie-algebra with the bracket operation defined as
[X,Y](f) = X(Y(f)) - Y(X(f))
The Lie-bracket [ ] satisfies
[X,[Y,Z]] + [Y,[Z,X]] + [Z,[X,Y]] = 0
[X,Y] = -[Y,X]
[X+3X9,Y] = [X ,Y] + 3[X9,Y]
1
'
, N IR
,BIR
(Jacobi identity)
(anticommutativity)
(binearity)
The locally defined vector fields E., representing differentiation w.r.t. local
coordinates, constitute a natural basis for the module XjU), where U is a
coordinate neighborhood.
A covariant tensor D of order k is a C-k-linear map
D:
170
Steffen L. Lauritzen
i.e.
D(X r ...,X k ) C(M),
D(X r ...,fX i +gY i ,X i + 1 ,...,X k )
= fD(x r ...,x k ) + gD(x 1 ,...,Y i ,x. + 1 ,...,x k ).
A tensor is always pointwise defined in the sense that if X . = Y ., then
D(X r ...,X k )(p) = D(Y r ...,Y k )(p).
This means that any equations for tensors can be checked locally on a basis
e.g. of the form E . These satisfy [E ,E.] = 0 and all tensorial equations hold
if they hold for vector fields with mutual Lie-brackets equal to zero. This is
a convenient tool for proving tensorial equations and we shall make use of it
in section 3.
A Riemannian metric g is a positive symmetric tensor of order two:
g(x,x) > o
g(x,) = g(Y,x)
Statistical Manifolds
171
V(Y+3Z) = VY + |3VZ, ,6 IR
v (fY) = X(f)Y + fv Y
iii)
v f + g Z = fv Z + gv Z .
X(t) T(t)(H)
such that v X = 0 on , i.e. such that these are all parallel, and such that
/ %%is equal to the given one. We then write
XX /
X
Y(b)
by
172
Steffen L. Lauritzen
v^fVg^+fVr^.
A geodesic is a curve with a parallel tangent vector field, i.e. where
V = 0 On .
Associated with the notion of a geodesic is the exponential map induced by the
connection.
For all pM_, X T (M_) there is a unique geodesic Y , such that
P
P
p
Xy (0) = p
(0) = X n (**)
X
p
p
P
This is determined in coordinates by the differential equations below together
with the initial conditions (**)
x k (t) + x ^ t ^ U K ^ W t ) ) = 0
where X Y (t) = (x (t),...,xm(t)) in coordinates.
P
Defining now for X T (M_)
exp{X p } = (1)
we have exptX } = Y (t).
X
p
P
The exponential map is in general well defined at least in a neighborhood of zero in T (M.) and can only in special cases be defined globally.
In general, geodesies have no properties of "minimizing" curve
length. However, on any Riemannian manifold, (i.e. a manifold with a metric
tensor g ) , there is a unique affine connection v satisfying
i)
v Y - v X - [X,Y] 0
)
S t a t i s t i c a l Manifolds
173
(g i j ) = (g-jj 1 ,
174
Steffen
covariant d e r i v a t i v e
L. Lauritzen
i s defined as
(vD)(X...,Xk) = vD(X...,Xk)
D(X...,vX.,...,Xk).
If the
R(X,Y)Z = -R(Y,X)Z
b)
c)
R(X,Y,Z,W) = -R(Y,X,Z,W)
ii)
lii)
R(X,Y,Z,W) = -R(X,Y,W,Z)
iv)
R(X,Y,Z,W) = R(Z,W,X,Y)'
Statistical Manifolds
(r
175
1nn jk "
,= g(R(x,Y)Y,x)
(
X
'Y
g(x,x)g(Y,Y)-g(x,Y) 2
If the curvature
where (X/g(X,X),u1,...,u
c 1 R(u i ,u i )
where u,,...
We then have the identity
9 u m is an orthonormal system in Tp(M).
i
S(p) = K ( u
i.J
).
i 3
If M^ has a
of the
17 6
Steffen
L. Lauritzen
' ) =v x "
or,equivalently, as
H N (X,Y,Z) = g ( H N ( X , Y ) , Z )
3.
A family of probability measures P on a topological space _X inherits its topological structure from the weak topology. Most statistical models
are parametrized at least locally by maps (homeomorphisms)
: U + c l R m
i j k = "ij - f i j k
IR
-where
178
Steffen L. Lauritzen
(for =-l).
ijk "
4.
STATISTICAL MANIFOLDS
180
Steffen L. Lauritzen
(3.1)
It is the
(3.2)
-f [D(X,Y) - D(Y,X)] = 0.
111)
- g([x,Y],z)
Calculating now i) - ii) + iii) we get
Xg(Y,Z) - Zg(X,Y) + Yg(Z,X) = D(X 5 Y,Z)
-g([Z,X],Y) - g([Z,Y],X) - g([X,Y],Z) + 2g(v Y,Z).
Statistical Manifolds
181
(3.3)
(v*)* = v.
(v)* = v .
182
Steffen L. Lauritzen
Proof:
g(v Y,Z) = g(v Y,Z) - I D(X,Y,Z)
g(Y,v Z) = g(Y,v Z) + | D(X,Z,Y)
Adding and using the symmetry of D together with the defining property of the
Riemannian connection we get
g(v Y> Z ) + 9(Y^v Z) = Xg(Y.Z)
(3.4)
The relation (3.4) is important and was also obtained by Amari (1983). If we
now consider the curvature tensors R and R* corresponding to v and v* we obtain
the following identity:
3.5
Proposition
R(X,Y,Z,W) = -R*(X,Y,W,Z)
(3.5)
R = R*
11)
R(X,Y,Z,W) = -R(X,Y,W,Z)
Statistical Manifolds
183
* is torsion free
ii)
D-. is symmetric
iii)
v=
Proof: That D, is symmetric in the last two arguments follows from the
calculation
D^X.Y.Z) = g(v*Y,Z) - g(v Y,Z)
= Xg(Y>Z) - g(Y,v Z) - [Xg(Y,Z)-g(Y,v*Z)]
= D (X,Z,Y)
The difference between two connections is always a tensor field,
i) -* ii)
184
Steffen L.
Lauritzen
+ ^g(Y,v Z) = Xg(Y,Z)
and we obtain
3.9
Corollary
- 1
-1
v* = v, v = 7 , v = v*.
V ^ V ^ " ^'h^ V*
We have thus established a one-to-one correspondence between a statistical
manifold (NUg,D) and a Riemannian manifold with a connection v whose conjugate
v* is torsion free, the relation being given as
D(X,Y) = v*Y - v Y
v Y = v Y - ^ 2 D(X 9 Y).
In some ways it is natural to think of the statistical manifolds as
being induced by the metric (Fisher information) and one connection (v) (the
exponential), but the representation (M,g,D) is practical for mathematical
purposes, because D has simpler transformational properties than v.
By direct calculation we further obtain the following identity for
a statistical manifold and its -connections
3.10 Proposition
g(v Y,Z) - g(v Z,Y) = g(v Y,Z) - g(v Z,Y)
Proof: The result follows from
g(v Y,Z) - g(v Z,Y) = g(v Y,Z) - g(v Z 9 Y)
- |D(X,Y,Z) + |D(X,Z,Y)
and the symmetry of D.
(3.6)
Statistical Manifolds
185
i)
R = R for all lR
ii)
F is symmetric
Proof: The proof reminds a bit of bookkeeping. We are simply going to establish the identity
R(X,Y,Z,W) - R(X,Y,Z,W) = {F(X,Y,Z,W) - F(Y,X,Z,W)}
(3.7)
by brute force.
Symmetry of F in the last three variables follows from the symmetry
of D. We have
2F(X,Y,Z,W) = 2XD(Y,Z,W)
-2(D(vY,Z,W) + D(Y,vZ,W) + D(Y,Z,vW))
Since v = ^(v + v) and oD(X,Y,Z) = g(v Y,Z) - g(v Y,Z) we further get
2D(vY,Z,W) = 2g(v z W,v Y) - 2g(v z W,v Y)
= g(v z W,v Y) + g(v z W,v Y)
- g(v z W,v Y) - g(v z W,v Y),
and similarly for the two other terms. Further we get
2XD(Y,Z,W) = 2X(g(vZ,W) - g(v Z,W))
= 2g(v v Z,W) - 2g(v v Z,W)
+ 2g(v Z,v W) - 2g(v Z,v W)
Collecting terms we get the following table of terms in 2F(X,Y,Z,W), where
lines 1-3 are from 2XD(Y,Z,W), 4 and 5 from 2D(vY,Z,W) 6 and 7 from
2D(Y,vZ,W) and 8 and 9 from 2D(Y,Z,v W).
186
Steffen L. Lauritzen
with - sign
- -
1.
2g(v vZ,W)
2g(v v Z,W)
2.
g(v ,v w)
g(v Z,v W)
3.
g(v Y,v w)
4.
g(v z w,v )
5.
g(v z w,v )
6.
g(v W,v Z)
8.
g(v Z,v W)
9.
g w v^- 9 ^ v " /
g(v W,v Z)
-
g(v W,v Z)
-
g(v z,v w)
g(v z w,v )
g(v z w,v )
g(v W,v Z)
g(v z,v w)
7.
g(v Z,v W)
0
3 n + such that R = 0,
i.e. that the manifold is n-f
Statistical Manifolds
187
R(X,Y,Z,W) = -R(X,Y,W,Z)
R(X,Y,Z,W) = R(Z,W,X,Y)j
This implies as mentioned earlier that the sectional curvature determines the
curvature tensor.
We shall later see examples of statistical manifolds actually
generated by a statistical model that are not, conjugate symmetric.
It also follows that the condition
3 0 t 0 such that FP = R
(3.9)
ni
= E.(())
() + () - i = 0.
(3.10)
188
Steffen L. Lauritzen
gfvj,z) p = g(v,z) = o
* &h
Statistical Manifolds
189
where exp is the exponential map associated with the -connection and T (Nj~ is
the set of tangent vectors orthogonal to N at p. In general the exponential map
might not be defined on all T p (Nj~, but then let it be maximally defined,
p is then an -estimate of p, assuming N^ if
P A (p).
Amari (1985) shows that if M_ is -flat and N_ is --geodesic, then the -estimate
is uniquely determined and it minimizes a certain divergence function.
This suggest that it might be worthwhile studying procedures that
use the --estimate for -geodesic hypotheses N_, and call such a procedure
geometric estimation.
5.
zation we obtain the following expressions for the metric, the -connections
and the D-tensor (skewness) expressed as T... (cf. Amari, 1985).
. 1 /I 0\
"7lo 2)
lll
122
-,
212
12
= -(1+cO/ 3
22
222
T
^ =
-<
= 2n
221
21
]21
1 1 2 = (l-)/3
= T
= T
121 = T 2
2/
221 "
T
222
8 /
Statistical Manifolds
191
For = 0 (the Riemannian case) we have K(,p) = -1/2 and the manifold is the
space of constant negative curvature (Poincar's haifplane or hyperbolic space).
Note that it also has constant -curvature for all although nobody knows what
that implies, since such objects have r\e\/er been studied previously.
To find all -geodesic submanifolds of dimension 1 we proceed as
follows. Let (e,E) denote the tangent vector fields
e = -2-
E = -?L
8 *
= {(,) |=Q},oIR
(4.1)
where we have let (t) = + (t) and extended to a function defined on all of
the manifold by (x,y):= (x).
ot
where we have used torsion freeness and the fact that e() = , E() = 0. Using
k we get
now the expressions for r..,
N
192
Steffen L. Lauritzen
1- . l+2 2
= -x +
do
(4.3)
The (V ,V ") manifolds do not represent -geodesic foliations since they are
not -geodesic for the same value of . For = 0 we see that the geodesic sub2
2
manifolds are parabola's in (y, ) with coefficient -k to y , a result also
obtained by Atkinson and Mitchell (1981) and Skovgaard (1984).
Consider now the hypothesis (y,) V , i.e. that of constant variation coefficient. We shall illustrate the idea of geodesic estimation in this
example as described at the end of section 3.
2
V is =l+2 geodesic. The ancillary manifolds to be considered
are then --geodesic manifolds orthogonal to V .
An arbitrary --submanifold is the "parabola"
Statistical Manifolds
193
= (-(l+2)y2+B+C)!5
p
which follows from (4.3) with = -(l+2 ). Its tangent vector is equal to
[-2(l+2)y+B]E+e.
e+E = ^
anc
=
y
i]it?_,O[
V
1f <0.
1+Y
2
The manifolds W , ylR actually constitute a -(l+2 ) -foliation of the Gaussian
manifold.
(l+ 2 )x 2 +S 2
(l+ 2 )x+ 2
194
Steffen L. Lauritzen
-2
Statistical Manifolds
195
-(l+2 ) -parallel and whose lengths are equal to one. Further they are to be
orthogonal to the hypothesis (and thus tangent to the estimation manifolds).
The directions should thus be given as
v = (v v 2 ) = -e + j^ E
To obtain unit length, we get ||v|
V 2
when =, and our orthogonal field is thus
V(y) [v^J.
a[-,i]
196
Steffen L. Lauritzen
+ (l+2 )f (t ,)
(4.5)
(4.6)
to be equivalent to
2
f = K"
Inserting (4.5) into this we obtain
(4.7)
2
f = K(-(HV + (l+22)f2
G(*2-E-*-) = Kt+C
\4.o)
4 2 + 1 G{fl)
__ _ a 4 2 4 2 + l t + y +
Y
and dividing by 4v +1 yields thus
()
6(1)
Statistical Manifolds
197
hypothesis as those where t is fixed and only y varying, we see that s/x is in
one-to-one correspondence with t. We shall therefore say that s/x is the
geometric ancillary and this it also is the geometric test statistic for the
hypothesis =.
It is of course interesting, although not surprising, that this
test statistic (ancillary) is obtained solely by geometric arguments but still
equal to the "natural" when considering the transformation structure of the
model.
f(;x.*>-
V 3 / 2 , ,>o
w.r.t. Lebesgue measure on IR + . We choose to study this manifold in the parametrization (n,), where
n =x "
, i.e.
f(x n.) =
The metric tensor and the skewness tensor can now be calculated either by using
their definition directly or by calculating these in the (,) coordinates and
using transformation rules of tensors. We get
1
g=
and T 1 1 2 = 0 ,
ni
2 2 2
in
= -O+)/(2n 3 ), Tm
2 2 ] = (l-o)/(2 2 ), ] 2 2 =
222
2 1 2
= (3-l)/(2 2 n)
= -(l+)/(2 2 )
S t a t i s t i c a l Manifolds
T\ = -(l+)/
199
f, = j 2 - ^ = 0
2 2 = (3-l)/2.
To find all geodesic submanifolds of dimension one we first notice
2
are -geodesic for all , i.e. geodesic and they constitute a geodesic foliation
of the inverse Gaussian manifold. Because
$ X = " 1
they correspond to hypotheses of constant expectation.
Consider now a submanifold of the form (n(t),t), i.e. with tangent
N given as
N = e + E, where e = ^ , E = .
We extend by letting (x,y): = n(y), i.e. such that e() = 0, E() = . Then
e +
2 V
+ V
/'*
1+
1+ 2
l-\
l - \ , / 1+
1+
,, 3 - l \
jE
= ( " + ) )
+ ((- +
We now have V..N = hN i f f
r
[
1+
, 3-l
ir
] = [
1+
3-l
"1
-1
, 1-
~r
200
Steffen L. Lauritzen
l-3
=(-l)t
.^2
_^
Whereby
n(t) - %
For =l (the exponential connections) we get the parabolas:
n(t) = Bt 2 + C
and for =-l (the mixture connection) we get the curves:
(t) = -t + B/t + C.
In the Riemannian case (=0) we get
n(t) = -2t + B / T + C
that are parabolas in (/~,).
The curvature tensor is given by
The manifold is thus conjugate symmetric (we already know, since it is an exponential family) and the sectional curvature is
( 12 ) = -R 1 2 1 2 /(g 1 1 g 2 2)
-O- 2 )/2
Note that the Riemannian curvature (=0) is again constant equal to -h9 as in
the Gaussian case. In fact the -curvature is exactly as in the Gaussian case.
We can map the inverse Gaussian manifold to the Gaussian by letting
V = J2Q
= /2
and this map is a Riemannian isometry. However, it does not preserve the skewness tensor and thus the Gaussian and inverse Gaussian manifolds do not seem to
be isomorphic as statistical manifolds, although they are as Riemannian manifolds.
Corresponding to the hypothesis of constant coefficient of variation, we shall investigate the submanifold corresponding to the exponential
Statistical Manifolds
201
^y
>0
2(-i:
1 ^ ~ - 2+37'
(5.1)
to get the intersecting point and orthogonality at ( ^ ~ ' &,?) gives
3+l
l-9
'
Combining this with (5.1) we get C=0, i.e. the estimation manifolds are given as
3+
t+\
2(+) +
1-3
.
l-9
The manifolds W^, >0 again constitute a --foliation of the inverse Gaussian
D
"
"0 0
2
9 -1 Q
8 Q j
2
3+l
from
202
Steffen
^*
L. L a u r i t z e n
]/
7.
>0, 3>0
w.r.t. Lebesgue measure on IR + . The metric tensor is obtained by direct calculation in the (y,)-parametrization as
"T
g-
0 (3)
+ 3
j 9 ik " 3 k g ij-' t 0
be
ijk = t ( E 1 E j ( ) E k ( ) )
rni
= -23/3
122
t0 be
r 1 2 1 = 1/2
1
212
222
221 =
1
and the skewness tensor T\ ... = 2 ( r ^ - k - r . . k )
T
2
T
221
122 = T 212 =
203
222
'
204
Steffen L. Lauritzen
lll
2= 7 T
<^y
i.
=r
= l^1
2
^11 9 2
r
= il2
*222
2
2y
!i= -
[
23
22
4 ()
The space is conjugate symmetric and therefore the curvature tensor is fully
determined by the sectional (scalar) curvature which is
K
< ) = -R
g 1 g 2 2 = 1-2 [()+'()3
Note that this is even for =0 different from the two previous examples in that
the curvature is non-constant and truly dependent on the shape parameter 3.
To find all geodesic submanifolds we proceed as follows. If μ = μ₀ is constant on N, then X(N) is spanned by the tangent vector E corresponding to differentiation w.r.t. the second coordinate. Since Γ^{(α)}_221 = 0 for all α, these submanifolds are geodesic for all values of α and constitute a geodesic foliation of the gamma manifold.
Considering the manifold given by β = β₀, its tangent space is spanned by e, and since Γ^{(α)}_112 ≠ 0 for α ≠ 1 it is geodesic only for α = 1. More generally, for submanifolds of the form (f(t), t) the tangent space is spanned by N = ḟe + E, and since e(ḟ) = 0 and E(ḟ) = f̈ we have

∇_N N = f̈e + ḟ²∇_e e + 2ḟ∇_e E + ∇_E E.

Writing out ∇_N N = hN in the α-connection symbols yields a differential equation for f which unfortunately does not seem soluble in general. For α = 1 the solutions are

f(t) = t/(At + B).
8.
In the present section we shall see that things are not always as
simple as the previous examples suggest, but even then we seem to be able to get
some understanding from geometric considerations.
First we should like to notice that when we combine two experiments independently with the same parameter space, both the Fisher information metric and the skewness tensors are additive. Let X ∼ P_θ and Y ∼ Q_θ be independent, and let A_i, B_i denote the derivatives of the two log-likelihood functions,

A_i = ∂/∂θ_i log f(x; θ),  B_i = ∂/∂θ_i log g(y; θ).

Then the combined experiment has skewness tensor

T_ijk = E A_i A_j A_k + E B_i B_j B_k,

since all terms containing both A's and B's vanish due to the independence and the fact that E A_i = E B_i = 0; the same argument gives additivity of the metric.
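The additivity and the vanishing of the cross terms are easy to see in a simulation; here is a minimal sketch added for illustration, with an arbitrary choice of the two experiments (a normal and a Poisson observation of the same θ).

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 2.0, 10**6
A = rng.normal(theta, 1.0, n) - theta          # score of X ~ N(theta, 1)
B = rng.poisson(theta, n)/theta - 1.0          # score of Y ~ Po(theta)
s = A + B                                      # score of the combined experiment
print(np.mean(s*s),  1.0 + 1.0/theta)          # metrics add: 1 + 1/theta
print(np.mean(s**3), np.mean(A**3) + np.mean(B**3))  # cross terms vanish
```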
Consider, for example, the combined experiment with metric

g = diag(1/σ², (2 + σ²)/σ²)

and skewness as in the Gaussian case,

T_112 = 2/σ³,  T_222 = 8/σ³,

as obtained, e.g., by supplementing X ∼ N(μ, σ²) with an independent Y ∼ N(σ, 1). Since the derivatives of the metric are as in the Gaussian case, so are the Γ^{(0)}_ijk-symbols. But the α-connections are truly different, which is seen by looking at the symbols with raised index:

Γ^{(α)2}_11 = (1−α)/(σ(2+σ²)),  Γ^{(α)1}_12 = Γ^{(α)1}_21 = −(1+α)/σ,  Γ^{(α)2}_22 = −2(1+2α)/(σ(2+σ²)),
and all others equal to zero. Considering now the curvature tensor we get

R^{(α)}_1212 = −(1+α)[2(1−α) + (2+α)σ²] / (σ⁴(2+σ²)),

but R_1211 = R_2111 = R_1222 = R_2122 = 0, such that the components determined by the above through the usual symmetries are the only ones that are not vanishing.
If we try to find the geodesic submanifolds we first observe that, because Γ^{(α)}_221 = 0 for all α, the submanifolds

N_{μ₀} = {(μ, σ) | μ = μ₀}

are totally geodesic for all α, and thus constitute a geodesic foliation of the manifold.
For a submanifold (t, f(t)) with tangent N = e + ḟE we get

∇_N N = f̈E + ∇_e e + 2ḟ∇_e E + ḟ²∇_E E
      = −2(1+α)ḟ/σ · e + [f̈ + (1−α)/(σ(2+σ²)) − 2(1+2α)ḟ²/(σ(2+σ²))] E.
If we instead supplement the Gaussian experiment with an independent scale-type observation contributing 1/σ² to the information about σ and nothing to the skewness (e.g. log Y ∼ N(log σ, 1)), the metric becomes g = diag(1/σ², 3/σ²) with Γ^{(0)}-symbols

Γ^{(0)}_112 = 1/σ³,  Γ^{(0)}_121 = Γ^{(0)}_211 = −1/σ³,  Γ^{(0)}_222 = −3/σ³,
Γ^{(0)}_111 = Γ^{(0)}_122 = Γ^{(0)}_212 = Γ^{(0)}_221 = 0,

while the skewness remains Gaussian, or in the Γ^{(α)}_ijk-symbols:

Γ^{(α)}_112 = (1−α)/σ³,  Γ^{(α)}_121 = Γ^{(α)}_211 = −(1+α)/σ³,  Γ^{(α)}_222 = −(3+4α)/σ³,

and all others equal to zero.
The curvature tensor can be calculated to be

R^{(α)}_1212 = −(1−α)(3+α)/(3σ⁴),  whereas  R^{(−α)}_1212 = −(1+α)(3−α)/(3σ⁴),

so the space is not conjugate symmetric.
Considering again submanifolds of the form (t, f(t)) with tangent N = e + ḟE, the condition ∇_N N = hN reduces, for α = 0, to

3ff̈ + 3ḟ² = −1,

i.e.

f² = −t²/3 + At + B,

i.e. again parabolas in the (μ, σ²) parametrization but with a different coefficient to t².
Note that, in fact, considered as a Riemannian manifold there is no essential difference between this and the univariate Gaussian manifold, since we have constant scalar Riemannian curvature equal to −1/3.
For the submanifolds of constant coefficient of variation, σ = cμ, we have f(t) = ct, and the condition ∇_N N = hN then reduces to

(3 + 2α)c² = α − 1,

which determines a suitable α for every c with c² ≠ ½; but if c = √2/2, the equation has no solution!! In other words, all "constant variation coefficient submanifolds" of the manifold studied are α-geodesic for suitably chosen α except one (c² = ½).
A reasonable explanation for this is at present beyond my imagination. Is there a missing connection (α = ∞)? Have I made a mistake in the calculations? Or is it just due to the fact that the phenomenon is related to how this model sits as a submodel of the strange two-dimensional model? In any case, there is a remarkable disharmony between the group structure and the geometry.
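Numerically the exceptional submanifold shows up as a pole of α as a function of c; the following snippet is an illustration added here, based on the relation just derived.

```python
import numpy as np

def alpha(c):
    # alpha making sigma = c*mu alpha-geodesic: (3 + 2*alpha)*c**2 = alpha - 1
    return (3*c**2 + 1)/(1 - 2*c**2)

for c in (0.1, 0.5, 0.7, 0.7071, 0.71, 1.0, 2.0):
    print(c, alpha(c))   # blows up as c -> sqrt(2)/2
```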
In a three-dimensional extension of the model, the metric, the skewness tensor, and the α-connections are identical to the Gaussian case when only the indices 1 and 2 appear, and all components involving the third coordinate are equal to zero. Letting (e, E, F) denote the basis vectors for the tangent space determined by coordinatewise differentiation, we consider now the "constant coefficient of variation" submanifold

{(t, t, log t), t ∈ ℝ₊}

with tangent vector N = e + E + t⁻¹F. Inserting the expressions for the α-connections we obtain ∇_N N explicitly.
9.
two Riemannian manifolds are locally isomorphic if they have identical curvature tensors. Do similar things hold for statistical manifolds and their α-curvatures? Note that the inverse Gaussian and Gaussian manifolds seem to be alike but not fully isomorphic. Results of Amari (1985) seem to indicate that α-flat families are very similar to exponential families. Are they in some sense equivalent? There might be many interesting things to be seen in this direction.
4. Some statistical manifolds seem to have special properties. As mentioned above we have e.g. α-flat families, but also manifolds that are conjugate symmetric, or manifolds with constant α-curvature, both for a particular α and for all α at the same time. Which maps preserve these properties? Can they in some sense be classified?
5. How do the geometric structures behave when we form marginal and conditional experiments? Some work has been done on this by Barndorff-Nielsen and Jupp (1984, 1985).
6. Is there a decomposition theory for statistical manifolds? We have seen that there might be a connection between the existence of geodesic foliations and independence of estimates. There might be a de Rham-like theory to be discovered by studying parallel transports along closed curves in flat manifolds.
7. Chentsov (1972) showed that the expected geometries were the only ones that obeyed the axioms of a decision theoretic view of statistics, in the case of finite sample spaces. It seems of interest to investigate generalizations of this result, both to more general spaces and to other foundational frameworks. Picard (1985) has generalized the result to the case of exponential families and has some results pertaining to the general case.
8. What insight can be gained by studying the difference between observed and expected geometries?

9. What is the relation between the geometric structure of a Lie transformation group and the geometric structure of its transformational statistical models?
Other questions and problems are raised by Barndorff-Nielsen, Cox,
and Reid (1986) and in the book by Amari (1985).
Acknowledgements
The author is grateful to Ole Barndorff-Nielsen, Preben Blæsild, and Erik Jørgensen for discussions relevant to this manuscript at various stages.
REFERENCES
Amari, S.-I. (1982). Differential geometry of curved exponential families - curvatures and information loss. Ann. Statist. 10, 357-385.
Amari, S.-I. (1985). Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics 28, Springer, New York.
Barndorff-Nielsen, O. E. and Jupp, P. E. (1985). Profile likelihood, marginal likelihood and differential geometry of composite transformation models. Res. Rep. 122, Dept. Theor. Stat., Aarhus University.
Boothby, W. M. (1975). An Introduction to Differentiable Manifolds and Riemannian Geometry. Academic Press, New York.
Chentsov, N. N. (1972). Statistical Decision Rules and Optimal Conclusions (in Russian). Nauka, Moscow. English translation (1982): Amer. Math. Soc., Providence, Rhode Island.
Efron, B. (1975). Defining the curvature of a statistical problem (with applications to second order efficiency) (with discussion). Ann. Statist. 3, 1189-1242.
CHAPTER 5.

DIFFERENTIAL METRICS IN PROBABILITY SPACES

C. R. Rao

1. INTRODUCTION
criteria for the choice of an appropriate metric for a given problem. Amari has
stated that a metric should reflect the stochastic and statistical properties
of the family of probability distributions. In particular he emphasized the
invariance of the metric under transformations of the variables as well as the
parameters. Čencov (1972) shows that the Fisher information metric is unique under some conditions including invariance. Burbea and Rao (1982a) showed that the Fisher information metric is the only metric associated with invariant divergence measures of the type introduced by Csiszár (1967). However, there
exist other types of invariant metrics as shown in Section 3 of this paper.
The choice of a metric naturally depends on a particular problem under investigation, and invariance may or may not be relevant. For instance, consider the space of multinomial distributions,

S = {(p_1, ..., p_n) : p_i > 0, Σ p_i = 1},  (1.2)

which is a submanifold of the positive orthant

X = {(x_1, ..., x_n) : x_i > 0}  (1.3)

of the Euclidean space Rⁿ.
models for observed data to examine consistency and robustness of results. The
variety of metrics reported in the paper would be of some use in this direction.
2.

Let p(x, θ), θ ∈ Θ ⊂ Rⁿ, be a family of probability densities and let H be an entropy functional. For fixed weights λ, μ > 0 with λ + μ = 1, the Jensen difference between p_θ and p_φ is

J(θ, φ) = H(λp_θ + μp_φ) − λH(p_θ) − μH(p_φ).  (2.3)

Expanding J around φ = θ gives

J(θ, θ + dθ) = (1/2!) ΣΣ g^H_ij(θ) dθ_i dθ_j + (1/3!) ΣΣΣ c^H_ijk(θ) dθ_i dθ_j dθ_k + ...  (2.4)

In (2.4), the coefficients of the first order differentials vanish since J(θ, φ) has a minimum at φ = θ, and notation such as ∂²J(θ, φ=θ)/∂φ_i ∂φ_j is used for replacing φ by θ after carrying out the indicated differentiations.
From the definition of the J function, it follows that (g^H_ij) is a non-negative definite matrix and obeys the tensorial law under transformation of parameters. We define the matrix and the associated differential metric

ds² = ΣΣ g^H_ij(θ) dθ_i dθ_j.  (2.5)
Theorem 2.1. If the entropy functional is of the form

H(p) = −∫ h(p) dν(x),  (2.10)

with h twice continuously differentiable, then

g^H_ij(θ) = λμ ∫ h″(p(x,θ)) (∂p/∂θ_i)(∂p/∂θ_j) dν(x).  (2.11)

Proof: By definition

g^H_ij(θ) = ∂²J(θ, φ=θ)/∂φ_i ∂φ_j = ∂²H(λp_θ + μp_φ)/∂φ_i ∂φ_j − μ ∂²H(p_φ)/∂φ_i ∂φ_j,

evaluated at φ = θ. Differentiating under the integral sign, the terms containing h′ cancel, and the terms containing h″ carry the factors −μ² and μ respectively, which combine to μ(1−μ) = λμ, giving (2.11).
If h(x) = x log x, leading to Shannon's entropy, then

g^H_ij(θ) = λμ ∫ (1/p)(∂p/∂θ_i)(∂p/∂θ_j) dν(x) = λμ E[(∂ log p/∂θ_i)(∂ log p/∂θ_j)],  (2.12)

i.e. a multiple of the Fisher information matrix; other choices of h give in the same way the expressions (2.13) which provide the elements of the α-order entropy information matrix, and the corresponding differential metric given in Burbea and Rao (1982a, 1982b).
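For a concrete illustration (added here, using a Bernoulli family and equal weights λ = μ = ½), the quadratic coefficient of the Jensen difference can be computed by finite differences and compared with one quarter of the Fisher information.

```python
import numpy as np

def H(p):
    # Shannon entropy of a Bernoulli(p) distribution
    return -(p*np.log(p) + (1 - p)*np.log(1 - p))

def J(t, s):
    # Jensen difference with equal weights lambda = mu = 1/2
    return H((t + s)/2) - (H(t) + H(s))/2

t, h = 0.3, 1e-4
d2 = (J(t, t + h) - 2*J(t, t) + J(t, t - h))/h**2   # = g^H at theta = t
print(d2, 1/(4*t*(1 - t)))   # lambda*mu = 1/4 times the Fisher information
```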
We prove Theorem 2.2 which gives alternative expressions for the coefficients of the third order differentials in the expansion of J(θ, φ).

Theorem 2.2. For H of the form (2.10), the coefficients c^H_ijk(θ) in (2.4) are given by the third order derivatives

c^H_ijk(θ) = ∂³J(θ, φ=θ)/∂φ_i ∂φ_j ∂φ_k,

and can be written as integrals of h″ and h‴ against products of the derivatives ∂p/∂θ_i (2.14)-(2.17).

Proof: As for Theorem 2.1, carrying out the differentiations under the integral sign and replacing φ by θ.

Adopting the notation of Amari for the α-connexion,

Γ^{(α)}_{ij,k} = Γ^{(0)}_{ij,k} − (α/2) T_ijk,

the coefficients c^H_ijk can be expressed through the symbols Γ^{(0)}_{ij,k}, Γ^{(0)}_{jk,i}, Γ^{(0)}_{ik,j} and the skewness tensor T_ijk.
3.
The quadratic entropy of a distribution p is defined as

Q(p) = ∫∫ K(x, y) p(x) p(y) dν(x) dν(y),  (3.1)

where the kernel K satisfies

ΣΣ K(x_i, x_j) a_i a_j ≤ 0

for any choice of (x_1, ..., x_n) and of (a_1, ..., a_n) such that a_1 + ... + a_n = 0, with the further condition K(x, y) = 0 if x = y. Under these conditions the quadratic entropy is concave over the space of probability distributions and its Jensen difference has nice convexity properties, which makes it an ideal measure of diversity.
In computing the associated metric we observe that

Q(λp_θ + μp_φ) = ∫∫ K(x, y)[λp(x,θ) + μp(x,φ)][λp(y,θ) + μp(y,φ)] dν(x) dν(y),  (3.2)

so that, differentiating twice with respect to φ and putting φ = θ,

g^Q_ij(θ) = −2λμ ∫∫ K(x, y) (∂p(x,θ)/∂θ_i)(∂p(y,θ)/∂θ_j) dν(x) dν(y),  (3.3)

which is non-negative definite since the functions ∂p/∂θ_i integrate to zero.
Using the expression (2.14), we find on carrying out the necessary computations the corresponding third order coefficients c^Q_ijk(θ), given by integrals of K against first and second order derivatives of p(x, θ) (3.4). It is of interest to note that the expressions (3.3) and (3.4) are invariant under transformations of both the parameters and the variables. For further properties of quadratic entropies, the reader is referred to Lau (1984) and Rao (1984).
4.

Burbea and Rao (1982a, 1982b), Burbea (1986) and Eguchi (1984) have considered metrics arising out of a variety of divergence measures between probability distributions. Consider

D_F(p_θ, p_φ) = ∫ F[p(x,θ), p(x,φ)] dν(x),  (4.1)

where (i) F(·,·) is a C³-function on R₊ × R₊, and (ii) F(x, x) = 0 and ∂F(x,y)/∂y = 0 at y = x, so that D_F attains its minimum at φ = θ. Let

F_11 = ∂²F(x,y)/∂x²,  F_12 = ∂²F(x,y)/∂x∂y,  F_22 = ∂²F(x,y)/∂y².
Then, expanding D_F(p_θ, p_φ) around φ = θ as in (2.4), the coefficient of the second order differentials is

g^F_ij(θ) = ∫ F_22[p(x,θ), p(x,θ)] (∂p/∂θ_i)(∂p/∂θ_j) dν(x).  (4.2)-(4.3)

In this case, for the invariant divergence measures of Csiszár's type, F(x, y) = x f(y/x) with f(1) = f′(1) = 0, this reduces to

g^F_ij(θ) = f″(1) g_ij(θ),  (4.4)

where g_ij are the elements of Fisher's information matrix. Thus a wide class of invariant divergence measures provide the same informative geometry on the parameter manifold.
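As an illustration of (4.4) (added here; the source only states the result), the second derivative of D_F along a Bernoulli family is the Fisher information 1/(θ(1−θ)) times the constant f″(1), e.g. 1 for Kullback-Leibler divergence and ½ for the squared Hellinger distance.

```python
import numpy as np

def div(t, s, F):
    # D_F between Bernoulli(t) and Bernoulli(s): sum over the two outcomes
    return F(t, s) + F(1 - t, 1 - s)

KL  = lambda x, y: x*np.log(x/y)                 # f(u) = -log u, f''(1) = 1
HEL = lambda x, y: (np.sqrt(x) - np.sqrt(y))**2  # f(u) = (1-sqrt(u))**2, f''(1) = 1/2

t, h = 0.3, 1e-4
fisher = 1/(t*(1 - t))
for F in (KL, HEL):
    d2 = (div(t, t + h, F) - 2*div(t, t, F) + div(t, t - h, F))/h**2
    print(d2/fisher)   # the constants f''(1): ~1.0 and ~0.5
```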
For a divergence of Csiszár's type one may also consider the dual divergence obtained from

f*(u) = u f(1/u),  (4.6)

which interchanges the roles of p_θ and p_φ. Differentiating D_F three times with respect to φ and setting φ = θ yields in the same way expressions for g^F_ij and c^F_ijk for a general D_F. However, the approach
5.

Consider next the divergence measure D₀ and the associated local expansion with coefficients (5.1)-(5.2). In the case when p(x, θ) is a p-variate normal density with mean μ and fixed variance-covariance matrix Σ, the coefficient (5.2) can be easily computed to be proportional to σ^{ij}, the (i,j)-th element of Σ⁻¹, which is indeed the (i,j)-th element of the Fisher information matrix.
For a general entropy functional the corresponding coefficients are given by (5.3) and (5.4). Let

H(p) = −∫ h(p) dν(x)

as chosen in (2.10). Then (5.4) reduces to the h-entropy information matrix derived in (2.10), apart from a constant.
Similarly, the coefficients of the third order differentials can be written in terms of Amari's connexion symbols as

Γ^{(1)}_{ij,k} + Γ^{(1)}_{ik,j} + Γ^{(1)}_{jk,i},

apart from terms in the skewness tensor

T_ijk = E[(∂ log p/∂θ_i)(∂ log p/∂θ_j)(∂ log p/∂θ_k)].
6. GEODESIC DISTANCES

Given a differential metric

ds² = ΣΣ g_ij(θ) dθ_i dθ_j,  (6.1)

where the matrix (g_ij) is positive definite, the geodesic curve θ = θ(t) can in principle be determined from the Euler-Lagrange equations

Σ_i g_ik θ̈_i + Σ_i Σ_j Γ_ijk θ̇_i θ̇_j = 0,  k = 1, ..., n,  (6.2)

and from the boundary conditions

θ(t₁) = θ,  θ(t₂) = φ.

In (6.2), the quantity

Γ_ijk = ½[∂g_jk/∂θ_i + ∂g_ik/∂θ_j − ∂g_ij/∂θ_k]  (6.3)

is the Christoffel symbol of the first kind.
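Equations (6.2) can also be integrated numerically; the sketch below (added here) does so for the normal-family metric of example (2) below, and checks two invariants: the conserved speed along an affinely parametrized geodesic, and the fact that in u = μ/√2 the geodesics are semicircles centered on the boundary σ = 0.

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y):
    # Euler-Lagrange equations (6.2) for ds^2 = (dm^2 + 2 ds^2)/s^2 of N(m, s^2)
    m, s, dm, ds = y
    return [dm, ds, 2*dm*ds/s, (ds**2 - dm**2/2)/s]

sol = solve_ivp(rhs, (0, 1.5), [0.0, 1.0, 1.0, 0.3], rtol=1e-10, atol=1e-12)
m, s, dm, ds = sol.y
u, du = m/np.sqrt(2), dm/np.sqrt(2)     # in u = m/sqrt(2) the metric is hyperbolic
print(np.ptp(u + s*ds/du))              # center c of (u-c)^2 + s^2 = r^2: constant
print(np.ptp((dm**2 + 2*ds**2)/s**2))   # conserved speed along the geodesic
```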
In each case we give the probability function p(x, θ) and the associated geodesic distance g(θ, φ) based on the Fisher information metric.

(1) Poisson distribution:

p(x, λ) = e^{−λ} λ^x / x!,  x = 0, 1, ...,  λ > 0,

g(λ₁, λ₂) = 2|√λ₁ − √λ₂|.

(2) Normal distribution N(μ, σ²), with metric

ds² = (dμ² + 2dσ²)/σ²,  (6.5)

g((μ₁, σ₁), (μ₂, σ₂)) = 2√2 tanh⁻¹ δ(1,2),  (6.6)

where

δ(1,2) = [((μ₁−μ₂)² + 2(σ₁−σ₂)²) / ((μ₁−μ₂)² + 2(σ₁+σ₂)²)]^{1/2}.
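For example (1), the distance follows by integrating the one-dimensional element ds = √(1/λ) dλ; a two-line check (added here):

```python
import numpy as np
from scipy.integrate import quad

l1, l2 = 1.0, 4.0
arc = quad(lambda lam: np.sqrt(1/lam), l1, l2)[0]  # arc length of ds^2 = d(lam)^2/lam
print(arc, 2*(np.sqrt(l2) - np.sqrt(l1)))          # both 2.0
```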
(3) Multinomial distribution:

p(n₁, ..., n_k; π₁, ..., π_k) = n!/(n₁! ··· n_k!) π₁^{n₁} ··· π_k^{n_k},  n fixed,

g(π, π′) = 2√n cos⁻¹(Σ_{i=1}^k √(π_i π′_i)).

The above computation was originally done by Rao (1945), but an easier method
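The multinomial distance can be checked for k = 2 and n = 1, where the Fisher arc length along the Bernoulli family must agree with the angle formula (a numerical illustration added here):

```python
import numpy as np
from scipy.integrate import quad

p, q = 0.2, 0.7
arc = quad(lambda t: 1/np.sqrt(t*(1 - t)), p, q)[0]           # Fisher arc length
angle = 2*np.arccos(np.sqrt(p*q) + np.sqrt((1 - p)*(1 - q)))  # distance formula
print(arc, angle)   # both ~1.0554
```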
(4) Product of independent normal distributions N(μ_k, σ_k²), k = 1, ..., n:

p(x₁, ..., x_n) = Π_{k=1}^n (2πσ_k²)^{−1/2} exp{−(x_k − μ_k)²/(2σ_k²)},

g² = 2 Σ_{k=1}^n [log((1 + δ_k(1,2))/(1 − δ_k(1,2)))]²,

where δ_k(1,2) is computed from (μ_{k1}, σ_{k1}) and (μ_{k2}, σ_{k2}) as in (6.6).
REFERENCES
Amari, S. I. (1982). Differential geometry of curved exponential families - curvatures and information loss. Ann. Statist. 10, 357-385.
Amari, S. I. (1983). A foundation of information geometry. Electronics and
Communications in Japan 66-A, 1-10.
Atkinson, C. and Mitchell, A. F. S. (1981). Rao's distance measure. Sankhyā 43, 345-365.
Burbea, J. (1986). Informative geometry of probability spaces. Expositiones Math. 4, 347-378.
Burbea, J. and Rao, C. Radhakrishna (1982a). Entropy differential metric,
distance and divergence measures in probability spaces: a unified
approach. J. Multivariate Anal. 12, 575-596.
Burbea, J. and Rao, C. Radhakrishna (1982b). Differential metrics in probability spaces. Probability Math. Statist. 3, 115-132.
Cencov, N. N. (1982). Statistical Decision Rules and Optimal Inference. Translations of Mathematical Monographs 53, Amer. Math. Soc., Providence.
Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 2, 299-318.
Nei, M. (1978). The theory of genetic distance and evolution of human races. Japan J. Human Genet. 23, 341-369.
Oller, J. M. and Cuadras, C. M. (1985). Rao's distance for negative multinomial distributions. Sankhyā 47, 75-83.
Rao, C. Radhakrishna (1945). Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81-91.