

Institute of Mathematical Statistics


LECTURE NOTES-MONOGRAPH SERIES
Shanti S. Gupta, Series Editor

Volume 10

Differential Geometry in
Statistical Inference
S.-l. Amari, O. E. Barndorff-Nielsen,
R. E. Kass, S. L. Lauritzen, and C. R. Rao

Institute of Mathematical Statistics


Hayward, California

Institute of Mathematical Statistics


Lecture Notes-Monograph Series
Series Editor, Shanti S. Gupta, Purdue University

The production of the IMS Lecture Notes-Monograph Series is


managed by the IMS Business Office: Nicholas P. Jewell, IMS
Treasurer, and Jose L. Gonzalez, IMS Business Manager.

Library of Congress Catalog Card Number: 87-82603


International Standard Book Number 0-940600-12-9
Copyright 1987 Institute of Mathematical Statistics
All rights reserved
Printed in the United States of America

TABLE OF CONTENTS

CHAPTER 1. Introduction
Robert E. Kass

CHAPTER 2. Differential Geometrical Theory of Statistics
Shun-ichi Amari    19

CHAPTER 3. Differential and Integral Geometry in Statistical Inference
O. E. Barndorff-Nielsen    95

CHAPTER 4. Statistical Manifolds
Steffen L. Lauritzen    163

CHAPTER 5. Differential Metrics in Probability Spaces
C. R. Rao    217

CHAPTER 1.

INTRODUCTION

Robert E. Kass*

Geometrical analyses of parametric inference problems have developed


from two appealing ideas: that a local measure of distance between members of a
family of distributions could be based on Fisher information, and that the
special place of exponential families in statistical theory could be understood
as being intimately connected with their loglinear structure. The first led
Jeffreys (1946) and Rao (1945) to introduce a Riemannian metric defined by
Fisher information, while the second led Efron (1975) to quantify departures
from exponentiality by defining the curvature of a statistical model. The
papers collected in this volume summarize subsequent research carried out by
Professors Amari, Barndorff-Nielsen, Lauritzen, and Rao together with their
coworkers, and by other authors as well, which has substantially extended both
the applicability of differential geometry and our understanding of the role it
plays in statistical theory.**
The most basic success of the geometrical method remains its concise
summary of information loss, Fisher's fundamental quantification of departure
from sufficiency, and information recovery, his justification for conditioning.
Fisher claimed, but never showed, that the MLE minimized the loss of information
among efficient estimators, and that successive portions of the loss could be

Department of Statistics, Carnegie-Mellon University, Pittsburgh, PA


**
These papers were presented at the NATO Advanced Workshop on Differential

Geometry in Statistical Inference at Imperial College, April, 1984.


recovered by conditioning on the second and higher derivatives of the loglikelihood function, evaluated at the MLE. Concerning information loss, recall that according to the Koopman-Darmois theorem, under regularity conditions, the families of continuous distributions with fixed support that admit finite-dimensional sufficient reductions of i.i.d. sequences are precisely the exponential families. It is thus intuitive that (for such regular families) departures from sufficiency, that is, information loss, should correspond to deviations from exponentiality. The remarkable reality is that the correspondence takes a beautifully simple form. The most transparent case, especially for the untrained eye, occurs for a one-parameter subfamily of a two-dimensional exponential family. There, the relative information loss, in Fisher's sense, from using a statistic T in place of the whole sample is

lim_{n→∞} i(θ)^{-1}[n i(θ) - i_T(θ)] = γ² + (1/2) β²        (1)

where n i(θ) is the Fisher information in the whole sample, i_T(θ) is the Fisher information calculated from the distribution of T, γ is the statistical curvature of the family, and β is the mixture curvature of the "ancillary family" associated with the estimator T. When the estimator T is the MLE, β vanishes; this substantiates Fisher's first claim.
In his 1975 paper, Efron derived the two-term expression for information loss (in his equation (10.25)), discussed the geometrical interpretation
of the first term, and noted that the second term is zero for the MLE. He
defined γ to be the curvature of the curve in the natural parameter space that
describes the subfamily, with the inner product defined by Fisher information
replacing the usual Euclidean inner product. The definition of β is exactly
analogous to that of γ, with the mean value parameter space used instead of the
natural parameter space, but Efron did not recognize this and so did not
identify the mixture curvature. He did stress the role of the ancillary family
associated with the estimator T (see his Remark 3 of Section 9 and his reply to
discussants, p. 1240), and he also noticed a special case of (1) (in his reply,
p. 1241). The final simplicity of the complete geometrical version of (1)


appeared in Amari's 1982 Annals paper. There it was derived in the multiparameter case; see equation (4.8) of Amari's paper in this volume.
Prior to Efron's paper, Rao (1961) had introduced definitions of
efficiency and second-order efficiency that were intended to classify estimators
just as Fisher's definitions did, but using more tractable expressions. This
led to the same measure of minimum information loss used by Fisher (corresponding to γ² in equation (1)). Rao (1962) computed the information loss in the
case of the multinomial distribution for several different methods of estimation.
Rao (1963) then went on to provide a decision-theoretic definition of second-order efficiency of an estimator T, measuring it according to the magnitude of
the second-order term in the asymptotic expansion of the bias-corrected version
of T. Efron's analysis clarified the relationship between Fisher's definition
and Rao's first definition. Efron then provided a decomposition of the second-order variance term in which the right-hand side of (1) appeared, together with
a parameterization-dependent third term. The extension to the multiparameter
case was derived by Madsen (1979) following the outline of Reeds (1975). It
appears here in Amari's paper as Theorem 3.4.
An analytically and conceptually important first step of Efron's
analysis was to begin by considering smooth subfamilies of regular exponential
families, which he called curved exponential families. Analytically, this made
possible rigorous derivations of results, and for this reason such families
were analyzed concurrently by Ghosh and Subramaniam (1974). Conceptually, it
allowed specification of the ancillary families associated with an estimator:
the ancillary family associated with T at t is the set of points y in the sample
space of the full exponential family - equivalently, the mean value parameter
space for the family - for which T(y) = t. The terminology and subsequent
detailed analysis is due to Amari but, as noted above, the importance of the
ancillary family, at once emphasized and obscured by Fisher, was apparent from
Efron's presentation.
The ancillary family is also important in understanding information


recovery, which is the reason Amari has chosen to use the modifier "ancillary."
In the discussion of Efron's paper, Pierce (1975) noted another interpretation
of statistical curvature: it furnishes the asymptotic standard deviation of
observed information. More precisely, it is the asymptotic standard deviation of the asymptotically ancillary statistic n^{-1/2} i(θ̂)^{-1}[I(θ̂) - n i(θ̂)], where n i(θ) is expected information and I(θ) is observed information; the one-parameter statement appears in Efron and Hinkley (1978), and the multiparameter
version is in Skovgaard (1985). When fitting a curved exponential family by the
method of maximum likelihood, this statistic becomes a normalized component of
the residual (in the direction normal to the model within the plane spanned by
the first two derivatives of the natural parameter for the full exponential
family). Furthermore, conditioning on this statistic recovers (in Fisher's
sense) the information lost by the MLE, at least approximately. When this
conditional distribution is used, the asymptotic variance of the MLE may be
estimated by the inverse of observed rather than expected information; in some
problems observed information is clearly superior.
This argument, sketched by Pierce and presented in more detail by
Efron and Hinkley, represented an attempt to make sense of some of Fisher's
remarks on conditioning.

In Section 4 of his paper in this volume, Amari

presents a comprehensive approach to information recovery as measured by Fisher


information. He begins by defining a statistic T to be asymptotically sufficient of order q when
n i(θ) - i_T(θ) = O(n^{-q+1})
and asymptotically ancillary of order q when
i_T(θ) = O(n^{-q}).
These definitions differ from some used by other authors, such as Cox (1980),
McCullagh (1984a), and Skovgaard (1985). They are, however, clearly in the
spirit of Fisher's apparent feeling that i_T(θ) is an appropriate measure of
information. To analyze Fisher's suggestion that higher derivatives of the
loglikelihood function could be used to create successive higher-order


approximate ancillary statistics, Amari defines an explicit sequence of


combinations of the derivatives: he takes successive components of the residual
in spaces spanned by the first p derivatives - of the natural parameter for the
ambient exponential family - but perpendicular to the space spanned by the first
p-1, then normalizes by higher-order curvatures. In Theorems 4.1 and 4.2
Amari achieves a complete decomposition of the information. He thereby makes
specific, justifies, and provides a geometrical interpretation for Fisher's
second claim.

In Amari's decomposition the p-th term is attributable to the p-th statistic in his sequence and has magnitude equal to the corresponding power of 1/n times the square of the p-th order curvature.

(Actually, Amari's treatment is more

general than the rough description here would imply since he allows for the use
of efficient estimators other than the MLE.)
As far as the basic issue of observed versus expected information is
concerned, Amari (1982b) used an Edgeworth expansion involving geometrically
interpretable terms (as in Amari and Kumon, 1983) to provide a general motivation for using the inverse of observed information as the estimate of the
conditional variance of the MLE. See Section 4.4 of the paper here. (In truth,
the result is not as strong as it may appear. When we have an approximation v* to a variance v satisfying v*(θ) = v(θ){1 + O(n^{-1})}, and we use it to estimate v(θ), we substitute v*(θ̂), where θ̂ is some estimator of θ, and then all we usually get is v(θ) = v*(θ̂){1 + O_p(n^{-1/2})}. For essentially this reason, observed information does not in general provide an approximation to the conditional variance of the MLE based on the underlying true value θ, having relative error O(n^{-1}), although it does do so whenever expected information is
constant, as it is for a location parameter. Similarly, as Skovgaard, 1985,
points out in his careful consideration of the role of observed information in
inference, when estimated cumulants are used in an Edgeworth expansion it loses
its higher-order approximation to the underlying density at the true value.
This practical limitation of asymptotics does not affect Bayesian inference, in
which observed information furnishes a better approximation to the posterior


variance than does expected information for all regular families.)


For curved exponential families, then, the results summarized in the
first few sections of Amari's paper provide a thorough geometrical interpretation of the Fisherian concepts of information loss and recovery and also Rao's
concept of second-order efficiency.

In addition, in section 3.4 Amari discusses

the geometry of testing, as had Efron, providing comparisons of several commonly used test procedures with the locally most powerful test. Curved exponential
families were introduced, however, for their mathematical and conceptual
simplicity rather than their applicability. To extend his one-parameter
results, Efron, in his 1975 paper, did two things: he noted that any smooth
family could be locally approximated by a curved exponential family, and he
provided an explicit formula for statistical curvature in the general case.
In Section 5 of his paper, Amari shows how results established for curved
exponential families may be extended by constructing an appropriate Hilbert
bundle, about which I will say a bit more below. With the Hilbert bundle,
Amari provides a geometrical foundation, and generalization, for Efron's suggestion. From it, necessary formulas can be derived.
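For orientation, Efron's one-parameter formula may be written as follows (the moment notation of this restatement is not that of the original papers): with ℓ̇ and ℓ̈ the first two derivatives of the per-observation loglikelihood, and ν₂₀ = Var_θ(ℓ̇), ν₁₁ = Cov_θ(ℓ̇, ℓ̈), ν₀₂ = Var_θ(ℓ̈), the statistical curvature is

γ_θ² = (ν₂₀ ν₀₂ - ν₁₁²) / ν₂₀³,

which reduces to the curvature of the imbedded curve described above when the family is a curved exponential family.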
One reason that the role of the mixture curvature in (1) and in the
variance decomposition went unnoticed in Efron's paper was that he had not
made the underlying geometrical structure explicit: to calculate statistical
curvature at a given value θ₀ of a single parameter θ in a curved exponential
family, Efron used the natural parameter space with the inner product defined
by Fisher information at the natural parameter point corresponding to θ₀. In
order to calculate the curvature at a new point θ₁, another copy of the natural
parameter space with a different inner product (namely, that defined by Fisher
information at the natural parameter point corresponding to θ₁) would have to be
used. The appropriate gluing together of these spaces into a single structure
involves three basic elements: a manifold, a Riemannian metric, and an affine
connection. Riemannian geometry involves the study of geometry determined by
the metric and its uniquely associated Riemannian connection.

In his discussion


to Efron's paper, Dawid (1975) pointed out that Efron had used the Riemannian
metric defined by Fisher information, but that he had effectively used a non-Riemannian affine connection, now called the exponential connection, in calculating statistical curvature. Although Dawid did not identify the role of the
mixture curvature in (1), he did draw attention to the mixture connection as an
alternative to the exponential connection.

(Geodesics with respect to the exponential connection form exponential families, while geodesics with respect


to the mixture connection form families of mixtures; thus, the terminology.)
Amari, who had much earlier researched the Riemannian geometry of Fisher information, picked up on Dawid's observation, specified the framework, and provided
the results outlined above.
The manifold with the associated linear spaces is structured in what
is usually called a tangent bundle, the elements of the linear spaces being
tangent vectors. For curved exponential families, the linear spaces are finite-dimensional, but to analyze general families this does not suffice so Amari
uses Hilbert spaces. When these are appropriately glued together, the result
is a Hilbert bundle. The idea stems from Dawid's remark that the tangent
vectors can be identified with score functions, and these in turn are functions
having zero expectation. As his Hilbert space at a distribution P, Amari takes
the subspace of the usual L₂(P) Hilbert space consisting of functions that have
zero expectation with respect to P. This clearly furnishes the extension of
the information metric, and has been used by other authors as well, e.g.,
Beran (1977). Amari then defines the exponential and mixture connections and
notes that these make the Hilbert bundle flat, and that the inherited connections on the usual tangent bundles agree with those already defined there. He
then decomposes each Hilbert space into tangential and normal components,
which is exactly what is needed to define statistical curvature in the general
setting. Amari goes on to construct an "exponential bundle" by associating
with each distribution a finite-dimensional linear space containing vectors
defined by higher derivatives of the loglikelihood function, and using structure


inherited from the Hilbert bundle. With this he obtains a satisfactory version
of the local approximation by a curved exponential family that Efron had
suggested.
This pretty construction allows results derived for curved exponential families to be extended to more general regular families, yet it is not
quite the all-encompassing structure one might hope for: the underlying
manifold is still a particular parametric family of densities rather than the
collection of all possible densities on the given sample space. Constructions
for the latter have so far proved too difficult.
In his Annals paper, Amari also noted an interesting relationship
between the exponential and mixture connections: they are, in a sense he
defined, mutually dual. Furthermore, a one-parameter family of connections,
which Amari called the α-connections, may be defined in such a way that for each α the α-connection and the −α-connection are mutually dual, while α = 1 and α = −1 correspond to the exponential and mixture connections. See Amari's Theorem 2.1.
This family coincides with that introduced by Centsov (1971) for multinomial
distributions. When the family of densities on which these connections are
defined is an exponential family, the space is flat with respect to the exponential and mixture connections, and the natural parametrization and mean-value
parameterization play special roles: they become affine coordinate systems for
the two respective connections and are related by a Legendre transformation.
The duality in this case can incorporate the convex duality theory of exponential families (see Barndorff-Nielsen, 1978, and also Section 2 of his paper in
this volume).

In Theorem 2.2 Amari points out that such a pair of coordinate

systems exists whenever a space is flat with respect to an α-connection (with α ≠ 0). For such spaces, Amari defines α-divergence, a quasi-distance between
two members of the family based on the relationship provided by the Legendre
transformation.

In Theorem 2.4 he shows that the element of a curved exponential

family that minimizes the α-divergence from a point in the exponential family
parameter space may be found by following the α-geodesic that contains the


given point and is perpendicular to the curved family. This generates a new
class of minimum α-divergence estimators, the MLE being the minimum −1-divergence estimator, an interpretation also discussed by Efron (1978).
As applications of his general methods based on α-connections on Hilbert bundles, Amari treats the problems of combining independent samples (at the end of section 5), making inferences when the number of nuisance parameters increases with the sample size (in section 6), and performing spectral estimation in Gaussian time series (in section 7).
As soon as the α-connections are constructed a mathematical question arises. On one hand, the α-connections may be considered objects of differential geometry without special reference to their statistical origin. On the other hand, they are not at all arbitrary. They are the simplest one-parameter family of connections based on the first three moments of the score function. What is it about their special form that leads to the many special properties of α-connections (outlined by Amari in Section 2)?
Lauritzen has posed this question and has provided a substantial
part of the answer. Given any Riemannian manifold M with metric g there is a unique Riemannian connection ∇. Given a covariant 3-tensor D that is symmetric in its first two arguments and a nonzero number c, a new (symmetric) connection ∇̃ is defined by

∇̃ = ∇ + c D,        (2)

which means that given vector fields X and Y,

∇̃_X Y = ∇_X Y + c D(X,Y),

where

g(D(X,Y), Z) = D(X,Y,Z)

for all vector fields Z. Now, when M is a family of densities and g and D are defined, in terms of an arbitrary parameterization θ, as

g_ij = E_θ[∂_i ℓ ∂_j ℓ],    D_ijk = E_θ[∂_i ℓ ∂_j ℓ ∂_k ℓ],

where ℓ is the loglikelihood function, and if c = -α/2, then (2) defines the α-connection.
In this statistical case, D is not only symmetric in its first two
arguments, as it must be in (2), it is symmetric in all three. Lauritzen
therefore defines an abstract statistical manifold to be a triple (M,g,D) in
which M is a smooth m-dimensional manifold, g is a Riemannian metric, and D is
a completely symmetric covariant 3-tensor. With this additional symmetry
constraint alone, he then proceeds to establish a large number of basic properties, especially those relating to the duality structure Amari described. The
treatment is "fully geometrical11 or "coordinate-free." This is aesthetically
appealing, especially to those who learned linear models in the coordinate-free
setting. Lauritzen's primary purpose is to show that the appropriate mathematical object of study is one that is not part of the standard differential
geometry, but does have many special features arising from an apparently simple
structure. He not only presents the abstract generalities about α-connections
on statistical manifolds, he also examines five examples in full detail. The
first is the univariate Gaussian model, the second is the inverse Gaussian
model, the third is the two-parameter gamma model, and the last two are
specially constructed models that display interesting possibilities of the nonstandard geometries of -connections. In particular, the latter two statistical
manifolds are not what Lauritzen calls "conjugate symmetric" and so the
sectional curvatures do not determine the Riemann tensor (as they do in
Riemannian geometry). He also discusses the construction of geodesic foliations, which are decompositions of the manifold and are important because they
generate potentially interesting decompositions of the sample space. At the
end of his paper, Lauritzen calls attention to several outstanding problems.
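To make the ingredients (M,g,D) concrete, consider the first of these examples, the univariate Gaussian model with coordinates (μ,σ); a direct computation from the definitions of g and D given above (a restatement, not a formula quoted from Lauritzen's paper) gives

g_μμ = 1/σ²,  g_σσ = 2/σ²,  g_μσ = 0,
D_μμσ = 2/σ³,  D_σσσ = 8/σ³,  D_μμμ = D_μσσ = 0,

so that D is indeed completely symmetric, and its only nonzero components carry an even number of μ-indices, reflecting the symmetry of the model in μ.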
Amari's α-connections, based on the first three moments of the
score function, do not furnish the only examples of statistical manifolds. In
his contribution to this volume, Barndorff-Nielsen presents another class of
examples based instead on certain "observed" rather than expected derivatives


of the loglikelihood.
Although the idea of using observed derivatives might occur to
any casual listener on being told of Amari's use of expectations, it is not
obvious how to implement it. First of all, in order to define an observed
information Riemannian metric, one needs a definition of observed information
at each point of the parameter space. Apparently one would want to treat each θ as if it were an MLE and then use I(θ). However, I(θ) depends on the whole sample y rather than on θ alone, so this scheme does not yet provide an explicit definition. Barndorff-Nielsen's solution is natural in the context of his research on conditionality: he replaces the sample y with a sufficient pair (θ̂, a), where a is the observed value of an asymptotically ancillary statistic A. This is always possible for curved exponential families, and in more general models A could at least be taken so that (θ̂, A) is asymptotically sufficient. With this replacement, the second component may be held fixed at A = a while θ̂ varies. Writing I(θ̂) = I_{θ̂,a}(θ̂) thus allows the definition I(θ) = I_{θ,a}(θ) to be made at each point θ in the parameter space. Using this definition of
the Riemannian metric, Barndorff-Nielsen derives the coefficients that determine the Riemannian connection. From the transformation properties of tensors,
he then succeeds in finding an analogue of the exponential connection based on
a certain mixed third derivative of the loglikelihood function (two derivatives
being taken with respect to θ as parameter, one with respect to θ̂ as MLE). In
so doing, he defines the tensor D in the statistical manifold and thus arrives
at his "observed conditional geometry."
Barndorff-Nielsen's interest in this geometry lies not with
analogues of statistical curvature and other expected-geometry constructs, but
rather with an alternative derivation, interpretation, and extension of an
approximation to the conditional density of the MLE, which had been obtained
earlier (in Barndorff-Nielsen and Cox, 1979).

In several papers, Barndorff-

Nielsen (1980, 1983) has discussed generalizations and approximate versions of


Fisher's fundamental density-likelihood formula for location models

p(θ̂ | a; θ) = c L(θ)/L(θ̂),        (3)

where θ̂ is the MLE, a is an ancillary statistic, p is the conditional density of the MLE, and L is the likelihood function.

(This is discussed in Efron and

Hinkley, 1978; Fisher actually treated the location-scale case.) The formula
is of great importance both practically, since it provides a way of computing
the conditional density, and philosophically, since it entails the formal
agreement of conditional inference and Bayesian inference using an invariant
prior. Inspection of the derivation indicates that the formula results from
the transformational nature of the location problem, and Barndorff-Nielsen has
shown that a version of it (with an additional factor for the volume element)
holds for very general transformation models. He has also shown that for non-transformation models, a version of the right-hand side of (3), while not exactly equal to the left-hand side, remains a good asymptotic approximation for it. (See also Hinkley, 1980, and McCullagh, 1984a.) In his paper in this

volume, Barndorff-Nielsen reviews these results, shows how the various observed
conditional geometrical quantities are calculated, and then derives his desired
expansion (of a version of the right-hand side of (3)) in terms of the geometrical quantities that correspond to those used by Amari in his expected
geometry expansions. Barndorff-Nielsen devotes substantial attention to transformation models, which may be treated within his framework of observed
conditional geometry.

In this context, the models become Lie groups, for which

there is a rich mathematical theory.
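For orientation, the approximate version referred to above is commonly written as

p*(θ̂ | a; θ) = c |ĵ|^{1/2} L(θ)/L(θ̂),

where ĵ is the observed information evaluated at the MLE and c is a norming constant; in the location case ĵ depends on the ancillary a alone, and (3) is recovered. This is only a thumbnail statement; the precise formulation and the order of the approximation are given in Barndorff-Nielsen's paper in this volume.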


In the fourth paper in this volume, Professor Rao returns to the
characterization of the information metric that originally led him (and also
Jeffreys) to introduce it: it is an infinitesimal measure of divergence based
on what is now called Shannon entropy. Rao considers here a more general class
of divergence measures, which he has found useful in the study of genetic
diversity, leading to a wide variety of metrics. He derives the quadratic and
cubic terms in Taylor series expansions of these measures and shows how, in the
case of Shannon entropy, the cubic term is related to the α-connections.


The papers here collectively show that geometrical structures of


statistical models can provide both conceptual simplifications and new methods
of analysis for problems of statistical inference. There is interesting
mathematics involved, but does the interesting mathematics lead to interesting
statistics?
The question arises because geometry has provided new techniques,
and its formalism produces convenient summaries for complicated multivariate
expressions in asymptotic expansions (as in Amari and Kumon, 1983, and
McCullagh, 1984b), but it has not yet created new methodology with clearly
important practical applications. Thus, it is already apparent from (1) that
there exists a wide class of estimators that minimize information loss (and are
second-order efficient): it consists of those having zero mixture curvature
for their associated ancillary families. It is interesting that the MLE is only
one member of this class, and it is nice to have Eguchi's (1983) derivation that
certain minimum contrast estimators are other members, but it seems unlikely - though admittedly possible - that any competitor will replace maximum likelihood
estimation as the primary method of choice in practice. Similarly, there is
not yet any reason to think that alternative minimum α-divergence estimators or
their observed conditional geometry counterparts will be considered superior to
the MLE.
On the other hand, as I indicated at the outset, geometry does
give a definitive description of information loss and recovery. Since Fisher
remains our wisest yet most enigmatic sage, it is worth our while to try to
understand his pronouncements.

Together with the triumvirate of consistency, sufficiency, and efficiency, information loss and recovery form the core of Fisher's theory of estimation. On the basis of the geometrical results, it is fair to say that we now know what Fisher was talking about, and that what he said was true. Here, as in other problems (such as inference with nuisance parameters, discussed in Amari's section 5, or in nonlinear regression, e.g., Bates and Watts, 1980, Cook and Tsai, 1985, Kass, 1984, McCullagh and Cox, 1986), the geometrical formulation tends to shift the burden of derivation of results away from proofs, toward definitions. Thus, once the statement of a proposition is understood, its truth is easier to see and in this there is great simplification. One could make this argument about much abstract mathematical development, but it is particularly appropriate here.

** Since Rao's work on second order efficiency arose in an attempt to understand Fisher's computation of information loss in estimation, it might appear that Efron's investigation also began as an attempt to understand Fisher. He has informed me, however, that he set out to define the curvature of a statistical model and came later to its use in information loss and second-order efficiency.
Furthermore, there are reasons to think that future work in this
area could lead to useful results that would otherwise be difficult to obtain.
One important problem that structural research might solve is that of determining useful conditions under which a particular root of the likelihood equation
will actually maximize the likelihood. Global results on foliations might be
very helpful, as might be formulas relating computable characteristics of
statistical manifolds to the behavior of geodesics. The results in these papers
could turn out to play a central role in the solution of this or some other
practical problem of statistical theory. We will have to wait and see. Until
then, readers may enjoy the papers as informative excursions into an intriguing
realm of mathematical statistics.
Acknowledgements
I thank O. E. Barndorff-Nielsen, D. R. Cox, and C. R. Rao for their
comments on an earlier draft. This paper was prepared with support from the
National Science Foundation under Grant No. NSF/DMS - 8503019.

REFERENCES

Amari, S. (1982a). Differential geometry of curved exponential families - curvatures and information loss. Ann. Statist. 10, 357-387.

Amari, S. (1982b). Geometrical theory of asymptotic ancillarity and conditional inference. Biometrika 69, 1-17.

Amari, S. and Kumon, M. (1983). Differential geometry of Edgeworth expansions in curved exponential family. Ann. Inst. Statist. Math. 35A, 1-24.

Barndorff-Nielsen, O. E. (1978). Information and Exponential Families. New York: Wiley.

Barndorff-Nielsen, O. E. (1980). Conditionality resolutions. Biometrika 67, 293-310.

Barndorff-Nielsen, O. E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika 70, 343-365.

Barndorff-Nielsen, O. E. and Cox, D. R. (1979). Edgeworth and saddlepoint approximations with statistical applications (with discussion). J. R. Statist. Soc. B41, 279-312.

Bates, D. M. and Watts, D. G. (1980). Relative curvature measures of nonlinearity. J. R. Statist. Soc. B42, 1-25.

Beran, R. (1977). Minimum Hellinger distance estimates for parametric models. Ann. Statist. 5, 445-463.

Centsov, N. N. (1971). Statistical Decision Rules and Optimal Inference (in Russian). Translated into English (1982), AMS, Rhode Island.

Cook, R. D. and Tsai, C.-L. (1985). Residuals in nonlinear regression. Biometrika 72, 23-29.

Cox, D. R. (1980). Local ancillarity. Biometrika 67, 279-286.

Dawid, A. P. (1975). Discussion to Efron's paper. Ann. Statist. 3, 1231-1234.

Efron, B. (1975). Defining the curvature of a statistical problem (with applications to second-order efficiency) (with discussion). Ann. Statist. 3, 1189-1242.

Efron, B. (1978). The geometry of exponential families. Ann. Statist. 6, 362-376.

Efron, B. and Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information (with discussion). Biometrika 65, 457-487.

Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Statist. 11, 793-803.

Fisher, R. A. (1925). Theory of statistical estimation. Proc. Camb. Phil. Soc. 22, 700-725.

Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proc. R. Soc. A144, 285-307.

Ghosh, J. K. and Subramaniam, K. (1974). Second order efficiency of maximum likelihood estimators. Sankya 36A, 325-358.

Hinkley, D. V. (1980). Likelihood as approximate pivotal distribution. Biometrika 67, 287-292.

Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. A186, 453-461.

Kass, R. E. (1984). Canonical parameterizations and zero parameter-effects curvature. J. Roy. Statist. Soc. B46, 86-92.

Madsen, L. T. (1979). The geometry of statistical model - a generalization of curvature. Res. Report 79-1, Statist. Res. Unit, Danish Medical Res. Council.

McCullagh, P. (1984a). On local sufficiency. Biometrika 71, 233-244.

McCullagh, P. (1984b). Tensor notation and cumulants of polynomials. Biometrika 71, 461-476.

McCullagh, P. and Cox, D. R. (1986). Invariants and likelihood ratio statistics. Ann. Statist. 14, 1419-1430.

Pierce, D. A. (1975). Discussion to Efron's paper. Ann. Statist. 3, 1219-1221.

Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81-89.

Rao, C. R. (1961). Asymptotic efficiency and limiting information. Proc. Fourth Berkeley Symp. Math. Statist. Prob. (J. Neyman, ed.) 1, 531-545.

Rao, C. R. (1962). Efficient estimates and optimum inference procedures in large samples (with discussion). J. Roy. Statist. Soc. B24, 46-72.

Rao, C. R. (1963). Criteria of estimation in large samples. Sankya 25, 189-206.

Reeds, J. (1975). Discussion to Efron's paper. Ann. Statist. 3, 1234-1238.

Skovgaard, I. (1985). A second-order investigation of asymptotic ancillarity. Ann. Statist. 13, 534-551.

DIFFERENTIAL GEOMETRICAL THEORY OF STATISTICS


Shun-ichi Amari*

1. Introduction    21
2. Geometrical Structure of Statistical Models    25
3. Higher-Order Asymptotic Theory of Statistical Inference in Curved Exponential Family    38
4. Information, Sufficiency and Ancillarity Higher Order Theory    52
5. Fibre-Bundle Theory of Statistical Models    59
6. Estimation of Structural Parameter in the Presence of Infinitely Many Nuisance Parameters    73
7. Parametric Models of Stationary Gaussian Time Series    83
8. References    91

Department of Mathematical Engineering and Instrumentation Physics, University of Tokyo, Tokyo, JAPAN

1. INTRODUCTION

Statistics is a science which studies methods of inference, from


observed data, concerning the probabilistic structure underlying such data.
The class of all the possible probability distributions is usually too wide to
consider all its elements as candidates for the true probability distribution
from which the data were derived. Statisticians often assume a statistical
model which is a subset of the set of all the possible probability distributions, and evaluate procedures of statistical inference assuming that the model
is faithful, i.e., it includes the true distribution.

It should, however, be

remarked that a model is not necessarily faithful but is approximately so. In


either case, it should be very important to know the shape of a statistical
model in the whole set of probability distributions. This is the geometry of a
statistical model. A statistical model often forms a geometrical manifold, so
that the geometry of manifolds should play an important role. Considering that
properties of specific types of probability distributions, for example, of
Gaussian distributions, of Wiener processes, and so on, have so far been studied
in detail, it seems rather strange that only a few theories have been proposed
concerning properties of a family itself of distributions. Here, by the properties of a family we mean such geometric relations as mutual distances, flatness
or curvature of the family, etc. Obviously it is not a trivial task to define
such geometric structures in a natural, useful and invariant manner.
Only local properties of a statistical model are responsible for the
asymptotic theory of statistical inference. Local properties are represented
by the geometry of the tangent spaces of the manifold. The tangent space has a

natural Riemannian metric given by the Fisher information matrix in the regular
case. It represents only a local property of the model, because the tangent
space is nothing but local linearization of the model manifold.

In order to

obtain larger-scale properties, one needs to define mutual relations of the two
different tangent spaces at two neighboring points in the model. This can be
done by defining a one-to-one affine correspondence between two tangent spaces,
which is called an affine connection in differential geometry. By an affine
connection, one can consider local properties around each point beyond the
linear approximation. The curvature of a model can be obtained by the use of
this connection.

It is clear that such a differential-geometrical concept pro-

vides a tool convenient for studying higher-order asymptotic properties of


inference. However, by connecting local tangent spaces further, one can obtain
global relations. Hence, the validity of the differential-geometrical method is
not limited within the framework of asymptotic theory.
It was Rao (1945) who first pointed out the importance of the
differential-geometrical approach. He introduced the Riemannian metric by using
the Fisher information matrix. Although a number of researches have been
carried out along this Riemannian line (see, e.g., Amari (1968), Atkinson and
Mitchell (1981), Dawid (1977), James (1973), Kass (1980), Skovgaard (1984),
Yoshizawa (1971), etc.), they did not have a large impact on statistics. Some
additional concepts are necessary to improve its usefulness. A new idea was
developed by Chentsov (1972) in his Russian book (and in some papers prior to
the book). He introduced a family of affine connections and proved their uniqueness from the point of view of categorical invariance. Although his theory was
deep and fundamental, he did not discuss the curvature of a statistical model.
Efron (1975, 1978), independently of Chentsov's work, provided a new idea by
pointing out that the statistical curvature plays an important role in higher-order properties of statistical inference. Dawid (1975) pointed out further
possibilities. Efron's idea was generalized by Madsen (1979) (see also Reeds
(1975)). Amari (1980, 1982a) constructed a differential-geometrical method in


statistics by introducing a family of affine connections, which however turned


out to be equivalent to Chentsov's. He further defined α-curvatures, and pointed out the fundamental roles of the exponential and mixture curvatures played in statistical inference. The theory has been developed further by a number of papers (Amari (1982b, 1983a, b), Amari and Kumon (1983), Kumon and Amari (1983, 1984, 1985), Nagaoka and Amari (1982), Eguchi (1983), Kass (1984)).

The new

developments were also shown in the NATO Research Workshop on Differential Geometry in Statistical Inference (see Barndorff-Nielsen (1985) and Lauritzen
(1985)).

They together seem to prove the usefulness of differential geometry as

a fundamental method in statistics.

(See also Csiszar (1975), Burbea and Rao

(1982), Pfanzagl (1982), Beale (1960), Bates and Watts (1980), etc., for other
geometrical work.)
The present article gives not only a compact review of various
achievements up to now by the differential geometrical method, most of which have already been published in various journals and in Amari (1985), but also a preview of new results and half-baked ideas in new directions, most of which have
not yet been published.

Chapter 2 provides an introduction to the geometrical

method, and elucidates fundamental geometrical properties of statistical manifolds. Chapter 3 is devoted to the higher-order asymptotic theory of statistical inference, summarizing higher-order characteristics of various estimators
and tests in geometrical terms. Chapter 4 discusses a higher-order theory of
asymptotic sufficiency and ancillarity from the Fisher information point of
view. Refer to Amari (1985) for more detailed explanations in these chapters;
Lauritzen (1985) gives a good introduction to modern differential geometry. The
remaining Chapters 5, 6, and 7 treat new ideas and developments which are just
under construction.

In Chapter 5 is introduced a fibre bundle approach, which

is necessary in order to study properties of statistical inference in a general


statistical model other than a curved exponential family. A Hilbert bundle and
a jet bundle are treated in a geometrical framework of statistical inference.
Chapter 6 gives a summary of a theory of estimation of a structural parameter


in the presence of nuisance parameters whose number increases in proportion to


the number of observations. Here, the Hilbert bundle theory plays an essential
role. Chapter 7 elucidates geometrical structures of parametric and non-parametric models of stationary Gaussian time series. The present approach is useful not only for constructing a higher-order theory of statistical inference on
time series models, but also for constructing differential geometrical theory of
systems and information theory (Amari, 1983 c). These three chapters are
original and only sketches are given in the present paper. More detailed theoretical treatments and their applications will appear as separate papers in the
near future.

2. GEOMETRICAL STRUCTURE OF STATISTICAL MODELS

2.1 Metric and α-connection


Let S = {p(x,)} be a statistical model consisting of probability
density functions p(x,) of random variable xX with respect to a measure P on
X such that eyery distribution is uniquely parametrized by an n-dimensional
vector parameter = ( 1 ) = ( ,..., n ).

Since the set p(x)} of all the den-

sity functions on X is a subset of the L, space of functions in x, S is considered to be a subset of the L . space. A statistical model S is said to be geometrically regular, when it satisfies the following regularity conditions
A-. - Ag, and S is regarded as an n-dimensional manifold with a coordinate system
.

A₁. The domain Θ of the parameter θ is homeomorphic to an n-dimensional Euclidean space Rⁿ.

A₂. The topology of S induced from Rⁿ is compatible with the relative topology of S in the L₁ space.

A₃. The support of p(x,θ) is common for all θ, so that the p(x,θ) are mutually absolutely continuous.

A₄. Every density function p(x,θ) is a smooth function in θ uniformly in x, and the partial derivative ∂/∂θ^i and integration of log p(x,θ) with respect to the measure P(x) are always commutative.

A₅. The moments of the score function (∂/∂θ^i) log p(x,θ) exist up to the third order and are smooth in θ.

A₆. The Fisher information matrix is positive definite.

Figure 1

Condition A₁ implies that S itself is homeomorphic to Rⁿ. It is possible to weaken Condition A₁. However, only local properties are treated here so that we assume it for the sake of simplicity.

In a later section, we

assume one more condition which guarantees the validity of Edgeworth expansions.
Let us denote by ∂_i = ∂/∂θ^i the tangent vector e_i of the i-th coordinate curve θ^i (Fig. 1) at point θ. Then, n such tangent vectors e_i = ∂_i, i = 1,..., n, span the tangent space T_θ at point θ of the manifold S. Any tangent vector A ∈ T_θ is a linear combination of the basis vectors ∂_i,

A = A^i ∂_i,

where the A^i are the components of vector A and Einstein's summation convention is assumed throughout the paper, so that the summation is automatically taken for those indices which appear twice in one term, once as a subscript and once as a superscript. The tangent space T_θ is a linearized version of a small neighborhood at θ of S, and an infinitesimal vector dθ = dθ^i ∂_i denotes the vector connecting two neighboring points θ and θ + dθ, or two neighboring distributions p(x,θ) and p(x,θ + dθ).
Let us introduce a metric in the tangent space T_θ. It can be done by defining the inner product g_ij(θ) = ⟨∂_i, ∂_j⟩ of two basis vectors ∂_i and ∂_j at θ. To this end, we represent a vector ∂_i ∈ T_θ by a function ∂_i ℓ(x,θ) in x, where ℓ(x,θ) = log p(x,θ) and ∂_i (in ∂_i ℓ) is the partial derivative ∂/∂θ^i. Then, it is natural to define the inner product by

g_ij(θ) = ⟨∂_i, ∂_j⟩ = E_θ[∂_i ℓ(x,θ) ∂_j ℓ(x,θ)],        (2.1)


where E_θ denotes the expectation with respect to p(x,θ). This g_ij is the Fisher information matrix. Two vectors A and B are orthogonal when

⟨A,B⟩ = ⟨A^i ∂_i, B^j ∂_j⟩ = A^i B^j g_ij = 0.
It is sometimes necessary to compare a vector A ∈ T_θ of the tangent space T_θ at one point θ with a vector B ∈ T_θ', belonging to the tangent space T_θ' at another point θ'. This can be done by comparing the basis vectors ∂_i at T_θ with the basis vectors ∂'_i at T_θ'. Since T_θ and T_θ' are two different vector spaces, the two vectors ∂_i and ∂'_i are not directly comparable, and we need some way of identifying T_θ with T_θ' in order to compare the vectors in them. This can be accomplished by introducing an affine connection, which maps a tangent space T_{θ+dθ} at θ + dθ to the tangent space T_θ at θ. The mapping should reduce to the identity map as dθ → 0. Let m(∂'_j) be the image of ∂'_j ∈ T_{θ+dθ} mapped to T_θ. It is slightly different from ∂_j ∈ T_θ. The vector

∇_{∂_i} ∂_j = lim_{dθ→0} [m(∂'_j) - ∂_j] / dθ^i ∈ T_θ

represents the rate at which the j-th basis vector ∂_j ∈ T_θ "intrinsically" changes as the point moves from θ to θ + dθ (Fig. 2) in the direction ∂_i. We call ∇_{∂_i} ∂_j the covariant derivative of the basis vector ∂_j in the direction ∂_i. Since it is a vector of T_θ, its components are given by

Γ_{ijk} = ⟨∇_{∂_i} ∂_j, ∂_k⟩,        (2.2)

Figure 2


and

∇_{∂_i} ∂_j = Γ_{ij}^k ∂_k,

where Γ_{ij}^k = Γ_{ijl} g^{lk}. We call Γ_{ijk} the components of the affine connection. An affine connection is specified by defining ∇_{∂_i} ∂_j or Γ_{ijk}.

Let A(θ) be a vector field, which assigns to every point θ ∈ S a vector A(θ) = A^i(θ) ∂_i ∈ T_θ. The intrinsic change of the vector A(θ) as the position θ moves is now given by the covariant derivative in the direction ∂_i of A(θ) = A^j(θ) ∂_j, defined by

∇_{∂_i} A = (∂_i A^k + A^j Γ_{ij}^k) ∂_k,

in which the change in the basis vectors as well as that in the components A^j(θ) is taken into account. The covariant derivative in the direction B = B^i ∂_i is given by

∇_B A = B^i ∇_{∂_i} A.
We have defined the covariant derivative by the use of the basis vectors ∂_i which are associated with the coordinate system or the parametrization θ. However, the covariant derivative ∇_B A is invariant under any parametrization, giving the same result in any coordinate system. This yields the transformation law for the components Γ_{ijk} of a connection. When another coordinate system (parametrization) θ' = θ'(θ) is used, the basis vectors change from {∂_i} to {∂'_{i'}}, where

∂'_{i'} = B_{i'}^i ∂_i

and B_{i'}^i = ∂θ^i/∂θ'^{i'} is the inverse matrix of the Jacobian matrix of the coordinate transformation. Since the components Γ'_{i'j'k'} of the connection are written as

Γ'_{i'j'k'} = ⟨∇_{∂'_{i'}} ∂'_{j'}, ∂'_{k'}⟩

in this new coordinate system, we easily have the transformation law

Γ'_{i'j'k'} = B_{i'}^i B_{j'}^j B_{k'}^k Γ_{ijk} + B_{i'}^i B_{k'}^k (∂_i B_{j'}^j) g_{jk}.

We introduce the α-connection, where α is a real parameter, in the statistical manifold S by the formula

Γ^{(α)}_{ijk}(θ) = E_θ[( ∂_i ∂_j ℓ(x,θ) + ((1-α)/2) ∂_i ℓ(x,θ) ∂_j ℓ(x,θ) ) ∂_k ℓ(x,θ)].        (2.3)

It is easily checked that the connection defined by (2.3) satisfies the transformation law. In particular, the 1-connection is called the exponential connection, and the -1-connection is called the mixture connection.
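In particular, putting α = 1 and α = -1 in (2.3) gives

Γ^{(1)}_{ijk}(θ) = E_θ[∂_i ∂_j ℓ ∂_k ℓ],    Γ^{(-1)}_{ijk}(θ) = E_θ[(∂_i ∂_j ℓ + ∂_i ℓ ∂_j ℓ) ∂_k ℓ],

and the difference of any two α-connections is proportional to the completely symmetric tensor T_{ijk} = E_θ[∂_i ℓ ∂_j ℓ ∂_k ℓ], since Γ^{(α)}_{ijk} = Γ^{(1)}_{ijk} + ((1-α)/2) T_{ijk}.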
2.2 Imbedding and α-curvature

Let us consider an m-dimensional regular statistical model M = {q(x,u)}, which is imbedded in S = {p(x,θ)} by

q(x,u) = p{x,θ(u)}.

Here, u = (u^a) = (u^1,...,u^m) is a vector parameter specifying distributions of M, and defines a coordinate system of M. We assume that θ = θ(u) is smooth and its Jacobian matrix has a full rank. Moreover, it is assumed that M forms an m-dimensional submanifold in S. We identify a point u ∈ M with the point θ = θ(u) imbedded in S. The tangent space T_u(M) at u of M is spanned by m vectors ∂_a, a = 1,..., m, where ∂_a = ∂/∂u^a denotes the tangent vector of the coordinate curve u^a in M. The basis ∂_a can be represented by a function ∂_a ℓ(x,u) in x as before, where ℓ(x,u) = log q(x,u). Since M is imbedded in S, the tangent space T_u(M) of M is regarded as a subspace of the tangent space T_{θ(u)}(S) of S at θ = θ(u). The basis vector ∂_a ∈ T_u(M) is written as a linear combination of the ∂_i,

∂_a = B_a^i(u) ∂_i,

where B_a^i = ∂θ^i(u)/∂u^a. This can be understood from the relation

∂_a ℓ(x,u) = B_a^i(u) ∂_i ℓ(x,θ).

Hence, the tangential directions of M at u are represented by the m vectors ∂_a (a = 1,...,m) or B_a = (B_a^i) in the component form with respect to the basis ∂_i of T_{θ(u)}(S).

It is convenient to define n - m vectors ∂_κ, κ = m + 1,...,n, in T_{θ(u)}(S) such that the n vectors {∂_a, ∂_κ}, a = 1,...,m; κ = m + 1,...,n, together form a basis of T_{θ(u)}(S) and moreover the ∂_κ's are orthogonal to the ∂_a's (Fig. 3),

g_{aκ}(u) = ⟨∂_a, ∂_κ⟩ = 0.

The vectors ∂_κ span the orthogonal complement of T_u(M) in T_{θ(u)}(S). We denote the components of ∂_κ with respect to the basis ∂_i by ∂_κ = B_κ^i(u) ∂_i. The inner products of any two basis vectors in {∂_a, ∂_κ} are given by

g_{ab}(u) = ⟨∂_a, ∂_b⟩ = B_a^i B_b^j g_ij,
g_{aκ}(u) = ⟨∂_a, ∂_κ⟩ = 0,
g_{κλ}(u) = ⟨∂_κ, ∂_λ⟩ = B_κ^i B_λ^j g_ij.        (2.4)

Figure 3
The basis vector ∂_a may change in its direction as point u moves in M. The change is measured by the α-covariant derivative ∇^{(α)}_{∂_b} ∂_a of ∂_a in the direction ∂_b, where the notion of a connection is necessary, because we need to compare two vectors ∂_a and ∂'_a belonging to different tangent spaces T_{θ(u)}(S) and T_{θ(u+du)}(S). The α-covariant derivative ∇^{(α)}_{∂_b} ∂_a is calculated in S as

∇^{(α)}_{∂_b} ∂_a = (∂_b B_a^i + B_a^j B_b^k Γ^{(α) i}_{jk}) ∂_i.

When the directions of the tangent space T_u(M) of M do not change as point u moves in M, the manifold M is said to be α-flat in S, where the tangent directions are compared by the α-connection. Otherwise, M is curved in the sense of the α-connection. The α-covariant derivative ∇^{(α)}_{∂_b} ∂_a is decomposed into the tangential component belonging to T_u(M) and the normal component perpendicular to T_u(M). The former component represents the way ∂_a changes within T_u(M), while the latter represents the change of ∂_a in the directions perpendicular to T_u(M), as u moves in M. The normal component is measured by

H^{(α)}_{abκ}(u) = ⟨∇^{(α)}_{∂_b} ∂_a, ∂_κ⟩ = (∂_b B_a^i + B_a^j B_b^k Γ^{(α) i}_{jk}) B_κ^l g_il,        (2.5)

which is a tensor called the α-curvature of the submanifold M in S. It is usually called the imbedding curvature or Euler-Schouten curvature. This tensor represents how M is curved in S. A tensor is a multi-linear mapping from a number of tangent vectors to the real set. In the present case, for A = A^a ∂_a ∈ T_u(M), B = B^b ∂_b ∈ T_u(M) and C = C^κ ∂_κ belonging to the orthogonal complement of T_u(M), we have the multi-linear mapping

H^{(α)}(A,B,C) = H^{(α)}_{abκ} A^a B^b C^κ.

This H^{(α)} is the α-curvature tensor, and the H^{(α)}_{abκ} are its components. The submanifold M is α-flat in S when H^{(α)}_{abκ} = 0 holds.

The m × m matrix

(H^{(α)2})_{ab} = H^{(α)}_{acκ} H^{(α)}_{bdλ} g^{cd} g^{κλ}

represents the square of the α-curvature of M, where g^{cd} and g^{κλ} are the inverse matrices of g_{cd} and g_{κλ}, respectively. Efron called the scalar

γ² = (H^{(1)2})_{ab} g^{ab}

the statistical curvature in a one-dimensional model M, which is the trace of the square of the exponential- or 1-curvature of M in our terminology.
the square of the exponential- or 1-curvature of M in our terminology.
Let θ = θ(t) be a curve in S parametrized by a scalar t. The curve c: θ = θ(t) forms a one-dimensional submanifold in S. The tangent vector ∂_t of the curve is represented in the component form as

∂_t = θ̇^i(t) ∂_i,

or shortly by θ̇, where the dot denotes d/dt. When the direction of the tangent vector ∂_t = θ̇ does not change along the curve in the sense of the α-connection, the curve is called an α-geodesic. By choosing an appropriate parameter t, an α-geodesic θ(t) satisfies the geodesic equation

∇^{(α)}_{θ̇} θ̇ = 0,

or in the component form

θ̈^i + Γ^{(α) i}_{jk} θ̇^j θ̇^k = 0.        (2.6)
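In particular, in a coordinate system in which Γ^{(α)}_{ijk} vanishes identically (an α-affine coordinate system, introduced below), equation (2.6) reduces to θ̈^i = 0, so the α-geodesics are exactly the straight lines of that coordinate system.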

2.3 Duality in α-flat manifold


Once an affine connection is defined in S, we can compare two tangent vectors A ∈ T_θ and A' ∈ T_θ', belonging to different tangent spaces T_θ and T_θ', by the following parallel displacement of a vector. Let c: θ = θ(t) be a curve connecting the two points θ and θ'. Let us consider a vector field A(t) = A^i(t) ∂_i ∈ T_{θ(t)} defined at each point θ(t) on the curve. If the vector A(t) does not change along the curve, i.e., the covariant derivative of A(t) in the direction θ̇ vanishes identically,

∇_{θ̇} A(t) = {Ȧ^k(t) + Γ_{ij}^k A^j(t) θ̇^i} ∂_k = 0,

the field A(t) is said to be a parallel vector field on c. Moreover, A(t') ∈ T_{θ(t')} at θ(t') is said to be a parallel displacement of A(t) ∈ T_{θ(t)} at θ(t). We can thus displace in parallel a vector A ∈ T_θ at θ to another point θ' along a curve θ(t) connecting θ and θ', by making a vector field A(t) which satisfies the differential equation ∇_{θ̇} A(t) = 0, with the boundary conditions θ = θ(0), θ' = θ(1), and A(0) = A ∈ T_θ. The vector A' = A(1) ∈ T_θ' at θ' = θ(1) is the parallel displacement of A from θ to θ' along the curve c: θ = θ(t). We denote it by A' = Π A. When the α-connection is used, we denote the α-parallel displacement operator by Π^{(α)}. The parallel displacement of A from θ to θ' in general depends on the path c: θ(t) connecting θ and θ'. When this does not depend on paths, the manifold is said to be flat. It is known that a manifold is flat when, and only when, the Riemann-Christoffel curvature vanishes identically (see textbooks of differential geometry). A statistical manifold S is said to be α-flat when it is flat under the α-connection.
The parallel displacement does not in general preserve the inner product, i.e., ⟨Π A, Π B⟩ = ⟨A,B⟩ does not necessarily hold. When a manifold has two affine connections with corresponding parallel displacement operators Π and Π*, and moreover when

⟨Π A, Π* B⟩ = ⟨A,B⟩        (2.7)

holds, the two connections are said to be mutually dual. The two operators Π and Π* are considered to be mutually adjoint. We have the following theorem in this regard (Nagaoka and Amari (1982)).

Theorem 2.1. The α-connection and −α-connection are mutually dual. When S is α-flat, it is also −α-flat.
When a manifold S is α-flat, there exists a coordinate system θ = (θ^i) such that

∇^{(α)}_{∂_i} ∂_j = 0    or    Γ^{(α)}_{ijk}(θ) = 0

identically holds. In this case, a basis vector ∂_i is the same at any point in the sense that ∂_i ∈ T_θ is mapped to ∂_i ∈ T_θ' by the α-parallel displacement irrespective of the path connecting θ and θ'. Since all the coordinate curves are α-geodesics in this case, θ is called an α-affine coordinate system. A linear transformation of an α-affine coordinate system is also α-affine.
We give an example of a 1-flat (i.e., α = 1) manifold S. The density functions of an exponential family S = {p(x,θ)} can be written as

p(x,θ) = exp{θ^i x_i - ψ(θ)}

with respect to an appropriate measure, where θ = (θ^i) is called the natural or canonical parameter. From

∂_i ℓ(x,θ) = x_i - ∂_i ψ(θ),    ∂_i ∂_j ℓ(x,θ) = -∂_i ∂_j ψ(θ),

we easily have

g_ij(θ) = ∂_i ∂_j ψ(θ),    Γ^{(α)}_{ijk}(θ) = ((1-α)/2) ∂_i ∂_j ∂_k ψ(θ).

Hence, the 1-connection Γ^{(1)}_{ijk} vanishes identically in the natural parameter θ, showing that θ gives a 1-affine coordinate system. A curve θ^i(t) = a^i t + b^i, which is linear in the θ-coordinates, is a 1-geodesic, and conversely.
Since an -flat manifold is --flat, there exists a --flat coordinate system n = (n^) = (n 1 -...n n ) in an -flat manifold S. Let

3 = 3/a.j be the tangent vector of the coordinate curve n in the new coordin-

34

Shun-ichi Amari

ate system . The vectors O 1 } form a basis of the tangent space T (i.e. at
T n where = ()) of S. When the two bases O } and {a 1 } of the tangent space
T satisfy

at ewery point (or ), where '? is the Kronecker delta (denoting the unit
matrix), the two coordinate systems and are said to be mutually dual.
(Nagoaoka and Amari (1982)).
Theorem 2.2. When S is α-flat, there exists a pair of coordinate systems θ = (θ^i) and η = (η_i) such that i) θ is α-affine and η is −α-affine, ii) θ and η are mutually dual, iii) there exist potential functions ψ(θ) and φ(η) such that the metric tensors are derived by differentiation as
g_{ij}(θ) = ∂_i ∂_j ψ(θ),   g^{ij}(η) = ∂^i ∂^j φ(η),
where g_{ij} and g^{ij} are mutually inverse matrices so that
g_{ij} g^{jk} = δ_i^k
holds, iv) the coordinates are connected by the Legendre transformation
θ^i = ∂^i φ(η),   η_i = ∂_i ψ(θ),   (2.8)
where the potentials satisfy the identity
ψ(θ) + φ(η) − θ·η = 0,   (2.9)
where θ·η = θ^i η_i.

In the case of an exponential family S, ψ becomes the cumulant generating function, the expectation parameter η = (η_i), η_i = E[x_i], is −1-affine, θ and η are mutually dual, and the dual potential φ(η) is given by the negative entropy,
φ(η) = E[log p(x,θ)],
where the expectation is taken with respect to the distribution specified by η.
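The Legendre duality of Theorem 2.2 can be checked numerically. The sketch below (an illustration added here, assuming a Bernoulli family) computes ψ, the dual potential φ as the negative entropy, and verifies identity (2.9).

```python
import numpy as np

# Bernoulli family in natural coordinates: p(x, theta) = exp(theta*x - psi(theta)), x in {0, 1},
# with cumulant generating function psi(theta) = log(1 + exp(theta)).
theta = 1.3
psi = np.log1p(np.exp(theta))

# Expectation parameter via the Legendre relation eta = d psi / d theta (closed form here).
eta = np.exp(theta) / (1.0 + np.exp(theta))

# Dual potential phi(eta) = E[log p] = negative entropy of the Bernoulli(eta) distribution.
phi = eta * np.log(eta) + (1.0 - eta) * np.log(1.0 - eta)

# Identity (2.9): psi(theta) + phi(eta) - theta * eta should vanish.
print(psi + phi - theta * eta)   # ~ 0 up to floating point error
```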

2.4 α-divergence and α-projection


We can introduce the notion of -divergence D ( , ' ) in an - f l a t

manifold S, which represents the degree of divergence from d i s t r i b u t i o n p(x,)


to p ( x , ' ) .

I t is defined by
D (,') = () + (') - ' ,

(2.10)

where 1 = n(') are the -coordinates of the point 1 , i.e., the --coordinates
of the distribution p(x,').

The -divergence satisfies D (,') > 0 with the

equality when and only when = 1 . The --divergence satisfies D (,') =


D (',).

When S is an exponential family, the -1-divergence is the Kullback-

Leibler information,

D^. ) = I[p(x,') : p(x,)] =Jp(x,)log g j ^ ^ dP.


As a preview of later discussion, we may also note that, when S = {p(x)} is the function space of a non-parametric statistical model, the α-divergence is written as
D_α{p(x), q(x)} = 4/(1 − α²) {1 − ∫ p(x)^{(1−α)/2} q(x)^{(1+α)/2} dP}
when α ≠ ±1, and is the Kullback information or its dual when α = −1 or 1.


When and 1 = + d are infinitesimally close,
D (, + d) = 1 g i ,()d 1 d J

(2.11)

holds, so that it can be regarded as a generalization of a half of the square


of the Riemannian distance, although neither symmetry nor the triangular
inequality holds for D . However, the following Pythagorean theorem holds
(Efron (1978) in an exponential family, Nagaoka and Amari (1982) in a general
case).
Theorem 2.3. Let c be an α-geodesic connecting two points θ and θ', and let c' be a −α-geodesic connecting two points θ' and θ'' in an α-flat S. When the two curves c and c' intersect at θ' with a right angle, so that θ, θ' and θ'' form a right triangle, the following Pythagorean relation holds,
D_α(θ, θ') + D_α(θ', θ'') = D_α(θ, θ'').   (2.12)

Let M = {q(x,u)} be an m-dimensional submanifold imbedded in an α-flat n-dimensional manifold S = {p(x,θ)} by θ = θ(u). For a distribution p(x,θ_0) ∈ S, we search for the distribution q(x,û) ∈ M which is the closest distribution in M to p(x,θ_0) in the sense of the α-divergence (Fig. 4a),
min_{u∈M} D_α{θ_0, θ(u)} = D_α{θ_0, θ(û)}.
We call the resulting û(θ_0) the α-approximation of p(x,θ_0) in M, assuming such exists uniquely. It is important in many statistical problems to obtain the α-approximation, especially the −1-approximation. Let c(u) be the α-geodesic connecting a point θ(u) ∈ M and θ_0, c(u): θ = θ(t, u), θ(u) = θ(0, u), θ_0 = θ(1, u) (Fig. 4b). When the α-geodesic c(u) is orthogonal to M at θ(u), i.e.,
⟨θ̇(0; u), ∂_a⟩ = 0,
where ∂_a = ∂/∂u^a are the basis vectors of T_u(M), we call this u the α-projection of θ_0 on M. The existence and the uniqueness of the α-approximation and the α-projection are in general guaranteed only locally. The following theorem was first given by Amari (1982a) and by Nagaoka and Amari (1982) in a more general form.

Figure 4
Theorem 2.4. The α-approximation û(θ_0) of θ_0 in M is given by the α-projection of θ_0 on M.

3. HIGHER-ORDER ASYMPTOTIC THEORY OF STATISTICAL INFERENCE IN CURVED EXPONENTIAL FAMILY

3.1 Ancillary family
Let S be an n-dimensional exponential family parametrized by the natural parameter θ = (θ^i) and let M = {q(x,u)} be an m-dimensional family parametrized by u = (u^a), a = 1, ..., m. M is said to be an (n,m)-curved exponential family imbedded in S = {p(x,θ)} by θ = θ(u), when q(x,u) is written as
q(x,u) = exp[θ^i(u) x_i − ψ{θ(u)}].
The geometrical structures of S and M can easily be calculated as follows. The quantities in S in the θ-coordinate system are

g_{ij}(θ) = ∂_i ∂_j ψ(θ),   T_{ijk}(θ) = ∂_i ∂_j ∂_k ψ(θ).
The quantities in M are
g_{ab}(u) = ⟨∂_a, ∂_b⟩ = B_a^i B_b^j g_{ij},   T_{abc} = B_a^i B_b^j B_c^k T_{ijk},   B_a^i = ∂_a θ^i(u).
Here, the basis vector ∂_a of T_u(M) is the vector
∂_a = B_a^i ∂_i
in T_{θ(u)}(S). If we use the expectation coordinate system η in S, M is represented by η = η(u). The components of the tangent vector ∂_a are given by
B_{ai} = ∂_a η_i(u) = B_a^j g_{ji},
where ∂_a = B_{ai} ∂^i, ∂^i = ∂/∂η_i.
Let x_(1), x_(2), ..., x_(N) be N independent observations from a distribution q(x,u) ∈ M. Then, their arithmetic mean
x̄ = (Σ_{j=1}^N x_(j)) / N
is a minimal sufficient statistic. Since the joint distribution q(x_(1), ..., x_(N); u) can be written as
Π_{j=1}^N q(x_(j), u) = exp[N{θ^i(u) x̄_i − ψ(u)}],
the geometrical structure of M based on N observations is the same as that based on one observation except for a constant factor N. We treat statistical inference based on x̄. Since a point x̄ in the sample space X can be identified with a point η = x̄ in S by using the expectation parameter η, the observed sufficient statistic x̄ defines a point in S whose η-coordinates are given by η = x̄. In other words, we regard x̄ as the point (distribution) in S whose expectation parameter is just equal to x̄. Indeed, this is the maximum likelihood estimator in the exponential family S.
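To make the identification of x̄ with a point of S concrete, the following sketch (an added illustration, not from the text) simulates a (2,1)-curved exponential family — two unit-variance normal coordinates with θ(u) = (u, u²) — and computes the m.l.e. of u from the observed point η = x̄.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# S: two independent N(theta_i, 1) coordinates; natural parameter theta = (theta1, theta2),
# psi(theta) = (theta1**2 + theta2**2) / 2, expectation parameter eta = theta.
# Curved (2,1)-family M: theta(u) = (u, u**2), a parabola imbedded in S.
def theta_of_u(u):
    return np.array([u, u**2])

u_true, N = 0.8, 400
x = rng.normal(loc=theta_of_u(u_true), scale=1.0, size=(N, 2))
xbar = x.mean(axis=0)          # the observed point eta = xbar in S

# Log-likelihood of the curved family based on xbar (up to factors not depending on u).
def neg_loglik(u):
    th = theta_of_u(u)
    return -(th @ xbar - 0.5 * th @ th)

u_hat = minimize_scalar(neg_loglik, bounds=(-5, 5), method="bounded").x
print(xbar, u_hat)             # u_hat is the m.l.e., close to the true value 0.8
```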
Let us attach an (n−m)-dimensional submanifold A(u) of S to each point u ∈ M, such that all the A(u)'s are disjoint (at least in some neighborhood of M, which is called a tubular neighborhood) and the union of the A(u)'s covers S (at least the tubular neighborhood of M). This is called a (local) foliation of S. Let v = (v^κ), κ = m+1, ..., n, be a coordinate system in A(u). We assume that the pair (u,v) can be used as a coordinate system of the entire S (at least in a neighborhood of M). Indeed, a pair (u,v) specifies a point in S such that it is included in the A(u) attached to u and its position in A(u) is given by v (see Fig. 5). Let θ = θ(u,v) be the θ-coordinates of the point specified by (u,v). This is the coordinate transformation of S from w = (u,v) to θ, where w = (u,v) = (w^β) is an n-dimensional variable, β = 1, ..., n, such that its first m components are u = (u^a) and the last n − m components are v = (v^κ).

Figure 5
Any point (in some neighborhood of M) in S can be represented uniquely by w = (u,v). We assume that A(u) includes the point θ = θ(u) on M and that the origin v = 0 of A(u) is put at the point u ∈ M. This implies that η(u,0) is the point θ(u) ∈ M. We call A = {A(u)} an ancillary family of the model M.
In order to analyze the properties of a statistical inference
method, it is helpful to use the ancillary family which is naturally determined
by the inference method. For example, an estimator û can be regarded as a mapping from S to M such that it maps the observed point η = x̄ in S, determined by the sufficient statistic x̄, to a point û(x̄) ∈ M. Its inverse image û^{-1}(u) defines an (n−m)-dimensional subspace A(u) attached to u ∈ M,
A(u) = û^{-1}(u) = {η ∈ S | û(η) = u}.
Obviously, the estimator u takes the value u when and only when the observed x
is included in A(u). These A(u)'s form a family A = {A(u)} which we will call
the ancillary family associated with the estimator u. As will be shown soon,
large-sample properties of an estimator u are determined by the geometrical
features of the associated ancillary submanifolds A(u). Similarly, a test T
can be regarded as a mapping from S to the binary set {r, r̄}, where r and r̄ imply, respectively, rejection and acceptance of a null hypothesis. The

inverse image T^{-1}(r) ⊂ S is called the critical region, and the hypothesis is rejected when and only when the observed point η = x̄ ∈ S is in T^{-1}(r). In order
to analyze the characteristics of a test, it is convenient to use an ancillary
family A = {A(u)} such that the critical region is composed of some of the
A(u)'s and the acceptance region is composed of the other A(u)'s. Such an
ancillary family is said to be associated with the test T.
In order to analyze the geometrical features of ancillary submanifolds, let us use the new coordinate system w = (u,v). The tangent of the coordinate curve w^β is given by ∂_β = ∂/∂w^β. The tangent space T_p(S) at a point p = θ(w) of S is spanned by {∂_β}, β = 1, ..., n. The basis vectors are decomposed into two parts, {∂_β} = {∂_a, ∂_κ}, β = 1, ..., n; a = 1, ..., m; κ = m+1, ..., n. The former part ∂_a = ∂/∂u^a spans the tangent space T_u(M) of M at u, and the latter ∂_κ = ∂/∂v^κ spans the tangent space T_v(A) of A(u). Their components are given by
B_{βi} = ∂_β η_i(w)
in the basis ∂^i. They are decomposed as
B_{ai} = ∂_a η_i(u,v),   B_{κi} = ∂_κ η_i(u,v).
The metric tensor in the w-coordinate system is given by
g_{βγ}(w) = ⟨∂_β, ∂_γ⟩ = B_{βi} B_{γj} g^{ij}.
The metric tensor is decomposed into three parts:
g_{ab}(u,v) = ⟨∂_a, ∂_b⟩ = B_{ai} B_{bj} g^{ij}   (3.2)
is the metric tensor of M,
g_{κλ}(u,v) = ⟨∂_κ, ∂_λ⟩ = B_{κi} B_{λj} g^{ij}   (3.3)
is the metric tensor of A(u), and
g_{aκ}(u,v) = ⟨∂_a, ∂_κ⟩ = B_{ai} B_{κj} g^{ij}   (3.4)
represents the angles between the tangent spaces of M and A(u). When g_{aκ} = 0, M and A(u) are orthogonal to each other at θ(u,0). The ancillary family

A = {A(u)} is said to be orthogonal when g_{aκ}(u) = 0, where f(u) is the abbreviation of f(u,0) when a quantity f(u,v) is evaluated on M, i.e., at v = 0. We may treat an ancillary family A_N which depends on the number N of observations. In this case g_{aκ} also depends on N. When g_{aκ} = ⟨∂_a, ∂_κ⟩ is a quantity of order N^{-1/2}, converging to 0 as N tends to infinity, the ancillary family is said to be asymptotically orthogonal.
The α-connection in the w-coordinate system is given by
Γ^{(α)}_{βγδ}(w) = (∂_β B_γ^i) B_{δi} + (1 − α)/2 · T_{βγδ},
where T_{βγδ} = B_β^i B_γ^j B_δ^k T_{ijk}. The M-part Γ^{(α)}_{abc} gives the components of the α-connection of M and the A-part Γ^{(α)}_{κλμ} gives those of the α-connection of A(u). When A is orthogonal, the α-curvatures of M and A(u) are given respectively by
H^{(α)}_{abκ} = Γ^{(α)}_{abκ},   (3.5)
H^{(α)}_{κλa} = Γ^{(α)}_{κλa}.   (3.6)
The quantities g_{aκ}(u), H^{(α)}_{abκ}(u) and H^{(α)}_{κλa}(u) are fundamental in evaluating the asymptotic properties of statistical inference procedures. When α = 1, the 1-connection is called the exponential connection, and we use the suffix (e) instead of (1). When α = −1, the −1-connection is called the mixture connection, and we use the suffix (m) instead of (−1).
3.2 Edgeworth expansion
We study higher-order asymptotic properties of various statistics with the help of Edgeworth expansions. To this end, let us express the point η = x̄ defined by the observed sufficient statistic in the w-coordinate system. The w-coordinates ŵ = (û, v̂) are obtained by solving
x̄ = η(ŵ) = η(û, v̂).   (3.7)
The sufficient statistic x̄ is thus decomposed into two parts (û, v̂), which together are also sufficient. When the ancillary family A is associated with an estimator or a test, û gives the estimated value or the test statistic,
respectively. We calculate the Edgeworth expansion of the joint distribution of (û, v̂) in geometrical terms. Here, it is necessary further to assume a condition which guarantees the validity of the Edgeworth expansion. We assume that Cramér's condition is satisfied. See, for example, Bhattacharya and Ghosh (1978).
When u_0 is the true parameter of the distribution, x̄ converges to η(u_0, 0) in probability as the number N of observations tends to infinity, so that the random variable ŵ also converges to w_0 = (u_0, 0). Let us put
x̃ = √N{x̄ − η(u_0, 0)},   w̃ = √N(ŵ − w_0),   ũ = √N(û − u_0),   ṽ = √N v̂.   (3.8)

Then, by expanding (3.7), we can express w̃ as a power series in x̃. We can obtain the Edgeworth expansion of the distribution p(w̃; u_0) of w̃ = (ũ, ṽ). However, it is simpler to obtain the distribution of the one-step bias-corrected version w̃* of w̃, defined by
w̃* = w̃ − E_ŵ[w̃],
where E_ŵ denotes the expectation with respect to p(x, ŵ). The distribution of w̃ is obtained easily from that of w̃*. (See Amari and Kumon (1983).)
Theorem 3.1. The Edgeworth expansion of the probability density p(w̃*; u_0) of w̃*, where q(x, u_0) is the underlying true distribution, is given by
p(w̃*; u_0) = n(w̃*; g_{βγ}) {1 + (1/(6√N)) K_{βγδ} h^{βγδ} + (1/N)[A_{βγ} h^{βγ} + (1/24) K_{βγδε} h^{βγδε} + (1/72) K_{βγδ} K_{εζη} h^{βγδεζη}] + O(N^{-3/2})},   (3.9)
where n(w̃*; g_{βγ}) is the multivariate normal density with mean 0 and covariance g^{βγ} = (g_{βγ})^{-1}, h^{βγ}, h^{βγδ}, etc. are the tensorial Hermite polynomials in w̃*, K_{βγδ} and K_{βγδε} are the third- and fourth-order cumulant tensors of w̃, and A_{βγ} collects quadratic combinations of the mixture connection components, e.g. terms of the form Γ^{(m)}_{βδε} Γ^{(m)}_{γζη} g^{δζ} g^{εη}.

The tensorial Hermite polynomials in w̃ with metric g_{βγ} are defined by
h^{β_1 ··· β_k}(w̃) n(w̃; g) = (−1)^k D^{β_1} ··· D^{β_k} n(w̃; g),
where D^β = g^{βγ}(∂/∂w̃^γ), cf. Amari and Kumon (1983), McCullagh (1984). Hence,
h = 1,   h^β = w̃^β,   h^{βγ} = w̃^β w̃^γ − g^{βγ},
h^{βγδ} = w̃^β w̃^γ w̃^δ − g^{βγ} w̃^δ − g^{γδ} w̃^β − g^{δβ} w̃^γ,  etc.

Theorem 3.1 shows the Edgeworth expansion up to order N^{-1} of the joint distribution of ũ* and ṽ*, which together carry the full Fisher information. The marginal distribution can easily be obtained by integration.
Theorem 3.2. When the ancillary family is orthogonal, i.e., g_{aκ}(u) = 0, the distribution p(ũ*; u_0) of ũ* is given by
p(ũ*; u_0) = n(ũ*; g_{ab}) {1 + (1/(6√N)) K_{abc} h^{abc} + N^{-1} C(ũ*)} + O(N^{-3/2}),   (3.10)
where C(ũ*) depends on the estimator only through a term proportional to (H_A^{(m)})²_{ab} h^{ab}, the remaining terms being common to all the orthogonal ancillary families, and
(Γ_M^{(m)})²_{ab} = Γ^{(m)}_{cda} Γ^{(m)}_{efb} g^{ce} g^{df},
(H_M^{(e)})²_{ab} = H^{(e)}_{acκ} H^{(e)}_{bdλ} g^{cd} g^{κλ},
(H_A^{(m)})²_{ab} = H^{(m)}_{κλa} H^{(m)}_{μνb} g^{κμ} g^{λν}.
3.3 Higher-order efficiency of estimation
Given an estimator û: S → M which maps the observed point η = x̄ ∈ S to û(x̄) ∈ M, we can construct the ancillary family A = {A(u)} by
A(u) = û^{-1}(u) = {η ∈ S | û(η) = u}.
The A(u) includes the point η(u) = η(u,0) when and only when the estimator is consistent. (We may treat a case when A(u) depends on N, denoting an ancillary family by A_N(u). In this case, an estimator is consistent if A_N(u) includes η(u,0) in the limit.)
Let us expand the covariance of the estimation error ũ = √N(û − u_0) as
Cov[ũ^a, ũ^b] = g_1^{ab} + g_2^{ab} N^{-1/2} + g_3^{ab} N^{-1} + O(N^{-3/2}).
A consistent estimator is said to be first-order efficient, or simply efficient, when its first-order term g_1^{ab}(u) is minimal among all the consistent estimators at any u, where the minimality is in the sense of positive semidefiniteness of matrices. The second- and third-order efficiency is defined similarly.
Since the first-order term g_1^{ab} is given from (3.9) by
g_1^{ab} = (g_{ab} − g_{aκ} g_{bλ} g^{κλ})^{-1},
the minimality is attained when and only when g_{aκ} = 0, i.e., the associated ancillary family is orthogonal. From this and Theorem 3.2, we have the following results.
Theorem 3.3. A consistent estimator is first-order efficient, iff the associated ancillary family is orthogonal. An efficient estimator is always second-order efficient, because g_2^{ab} = 0.
There exist no third-order efficient estimators in the sense that g_3^{ab}(u) is minimal at all u. This can be checked from the fact that g_3^{ab} includes a term linear in the derivative of the mixture curvature of A(u), see Amari (1985). However, if we calculate the covariance of the bias-corrected version û* = û − b(û), b(u) = E_u[û − u], of an efficient estimator û, we see that there exists the third-order efficient estimator among the class of all the bias-corrected efficient estimators. To state the result, let g_{3ab} = g_3^{cd} g_{ca} g_{db} be the lower-index version of g_3^{ab}.
Theorem 3.4. The third-order term g_{3ab} of the covariance of a bias-corrected efficient estimator û* is given by the sum of the three non-negative geometric quantities
g_{3ab} = (1/2)(Γ_M^{(m)})²_{ab} + (H_M^{(e)})²_{ab} + (1/2)(H_A^{(m)})²_{ab}.   (3.12)

The first is the square of the mixture connection components of M, which depends on the parametrization of M but is common to all the estimators. The second is the square of the exponential curvature of M, which does not depend on the estimator. The third is the square of the mixture curvature of the ancillary submanifold A(u) at η(u), which depends on the estimator. An efficient estimator is third-order efficient, when and only when the associated ancillary family is mixture-flat at η(u). The m.l.e. is third-order efficient, because it is given by the mixture-projection of η = x̄ to M.
The Edgeworth expansion (3.10) tells more about the characteristics of an efficient estimator û*. When (H_A^{(m)})² vanishes, an estimator is shown to be mostly concentrated around the true parameter u and is third-order optimal under a symmetric unimodal loss function. The effect of the manner of parametrizing M is also clear from (3.10). The α-normal coordinate system (parameter), in which the components of the α-connection become zero at a fixed point, is very important (cf. Hougaard, 1983; Kass, 1984).
3.4 Higher-order efficiency of tests
Let us consider a test T of a null hypothesis H_0: u ∈ D against the alternative H_1: u ∉ D in an (n,m)-curved exponential family, where D is a region or a submanifold in M. Let R be the critical region of test T, such that the hypothesis H_0 is rejected when and only when the observed point η = x̄ belongs to R. When T has a test statistic λ(x̄), the equation λ(x̄) = const. gives the boundary of the critical region R. The power function P_T(u) of the test T at point u is given by
P_T(u) = ∫_R p(x̄; u) dx̄,
where p(x̄; u) is the density function of x̄ when the true parameter is u.
Given a test T, we can compose an ancillary family A = {A(u)} such that the critical region R is given by the union of some of the A(u)'s, i.e., it can be written as
R = ∪_{u ∈ R_M} A(u),

where R_M is a subset of M. Then, when we decompose the observed statistic η = x̄ into (û, v̂) by x̄ = η(û, v̂) in terms of the related w-coordinates, the hypothesis H_0 is rejected when and only when û ∈ R_M. Hence, the test statistic λ(x̄) is a function of û only. Since we have already obtained the Edgeworth expansion of the joint distribution of (û, v̂) or of (ũ*, ṽ*), we can analyze the characteristics of a test in terms of the geometry of the associated A(u)'s.
We first consider the case where M = {q(x,u)} is one-dimensional, so that u = (u^a) is a scalar parameter, the indices a, b, etc. becoming equal to 1. We test the null hypothesis H_0: u = u_0 against the alternative H_1: u ≠ u_0. Let u_t be a point which approaches u_0 as N tends to infinity by
u_t = u_0 + t(Ng)^{-1/2},   (3.13)
i.e., the point whose Riemannian distance from u_0 is approximately tN^{-1/2}, where g = g_{ab}(u_0). The power P_T(u_t, N) of a test T at u_t is expanded as
P_T(u_t, N) = P_{T1}(t) + P_{T2}(t) N^{-1/2} + P_{T3}(t) N^{-1} + O(N^{-3/2}).
A test T is said to be first-order uniformly efficient or, simply, efficient, if the first-order term P_{T1}(t) satisfies P_{T1}(t) ≥ P_{T'1}(t) at all t, compared with any other test T' of the same level. The second- and third-order uniform efficiency is defined similarly. Let P̄(u_t, N) be the envelope power function of the P_T(u_t, N)'s, defined by
P̄(u_t, N) = sup_T P_T(u_t, N).   (3.14)
Let us expand it as
P̄(u_t, N) = P̄_1(t) + P̄_2(t) N^{-1/2} + P̄_3(t) N^{-1} + O(N^{-3/2}).
It is clear that a test T is i-th order uniformly efficient, iff
P_{Tk}(t) = P̄_k(t)
holds at any t for k = 1, ..., i.
An ancillary family A = {A(u)} in this case consists of (n−1)-dimensional submanifolds A(u) attached to each u or θ(u) ∈ M. The critical region R is bounded by one of the ancillary submanifolds, say A(u_+), in the one-sided case, and by two submanifolds A(u_+) and A(u_−) in the two-sided unbiased case. The asymptotic behavior of a test T is determined by the geometric features of the boundary ∂R, i.e., A(u_+) [and A(u_−)]. In particular, the angle between M and A(u) is important. The angle is given by the inner product g_{aκ}(u) = ⟨∂_a, ∂_κ⟩ of the tangent ∂_a of M and the tangents ∂_κ of A(u). When g_{aκ}(u) = 0 for all u, A is orthogonal. In the case of a test, the critical region, and hence the associated ancillary A and g_{aκ}(u), depend on N. An ancillary family is said to be asymptotically orthogonal, when g_{aκ}(u) is of order N^{-1/2}. We can assume g_{aκ}(u_0) = 0, and g_{aκ}(u_t) can be expanded as
g_{aκ}(u_t) = Q_{baκ}(u_t − u_0)^b + O(|u_t − u_0|²),   (3.15)
where Q_{baκ} = ∂_b g_{aκ}(u_0). The quantity Q_{baκ} represents the direction and the magnitude of the inclination of A(u) from being exactly orthogonal to M. We can now state the asymptotic properties of a test in geometrical terms (Kumon and Amari (1983), (1985)).
Amari (1983), (1985)).
Theorem 3.5. A test T is first-order uniformly efficient, iff the associated ancillary family A is asymptotically orthogonal. A first-order uniformly efficient test is second-order uniformly efficient.
Unfortunately, there exists no third-order uniformly efficient test (unless the model M is an exponential family). An efficient test T is said to be third-order t_0-efficient, when its third-order power P_{T3}(t) is maximal among all the other efficient tests at t_0, i.e., when P_{T3}(t_0) = P̄_3(t_0), and when there exists no test T' satisfying P_{T'3}(t) > P_{T3}(t) for all t. An efficient test is third-order admissible, when it is t_0-efficient at some t_0. We define the third-order power loss function (deficiency function) ΔP_T(t) of an efficient test T by
ΔP_T(t) = lim_{N→∞} N{P̄(u_t, N) − P_T(u_t, N)} = P̄_3(t) − P_{T3}(t).   (3.16)
It characterizes the behavior of an efficient test T. The power loss function can be explicitly given in geometrical terms of the associated ancillary A (Kumon and Amari (1983), Amari (1983a)).

Theorem 3.6. An efficient test T is third-order admissible only when the mixture curvature of A(u) vanishes as N → ∞ and the A(u) is not exactly orthogonal to M but asymptotically orthogonal, so as to compensate the exponential curvature H^{(e)}_{abκ} of the model M in such a way that
Q_{abκ} = c H^{(e)}_{abκ}   (3.17)
holds for some constant c. The third-order power loss function is then given by
ΔP_T(t) = a_i(t,ε) {c − J_i(t,ε)}² γ²,   (3.18)
where a_i(t,ε) is some fixed function of t and ε (involving the standard normal density), ε being the level of the test,
γ² = H^{(e)}_{abκ} H^{(e)}_{cdλ} g^{ac} g^{bd} g^{κλ}   (3.19)
is the square of the exponential curvature (Efron's statistical curvature) of M, and
J_1(t,ε) = 1 − t/{2u_1(ε)},
J_2(t,ε) = 1 − t/[2u_2(ε) tanh{t u_2(ε)}],
i = 1 for the one-sided case and i = 2 for the two-sided case, u_1(ε) and u_2(ε) being the one-sided and two-sided 100ε% points of the standard normal distribution, respectively.
The theorem shows that a third-order admissible test is characterized by its c value. It is interesting that the third-order power loss function (3.18) depends on the model M only through the statistical curvature γ², so that ΔP_T(t)/γ² gives a universal power loss curve common to all statistical models. It depends only on the value of c. Various widely used tests will next be shown to be third-order admissible, so that they are characterized by c values as follows.
Theorem 3.7. The test based on the maximum likelihood estimator (e.g., the Wald test) is characterized by c = 0. The likelihood ratio test is characterized by c = 1/2. The locally most powerful test is characterized by c = 1 in the one-sided case and c = 1 − 1/{2u_2(ε)²} in the two-sided case. The conditional test conditioned on the approximate ancillary statistic a = H^{(e)}_{abκ} ṽ^κ is characterized also by c = 1/2. The efficient-score test is characterized by c = 1, and is inadmissible in the two-sided case.


We show the universal third-order power loss functions of various tests in Fig. 6 in the two-sided case and in Fig. 7 in the one-sided case, where ε = 0.05 (from Amari (1983a)). It is shown that the likelihood ratio test has fairly good performance throughout a wide range of t, while the locally most powerful test behaves badly when t > 2. The m.l.e. test is good at around t = 3–4.
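Since a_i(t,ε) is left unspecified here, the following added sketch compares only the factor {c − J_i(t,ε)}² of (3.18) for the c values of Theorem 3.7, using the two-sided J_2 as reconstructed above and scipy for the normal quantiles; it reproduces the qualitative ranking described for Figs. 6 and 7.

```python
import numpy as np
from scipy.stats import norm

# Compare the factor {c - J_i(t, eps)}**2 of the third-order power loss (3.18) for the
# c values listed in Theorem 3.7 (the common factor a_i(t, eps) * gamma**2 is omitted).
eps = 0.05
u1 = norm.ppf(1 - eps)          # one-sided 100*eps% point
u2 = norm.ppf(1 - eps / 2)      # two-sided 100*eps% point

def J1(t):
    return 1.0 - t / (2.0 * u1)

def J2(t):
    return 1.0 - t / (2.0 * u2 * np.tanh(t * u2))

tests_two_sided = {"m.l.e. (Wald)": 0.0, "likelihood ratio": 0.5,
                   "locally most powerful": 1.0 - 1.0 / (2.0 * u2**2)}
for t in (1.0, 2.0, 3.0, 4.0):
    losses = {name: (c - J2(t))**2 for name, c in tests_two_sided.items()}
    print(t, losses)
# The likelihood ratio test (c = 1/2) stays moderate over the whole range of t, while the
# locally most powerful test deteriorates for t > 2, in line with the discussion above.
```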
We can generalize the present theory to the multi-parameter case with and without nuisance parameters. It is interesting that none of the above tests is third-order admissible in the multi-parameter case. However, it is easy to modify a test to get a third-order t_0-efficient test by the use of the asymptotic ancillary statistic a (Kumon and Amari, 1985). We can also design the third-order t_0-most-powerful confidence region estimators and the third-order minimal-size confidence region estimators.
It is also possible to extend the present results of estimation and testing to a statistical model with a nuisance parameter ξ. In this case, the set M(u_0) of distributions in which the parameter u of interest takes a fixed value u_0, but ξ takes arbitrary values, forms a submanifold. The mixture curvature and the exponential twister curvature of M(u_0) are responsible for the higher-order characteristics of statistical inference. The third-order admissibility of the likelihood ratio test and others is again proved. See Amari (1985).

Figure 6: universal third-order power loss N ΔP_T(t)/γ² of two-sided tests (efficient score test, locally most powerful test, m.l.e. test, likelihood ratio test), ε = 0.05.

Figure 7: universal third-order power loss N ΔP_T(t)/γ² of one-sided tests (efficient score test = locally most powerful test, m.l.e. test, likelihood ratio test), ε = 0.05.

4. INFORMATION, SUFFICIENCY AND ANCILLARITY: HIGHER-ORDER THEORY

4.1 Information and conditional information
Given a statistical model M = {p(x,u)}, u = (u^a), we can follow Fisher and define the amount g_{ab}(T) of information included in a statistic T = t(x) by
g_{ab}(T) = E[∂_a ℓ(t,u) ∂_b ℓ(t,u)],   (4.1)

where ℓ(t,u) is the logarithm of the density function of t when the true parameter is u. The information g_{ab}(T) is a positive-semidefinite matrix depending on u. Obviously, for the statistic X, g_{ab}(X) is the Fisher information matrix. Let T(X) and S(X) be two statistics. We similarly define, by using the joint distribution of T and S, the amount g_{ab}(T,S) of information which T and S together carry. The additivity
g_{ab}(T,S) = g_{ab}(T) + g_{ab}(S)
does not hold except when T and S are independent. We define the amount of conditional information carried by T when S is known by
g_{ab}(T|S) = E_S E_{T|S}[∂_a ℓ(t|s,u) ∂_b ℓ(t|s,u)],   (4.2)

where ℓ(t|s,u) is the logarithm of the conditional density function of T conditioned on S. Then, the following relation holds,
g_{ab}(T,S) = g_{ab}(T) + g_{ab}(S|T).   (4.3)
From g_{ab}(S|T) = g_{ab}(T,S) − g_{ab}(T), we see that the conditional information g_{ab}(S|T) denotes the amount of loss of information when we discard s from a pair of statistics s and t, keeping only t. Especially,
Δg_{ab}(T) = g_{ab}(X) − g_{ab}(T)
is the amount of loss of information when we keep only t(x) instead of keeping the original x. The following relation is useful for calculation,
Δg_{ab}(T) = E[Cov[∂_a ℓ(x,u), ∂_b ℓ(x,u) | t]],   (4.4)
g_{ab}(S|T) = g_{ab}(T,S) − g_{ab}(T),   (4.5)
where Cov[·,·|t] is the conditional covariance.


A statistic S is sufficient, when 9 a b (S) = 9 a b W or g a b (S) = 0.
When S is sufficient, 9 a b (" s ) = holds for any statistic T. A statistic a is
ancillary, when 9 a b (A) = 0. When A is ancillary, 9 a b (T 5 A) = 9 a b ( l A ) f o r a n y
It is interesting that, although A itself has no information, A together with
another statistic T recovers the amount

of information. An ancillary statistic carries some information in this sense,


and this is the reason why an ancillarity is important in statistical inference.
We call g . (A|T) the amount of information of ancillary A relative to statistic
T.
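As an added illustration of definition (4.1) and of information loss, the sketch below (a hypothetical example, not from the text) computes the information carried by the dichotomized statistic T = 1{x > 0} of a single N(u,1) observation and compares it with the full Fisher information, which equals 1.

```python
import numpy as np
from scipy.stats import norm

# Information carried by a coarsened statistic: X ~ N(u, 1), T = 1{X > 0}.
# g(X) = 1 for all u, while g(T) = phi(u)**2 / (Phi(u) * (1 - Phi(u))) <= 1,
# so Delta g(T) = g(X) - g(T) > 0 is the information lost by keeping only T.
u = 0.5
p = norm.cdf(u)                       # P(T = 1)
h = 1e-5
dlogp1 = (np.log(norm.cdf(u + h)) - np.log(norm.cdf(u - h))) / (2 * h)
dlogp0 = (np.log(norm.sf(u + h)) - np.log(norm.sf(u - h))) / (2 * h)
g_T = p * dlogp1**2 + (1 - p) * dlogp0**2   # E[(d log p_T / du)^2], numerically

g_T_closed = norm.pdf(u)**2 / (norm.cdf(u) * norm.sf(u))
print(g_T, g_T_closed, 1.0 - g_T)     # information of T, its closed form, and the loss
```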
When N independent observations x_1, ..., x_N are available, the Fisher information g_{ab}(X^N) is N g_{ab}(X), N times that of one observation. When M is a curved exponential family, x̄ = Σ x_i / N is a sufficient statistic, keeping the whole information, g_{ab}(X̄) = N g_{ab}(X). Let t(x̄) be a statistic which is a function of x̄. It is said to be asymptotically sufficient of order q, when
Δg_{ab}(T) = g_{ab}(X̄) − g_{ab}(T) = O(N^{-q+1}).   (4.6)
Similarly, a statistic t(x̄) is said to be asymptotically ancillary of order q, when
g_{ab}(T) = O(N^{-q})   (4.7)
holds. (The definition of the order in the present article is different from that by Cox (1980) etc.)

4.2 Asymptotic efficiency and ancillarity
Given a consistent estimator û(x̄) in an (n,m)-curved exponential family M, we can construct the associated ancillary family A. By introducing an adequate coordinate system v in each A(u), the sufficient statistic x̄ is decomposed into two statistics (û, v̂) by x̄ = η(û, v̂). The amount Δg_{ab}(Û) of information loss of the estimator û is calculated from (4.4) by using the stochastic expansion of ∂_a ℓ(x̄, u). Hence, when and only when A is orthogonal, i.e., g_{aκ}(u) = 0, û is first-order sufficient. In this case, û is (first-order) efficient. The loss of information of an efficient estimator û is calculated as
Δg_{ab}(Û) = (H_M^{(e)})²_{ab} + (1/2)(H_A^{(m)})²_{ab} + O(N^{-1}),   (4.8)
where (H_M^{(e)})² is the square of the exponential curvature of the model M and (H_A^{(m)})² is the square of the mixture curvature of the associated ancillary family A at v = 0. Hence, the loss of information is minimized uniformly in u, iff the mixture curvature of the associated ancillary family A(u) vanishes at v = 0 for all u. In this case, the estimator û is third-order efficient in the sense of the covariance in Section 3. The m.l.e. is such a higher-order efficient estimator.
Among all third-order efficient estimators, does there exist one whose loss of information is minimal at all u up to the term of order N^{-1}? Is the m.l.e. such a one? This problem is related to the asymptotic efficiency of estimators of order higher than three. By using the Edgeworth expansion (3.9) and the stochastic expansion of ∂_a ℓ(x̄, u), we can calculate the terms of the information loss of order N^{-1} which depend on the estimator, in geometrical terms of the related ancillary family. The loss of order N^{-1} includes a term related to the derivatives of the mixture curvature H^{(m)}_{κλa} of A in the directions of ∂_a and ∂_κ (unpublished note). From this formula, one can conclude that there exists no estimator whose loss Δg_{ab}(Û) of information is minimal up to the term of order N^{-1} at all u among all other estimators. Hence, the loss of information of the m.l.e. is not uniformly minimal at all u, when the loss is evaluated up to the term of order N^{-1}.


We have already obtained the Edgeworth expansion up to order N~ of
the joi.'t distribution of (u,v), or equivalently (u*,v*) in (3.9). By integration, we have the distribution of v*,
p(v*;u) = n(v*;g ){l + 1 /N K h + OCN" 1 )),
where g ,(u) and K
K

(4.9)

(u) depend on the coordinate system v introduced to each

K y

A(u). The information g u(V*) of v* can be calculated from this. It depends on


the coordinate system v, too. It is always possible to choose a coordinate
system v in each A(u) such that {d } is an orthonormal system at v = 0, i.e.,
g

(u) = 6 . Then, v* is first-order ancillary.


K

It is always possible to

choose such a coordinate system that K

(u) = 0 further holds at v = 0 in eyery

A(u). This coordinate system is indeed given by the ( = - l/3)-normal coordinate system at v = 0. The v* is second-order ancillary in this coordinate
system. By evaluating the term of order N" in (4.9), we can prove that there
exists in general no third-order ancillary v.
However, Skovgaard (1985), by using the method of Chernoff (1949),
showed that one can always construct an ancillary v of order q for any q by
modifying v successively. The q-th order ancillary v i s a function of x
depending on N. Hence, our previous result implies only that one cannot in
general construct the third-order ancillary by using a function of x not depending on N, or by relying on an ancillary family A = {A(u)} not depending on N.
There is no reason to stick to an ancillary family not depending on N, as
Skovgaard argued.
4.3 Decomposition of information
Since (û, v̂) together are sufficient, the information lost by summarizing x̄ into û is recovered by knowing the ancillary v̂. The amount of recovered information g_{ab}(V|U) is equal to Δg_{ab}(Û). Obviously, the amount of information of v̂ relative to û does not depend on the coordinate system of A(u). In order to recover the information of order 1 in Δg_{ab}(Û), not all the components of v̂ are necessary. Some functions of v̂ can recover the full information of order 1. Some other functions of v̂ will recover the information of order N^{-1}, and some others further will recover the information of order N^{-2}. We can decompose the whole ancillary v̂ into parts according to the order of magnitude of the amount of relative information.
The tangent space T_u(A) of the ancillary subspace A(u) associated with an efficient estimator û is spanned by the n − m vectors ∂_κ. The ancillary v̂ can be regarded as a vector ṽ = ṽ^κ ∂_κ belonging to T_u(A). Now we decompose T_u(A) as follows. Let us define
as follows. Let us define
K

a= ( v e ) ' v e ) K >' P>2

a
a

Vi

<4'10)

a
P

which is a tensor representing the higher-order exponential curvature of the


model. When p = 2, it is nothing but the exponential curvature HI?

, and when

p = 3, K . represents the rate of change in the curvature ? , and so on.


For fixed indices a,,...,a , K

is a vector in T (S), and its projection

to T (A) is given by
Let T M (A)_ (p > 2) be the subspace of T,,(A) spanned by vectors K
Up-

,...K

Tu(A)p.

We call

, and let P

-|do

be the orthogonal projection from T (A) to

the p-th order exponential curvature tensor of the model M, where I = (I ) is


the identity operator. The square of the p-th order curvature is defined by
, 2xp _HH (e)
n(e)
M ab " a a . . a p _ 1 H b b . . b p _ l

( H }

g a i b i g
"*

VlVl

"

( I
*1

(4

fe)
There exists a finite p n such that H v ; a vanishes for p > p n .
a
a

0
r p
"
Now let us consider the following sequence of statistics,

T, = {G},

T = H^ e ] (G)v5... .

Moreover, let ta = da ,(x,u), which vanishes if u is the m.l.e. Obviously, the


sequence T 9 , T , ... gives a decomposition of the ancillary statistic v = (v )

Differential Geometrical Theory of Statistics

57

into the higher-order curvature directions of M. Let

V V

2 .W

= {

T }

p-r

Then, we have the following theorems (see Amari (1985)).


as
Theorem 4.1. The set of statistics is asymptotically
sufficient
of order p. The statistic T carries information of
f order p relative to ,,
(4 1 3 )

Theorem 4.2. The Fisher information g a b (X) = Ng . (X) is decomposed


into

<W*> " pl
The theorems imply the following. An efficient estimator û carries all the information of order N. The ancillary ṽ, which together with û carries the remaining smaller-order information, is decomposed into the sum of the p-th order curvature-direction components T_p = H^{(e,p)}_{a_1 ··· a_{p−1} κ}(û) ṽ^κ, which carry all the missing information of order N^{-p+2} relative to S_{p−1}. The proof is obtained by expanding ∂_a ℓ(x̄, u), with ũ = √N(u − û), as
∂_a ℓ(x̄, u) = ∂_a ℓ(x̄, û) + Σ_p (1/p!) N^{-p/2} ∂_{a_1} ··· ∂_{a_p} ∂_a ℓ(x̄, û) ũ^{a_1} ··· ũ^{a_p},
and by calculating g_{ab}(T_p | S_{p−1}). The information carried by ∂_{a_1} ··· ∂_{a_p} ∂_a ℓ(x̄, û) is equivalent, up to the necessary order, to that carried by the curvature-direction component H^{(e,p+1)}_{a a_1 ··· a_p κ} ṽ^κ relative to S_p.
4.4 Conditional inference
When there exists an exact ancillary statistic a, the conditionality principle requires that statistical inference should be done by conditioning on a. However, there exist no non-trivial ancillary statistics in many problems. Instead, there exists an asymptotically ancillary statistic v̂, which can be refined to be higher-order ancillary. The asymptotic ancillary statistic carries information of order 1, and is very useful in improving higher-order characteristics of statistical inference. For example, the conditional covariance of an efficient estimator is evaluated by
N Cov[û^a, û^b | v̂] = (g_{ab} + H^{(e)}_{abκ} ṽ^κ)^{-1} + higher-order terms,
where g_{ab} + H^{(e)}_{abκ} ṽ^κ = −∂_a ∂_b ℓ(x̄, û) is the observed Fisher information. When two groups of independent observations are obtained, we cannot get a third-order efficient estimator for the entire set of observations by combining only the two third-order efficient estimators û_1 and û_2 for the respective samples. If we can use the asymptotic ancillaries H^{(e)}_{abκ} ṽ_1^κ and H^{(e)}_{abκ} ṽ_2^κ, we can calculate the third-order efficient estimator (see Chap. 5). Moreover, the ancillary H^{(e)}_{abκ} ṽ^κ can be used to change the characteristics of an efficient test and of an efficient interval estimator. We can obtain the third-order t_0-efficient test or interval estimator by using the ancillary for any given t_0. It is interesting that the conditional test conditioned on the asymptotic ancillary v̂ is third-order admissible and its characteristic (deficiency curve) is the same as that of the likelihood-ratio test (Kumon and Amari (1983)).
In the above discussions, it is not necessary to refine v̂ to be a higher-order asymptotic ancillary. The curvature-direction components H^{(e)}_{abκ} ṽ^κ are important, and the other components play no role. Hence, we may say that H^{(e)}_{abκ} ṽ^κ is useful not because it is (higher-order) ancillary but because it recovers necessary information. It seems that we need a more fundamental study on the invariant structures of a model to elucidate the conditionality principle and ancillarity (see Kariya (1983), Barndorff-Nielsen (1987)). There are many interesting discussions in Efron and Hinkley (1978), Hinkley (1980), Cox (1980), Barndorff-Nielsen (1980). See also Amari (1985).
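The role of the observed Fisher information can be illustrated numerically; the sketch below is an added example in the spirit of the Efron and Hinkley (1978) discussion cited here, assuming a Cauchy location family, and compares observed and expected information at the m.l.e.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

# Observed versus expected Fisher information in a Cauchy location family.
N = 50
x = rng.standard_cauchy(N) + 2.0

def negloglik(u):
    return np.sum(np.log(1.0 + (x - u)**2))

u_hat = minimize_scalar(negloglik, bounds=(-10, 10), method="bounded").x

h = 1e-4
observed = (negloglik(u_hat + h) - 2 * negloglik(u_hat) + negloglik(u_hat - h)) / h**2
expected = N * 0.5      # expected Fisher information of one Cauchy observation is 1/2

print(u_hat, observed, expected)
# The observed information fluctuates around the expected one from sample to sample;
# conditioning on (approximate) ancillaries amounts to using the observed value.
```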

5. FIBRE-BUNDLE THEORY OF STATISTICAL MODELS

5.1 Hilbert bundle of a statistical model
In order to treat general statistical models other than curved exponential families, we need the notion of the fibre bundle of a statistical model. Let M = {q(x,u)} be a general regular m-dimensional statistical model parametrized by u = (u^a). To each point u ∈ M, we associate a linear space H_u consisting of functions r(x) in x defined by
H_u = {r(x) | E_u[r(x)] = 0, E_u[r²(x)] < ∞},   (5.1)
where E_u denotes the expectation with respect to the distribution q(x,u). Intuitively, each element r(x) ∈ H_u denotes a direction of deviation of the distribution q(x,u) as follows. Let ε a(x) be a small disturbance of q(x,u), where ε is a small constant, yielding another distribution q(x,u) + ε a(x), which does not necessarily belong to M. Here, ∫ a(x) dP = 0 should be satisfied. The logarithm is written as
log{q(x,u) + ε a(x)} = ℓ(x,u) + ε r(x) + O(ε²),
where ℓ(x,u) = log q(x,u). If we put
r(x) = a(x)/q(x,u),
it satisfies E_u[r(x)] = 0. Hence, r(x) ∈ H_u denotes the deviation of q(x,u) in the direction a(x) = r(x) q(x,u). The condition E_u[r²] < ∞ implies that we consider only deviations having a second moment. (Note that, given r(x) ∈ H_u, the function
q(x,u) + ε r(x) q(x,u)
does not necessarily represent a probability density function, because the positivity condition
q(x,u) + ε r(x) q(x,u) > 0
might be broken even when ε is an infinitesimally small constant.)
We can introduce an inner product in the linear space H_u by
⟨r(x), s(x)⟩_u = E_u[r(x) s(x)]
for r(x), s(x) ∈ H_u. Thus, H_u is a Hilbert space. Since the tangent vectors ∂_a ℓ(x,u), which span T_u(M), satisfy E[∂_a ℓ] = 0, E[(∂_a ℓ)²] = g_{aa}(u) < ∞, they belong to H_u. Indeed, the tangent space T_u(M) of M at u is a linear subspace of H_u, and the inner product defined in T_u is compatible with that in H_u. Let N_u be the orthogonal complement of T_u in H_u. Then, H_u is decomposed into the direct sum
H_u = T_u ⊕ N_u.

The aggregate of all the H_u's attached to every u ∈ M, with a suitable topology,
H(M) = ∪_{u∈M} H_u,   (5.2)
is called the fibre bundle with base space M and fibre space H. Since the fibre space is a Hilbert space, it is called a Hilbert bundle of M. It should be noted that H_u and H_{u'} are different Hilbert spaces when u ≠ u'. Hence, it is convenient to establish a one-to-one correspondence between H_u and H_{u'} when u and u' are neighboring points in M. When the correspondence is affine, it is called an affine connection. Let us assume that a vector r(x) ∈ H_u at u corresponds to r(x) + dr(x) ∈ H_{u+du} at a neighboring point u + du, where d denotes an infinitesimally small change. From
E_{u+du}[r(x) + dr(x)] = ∫ {q(x,u) + dq(x,u)}{r(x) + dr(x)} dP = E_u[r] + E_u[dr(x) + ∂_a ℓ(x,u) r(x) du^a] = 0
and E_u[r] = 0, we see that dr(x) must satisfy
E_u[dr] = −E_u[r ∂_a ℓ] du^a,
where we neglected higher-order terms. This leads us to the following defini-

tion of the α-connection: When dr(x) is given by
dr(x) = −(1+α)/2 E_u[∂_a ℓ r] du^a − (1−α)/2 ∂_a ℓ r du^a,   (5.3)
the correspondence is called the α-connection. More formally, the α-connection is given by the following α-covariant derivative ∇^{(α)}. Let r(x,u) be a vector field, which attaches a vector r(x,u) to every point u ∈ M. Then, the rate of the intrinsic change of the vector r(x,u) as u changes in the direction ∂_a is given by the α-covariant derivative,
∇^{(α)}_a r(x,u) = ∂_a r(x,u) + (1−α)/2 ∂_a ℓ(x,u) r(x,u) + (1+α)/2 E_u[∂_a ℓ r],   (5.4)
where E_u[∂_a ℓ r] = −E_u[∂_a r] is used. The α-covariant derivative in the direction A = A^a ∂_a ∈ T_u(M) is given by
∇^{(α)}_A r = A^a ∇^{(α)}_a r.
The 1-connection is called the exponential connection, and the −1-connection is called the mixture connection.
When we attach the tangent space T_u(M) to each point u ∈ M instead of attaching the Hilbert space H_u, we have a smaller aggregate
T(M) = ∪_{u∈M} T_u(M),
which is a subset of H(M), called the tangent bundle of M. We can define an affine connection in T(M) by introducing an affine correspondence between neighboring T_u and T_{u'}. When an affine connection is given in H(M) such that r ∈ H_u corresponds to r + dr ∈ H_{u+du}, it naturally induces an affine connection in T(M) such that r ∈ T_u(M) ⊂ H_u corresponds to the orthogonal projection of r + dr ∈ H_{u+du} to T_{u+du}(M). It can easily be shown that the geometry of M is indeed that of T(M), so that the α-connection of T(M) or M, which we have defined in Chapter 2, is exactly the one which the present α-connection of H(M) naturally induces. Hence, the α-geometry of H(M) is a natural extension of that of M.
Let u = u(t) be a curve in M. A vector field r(x,t) ∈ H_{u(t)} defined along the curve is said to be α-parallel, when
∇^{(α)}_t r = ṙ + (1−α)/2 ℓ̇ r + (1+α)/2 E_u[ℓ̇ r] = 0   (5.5)

is satisfied, where ṙ denotes ∂r/∂t, etc. A vector r_1(x) ∈ H_{u_1} is the α-parallel shift of r_0(x) ∈ H_{u_0} along a curve u(t) connecting u_0 = u(t_0) and u_1 = u(t_1), when r_0(x) = r(x,t_0) and r_1(x) = r(x,t_1) in the solution r(x,t) of (5.5).


The parallel shift of a vector r(x) from u to u' in general depends on the curve u(t) along which the parallel shift takes place. When and only when the curvature of the connection vanishes, the shift is defined independently of the curve connecting u and u'. We can prove that the curvature of H(M) always vanishes for the α = ±1 connections, so that the e-parallel shift (α = 1) and the m-parallel shift (α = −1) can be performed from a point u to another point u' independently of the curve. Let Π^{(e)}_{u,u'} and Π^{(m)}_{u,u'} be the e- and m-parallel shift operators from u to u'. Then, we can prove the following important theorem.
Theorem 5.1. The exponential and mixture connections of H(M) are curvature-free. Their parallel shift operators are given, respectively, by
Π^{(e)}_{u,u'} r(x) = r(x) − E_{u'}[r(x)],   (5.6)
Π^{(m)}_{u,u'} r(x) = r(x) q(x,u)/q(x,u').   (5.7)
The e- and m-connections are dual in the sense of
⟨r, s⟩_u = ⟨Π^{(e)}_{u,u'} r, Π^{(m)}_{u,u'} s⟩_{u'},
where ⟨·,·⟩_u is the inner product at u.
Proof. Let c: u(t) be a curve connecting two points u = u(0) and u' = u(1). Let r^{(α)}(x,t) be an α-parallel vector field defined along the curve c. Then, it satisfies (5.5). When α = 1, (5.5) reduces to
ṙ^{(e)}(x,t) = −E_{u(t)}[ℓ̇ r^{(e)}].
Since the right-hand side does not depend on x, the solution of this equation with the initial condition r(x) = r^{(e)}(x,0) is given by
r^{(e)}(x,t) = r(x) + a(t),
where a(t) is determined from
E_{u(t)}[r^{(e)}(x,t)] = 0
as
a(t) = −E_{u(t)}[r(x)].
This yields (5.6), where we put u(t) = u'. Since E_{u'}[r(x)] does not depend on the path connecting u and u', the exponential connection is curvature-free. Similarly, when α = −1, (5.5) reduces to
ṙ^{(m)}(x,t) + r^{(m)}(x,t) ℓ̇(x,u(t)) = 0.
The solution is
r^{(m)}(x,t) q(x,u(t)) = a(x),
which yields (5.7). This shows that the mixture connection is also curvature-free. The duality relation is directly checked from (5.6) and (5.7).
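The duality of the e- and m-parallel shifts (5.6) and (5.7) is easy to verify on a finite sample space; the following added sketch does so for a one-parameter family of distributions on five atoms.

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite sample space with 5 atoms; q(x, u) is a one-parameter family of distributions.
def q(u):
    w = np.exp(u * np.arange(5))
    return w / w.sum()

u, u_prime = 0.3, 0.9
qu, qup = q(u), q(u_prime)

# Two random elements of H_u (zero mean under q(., u)).
r = rng.normal(size=5); r -= qu @ r
s = rng.normal(size=5); s -= qu @ s

# e- and m-parallel shifts from u to u', eqs. (5.6) and (5.7).
r_e = r - qup @ r            # exponential shift: subtract the new mean
s_m = s * qu / qup           # mixture shift: reweight by the density ratio

lhs = qu @ (r * s)           # <r, s>_u
rhs = qup @ (r_e * s_m)      # <Pi^e r, Pi^m s>_{u'}
print(lhs, rhs)              # equal: the e- and m-shifts are mutually dual
```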
We have defined the imbedding α-curvature H^{(α)}_{abκ} of a curved exponential family. The concept of the imbedding curvature (which is sometimes called the relative or Euler-Schouten curvature) can be defined for a general M as follows. Let P_N be the projection operator from H_u to N_u, the orthogonal complement of T_u(M) in H_u. Then, the imbedding α-curvature of M is a function in x defined by
H^{(α)}_{ab}(x,u) = P_N ∇^{(α)}_a ∂_b ℓ(x,u),
which is an element of N_u ⊂ H_u. The square of the α-curvature is given by
(H^{(α)})²_{ab} = ⟨H^{(α)}_{ac}, H^{(α)}_{bd}⟩ g^{cd}.   (5.8)
The scalar γ² = g^{ab}(H^{(e)})²_{ab} is the statistical curvature defined by Efron in the one-dimensional case.
5.2 Exponential bundle
Given a statistical model M = {q(x,u)}, we define the following elements in H_u,
X_a = ∂_a ℓ(x,u),   X_{ab} = ∇^{(α)}_a ∂_b ℓ(x,u),   ...,   X_{a_1 ··· a_k} = ∇^{(α)}_{a_1} ··· ∇^{(α)}_{a_{k−1}} ∂_{a_k} ℓ(x,u),
and attach to each point u ∈ M the vector space T_u^{(α,k)} spanned by these vectors, where we assume that they are linearly independent. The aggregate
T^{(α,k)}(M) = ∪_{u∈M} T_u^{(α,k)}   (5.9)
with a suitable topology is then called the α-tangent bundle of degree k of M. All the α-tangent bundles of degree 1 are the same, and are merely the tangent bundle T(M) of M. In the present paper, we treat only the exponential (i.e., α = 1) tangent bundle of degree 2, which we call the local exponential bundle of degree 2, although it is immediate to generalize our results to the general α-bundle of degree k. Note that when we replace the covariant derivative ∇^{(e)} by the partial derivative ∂, we have the so-called jet bundle. Its structure is the same as that of the exponential bundle, because ∇^{(e)} reduces to ∂ in the logarithmic expression ∂_a ℓ(x,u) of tangent vectors.
The space T_u^{(1,2)}, which we will also more briefly denote by T_u^{(2)}, is spanned by the vectors X_1 and X_2, where X_1 consists of the m vectors
X_a(x,u) = ∂_a ℓ(x,u),   a = 1, ..., m,
and X_2 consists of the m(m+1)/2 vectors
X_{ab}(x,u) = ∇^{(e)}_a ∂_b ℓ = ∂_a ∂_b ℓ(x,u) + g_{ab}(u),   a, b = 1, ..., m.
a
(See Fig. 8.) We often omit the indices a or a, b in the notation X_a or X_{ab}, briefly showing them as X_1 or X_2. Since the space T_u^{(2)} consists of all the linear combinations of X_1 and X_2, it is written as
T_u^{(2)} = {θ^i X_i(x,u)},
where the coefficients θ = (θ^i) consist of θ^1 = (θ^a), θ^2 = (θ^{ab}), and
θ^i X_i = θ^1 X_1 + θ^2 X_2 = θ^a X_a + θ^{ab} X_{ab}.
The set {X_i} forms a basis of the linear space T_u^{(2)}. The metric tensor of T_u^{(2)} is then given by
g_{ij} = ⟨X_i, X_j⟩ = E[X_i(x,u) X_j(x,u)].
Here g_{11} denotes the m × m matrix
g_{11} = ⟨X_a, X_b⟩ = E[∂_a ℓ ∂_b ℓ] = g_{ab},
Figure 8
which is the metric tensor of the tangent space T_u(M) of M. The component g_{21} = g_{12} represents
g_{21} = ⟨X_{ab}, X_c⟩ = E[∂_a ∂_b ℓ ∂_c ℓ] = Γ^{(e)}_{abc}.
Similarly, g_{22} is a quantity having four indices,
g_{22} = ⟨X_{ab}, X_{cd}⟩.

The exponential connection can be introduced naturally in the local exponential fibre bundle T^{(2)}(M) of degree 2 by the following principle:
1) The origin of T^{(2)}_{u+du} corresponds to the point X_a(x,u) du^a ∈ T^{(2)}_u.
2) The basis vector X_i(x, u+du) ∈ T^{(2)}_{u+du} is mapped to T^{(2)}_u by 1-parallelly shifting it in the Hilbert bundle H and then projecting it to T^{(2)}_u.
We thus have the affine correspondence of elements in T^{(2)}_{u+du} and T^{(2)}_u,
X_i(u + du) ↔ X_i(u) + dX_i = X_i(u) + Γ^j_{ai} X_j(u) du^a,
where Γ^j_{ai} are the coefficients of the exponential affine connection in T^{(2)}(M). The coefficients are given from the above principle 2) by
Γ^j_{ai} = g^{jk} E[{∇^{(e)}_a X_i(x,u)} X_k(x,u)].   (5.10)
We remark again that the index i = 1 stands for a single index b, for example, and i = 2 stands for a pair of indices, for example b, c.

Let ξ(u) = ξ^i(u) X_i(x,u) ∈ T_u^{(2)} be a point in T^{(2)}. We can shift the point ξ(u) ∈ T_u^{(2)} to a point ξ(u') ∈ T_{u'}^{(2)} belonging to another point u' along a curve u = u(t). Since the point ξ^i(u) X_i(u) ∈ T_u^{(2)} corresponds to the point ξ^i(u+du)(X_i + dX_i) + X_a du^a ∈ T^{(2)}_{u+du}, where dX_i is determined from the affine connection and the last term X_a du^a corresponds to the change in the origin, we have the following equation
ξ̇^j + Γ^j_{ai} ξ^i u̇^a + δ^j_a u̇^a = 0,   (5.11)
whose solution ξ(t) represents the corresponding point in T^{(2)}_{u(t)}, where the term δ^j_a u̇^a has non-zero components only in the degree-1 part. Note that we are here talking about the parallel shift of a point in affine spaces, and not about the parallel shift of a vector in linear spaces, where the origin is always fixed.
Let u' be a point close to u. Let ξ(u'; u) be the point in T_u^{(2)} corresponding to the origin ξ(u') = 0 of the affine space T_{u'}^{(2)}. The map depends in general on the curve connecting u and u'. However, when |u' − u| is small, the point ξ(u'; u) is given by
ξ(u'; u) = (u' − u) + (1/2)(u' − u)² + O(|u' − u|³),
schematically. Hence, if we neglect the term of order |u' − u|³, the map does not depend on the route. In component form,
ξ^a(u'; u) = u'^a − u^a,
ξ^2(u'; u) = ξ^{bc}(u'; u) = (1/2)(u'^b − u^b)(u'^c − u^c),   (5.12)
where we neglected the terms of order |u' − u|³. Since the origin ξ(u') = 0 of T_{u'}^{(2)} can be identified with the point u' (the distribution q(x,u')) in the model M, this shows that, in the neighborhood of u, the model M is approximately represented in T_u^{(2)} as a paraboloid given by (5.12).
Let us consider the exponential family E_u = {p(x,θ;u)} depending on u, whose density function is given by
p(x,θ;u) = q(x,u) exp{θ^i X_i(x,u) − ψ_u(θ)},   (5.13)
where θ is the natural parameter. We can identify the affine space T_u^{(2)} with the exponential family E_u by letting the point θ = θ^i X_i ∈ T_u^{(2)} represent the
Figure 9
distribution p(x,θ;u) ∈ E_u specified by θ. We call E_u the local exponential family approximating M at u. The aggregate
E(M) = ∪_{u∈M} E_u
with suitable topology is called the fibre bundle of local exponential families of degree 2 of M. The metric and connection may be defined from the resulting identification of E(M) with T^{(2)}(M). The distribution q(x,u) exactly corresponds to the distribution p(x,0;u) in E_u, i.e., the origin θ = 0 of E_u or T_u^{(2)}. Hence, the point θ = ξ(u'; u), which is the parallel shift of the origin ξ(u') = 0 of E_{u'}, is the counterpart in E_u of q(x,u') ∈ M, i.e., the distribution p{x, ξ(u'; u); u} ∈ E_u is an approximation in E_u of q(x,u') ∈ M. For a fixed u, the distributions
M_u = {q(x,u'; u)},   q(x,u'; u) = p{x, ξ(u'; u); u},
form an m-dimensional curved exponential family imbedded in E_u (Fig. 9). The point of this construction is that M is approximated by a curved exponential family M_u in the neighborhood of u. The tangent spaces T_u(M) of M and T_u(M_u) of M_u exactly correspond at u, so that their metric structures are the same at u. Moreover, the squares of the imbedding curvatures are the same for both M and M_u at u, because the curvature is obtained from the second covariant derivative of X_a = ∂_a ℓ. This suggests that we can solve statistical inference problems in the curved exponential family M_u instead of in M, provided u is sufficiently close to the true parameter u_0.
5.3 Statistical inference in a local exponential family
Given N independent observations x_(1), ..., x_(N), we can define the observed point X̄(u) ∈ E_u, for each u, by
X̄_i(u) = (1/N) Σ_{j=1}^N X_i(x_(j), u).   (5.14)
We consider estimators based on the statistics X̄(u). We temporarily fix a point u, and approximate the model M by M_u, which is a curved exponential family imbedded in E_u. Let e be a mapping from E_u to M_u that maps the observed X̄(u) ∈ E_u to the estimated value e(u) in M_u when u is fixed, by denoting it as
e(u) = e{X̄(u); u}.
The estimated value depends on the point u at which M is approximated by M_u. The estimator e defines the associated ancillary family A_u = {A_u(u'), u' ∈ M_u} for every u, where
A_u(u') = e^{-1}(u'; u) = {ξ ∈ E_u | e(ξ; u) = u'}.
When the fixed u is equal to the true parameter u_0, M_{u_0} approximates M very well in the neighborhood of u_0. However, we do not know u_0. To get an estimator û from e, let us consider the equation
e{X̄(u); u} = u.
The solution û of this equation is a statistic. It implies that, when M is approximated at û, the value of the estimator e at E_û is exactly equal to û. The characteristics of the estimator û associated with the estimator e in M_u are given by the following geometrical theorems, which are direct extensions of the theorems in the curved exponential family.
Theorem 5.2. An estimator û derived from e is first-order efficient when the associated ancillary family A_u is orthogonal to M_u. A first-order efficient estimator is second-order efficient.
Theorem 5.3. The third-order term of the covariance of a bias-corrected efficient estimator is given by
g_{3ab} = (1/2)(Γ_M^{(m)})²_{ab} + (H_M^{(e)})²_{ab} + (1/2)(H_A^{(m)})²_{ab}.
The bias-corrected maximum likelihood estimator is third-order efficient, because the associated ancillary family has vanishing mixture curvature.
The proof is obtained in the way sketched in the following. The true distribution q(x,u_0) is identical with the distribution q(x, ξ(u_0); u_0) at u_0 of the curved exponential family M_{u_0}. Moreover, when we expand q(x,u) and q(x, ξ(u); u_0) at u_0 in Taylor series, they exactly coincide up to the terms of u − u_0 and (u − u_0)², because E_u is composed of X_1 and X_2. Hence, if the estimation is performed in E_{u_0}, we can easily prove that Theorems 5.2 and 5.3 hold, because the Edgeworth expansion of the distribution of û is determined from the expansion of ℓ(x,u) up to the second order if the bias correction is used. However, we do not know the true u_0, so that the estimation is performed in E_û. In order to evaluate the estimator û, we can map E_û (and M_û) to E_{u_0} by the exponential connection. In estimating the true parameter, we first summarize the N observations into X̄(u), which is a vector function of u, and then decompose it into the statistics X̄(u) = {X̄_1(u), X̄_2(u)}, where e(X̄(û); û) = û. The X̄_2(û) becomes an asymptotic ancillary. When the estimator is the m.l.e., we have X̄_1(û) = 0 and X̄_2(û) = H^{(e)}_{abκ} v̂^κ in M_û. The theorems can be proved by calculating the Edgeworth expansion of the joint distribution of X̄(û) or (û, v̂). The result is the same as before.
We have assumed that our estimator e is based on X̄(u). When a general estimator
u' = f(x_(1), ..., x_(N))
is given, we can construct the related estimator given by the solution of e_f(X̄(û); û) = û, where
e_f(X̄; u) = E_u[f(x_(1), ..., x_(N)) | X̄(u) = X̄].
Obviously, e_f(X̄; u) is the conditional expectation of u' given X̄(u) = X̄. By virtue of the asymptotic version of the Rao-Blackwell theorem, the behavior of e_f is equal to or better than u' up to the third order. This guarantees the validity of the present theory.


The problem of testing the null hypothesis H_0: u = u_0 against H_1: u ≠ u_0 can be solved immediately in the local exponential family E_u. When H_0 is not simple, we can also construct a similar theory by the use of the statistics û and X̄(û). It is possible to evaluate the behaviors of various third-order efficient tests. The result is again the same as before.
We finally treat the problem of getting a better estimator û by gathering asymptotically sufficient statistics X̄(u)'s from a number of independent samples which are subject to the same distribution q(x,u_0) in the same model. To be specific, let x_(1)1, ..., x_(1)N and x_(2)1, ..., x_(2)N be two independent samples, each consisting of N independent observations. Let û_1 and û_2 be the m.l.e.'s based on the respective samples. Let X̄_(i)(û_i) be the observed point in E_{û_i}, i = 1, 2. The statistic X̄_(i) consists of two components X̄_(i) = {X̄_(i)1(û_i), X̄_(i)2(û_i)}, and since û_i is the m.l.e., X̄_(i)1(û_i) = 0 is satisfied. The statistic û_i carries the whole information of order N included in the sample, and the statistic X̄_(i)2(û_i), which is asymptotically ancillary, carries the whole information of order 1 together with û_i. Obviously, X̄_(i)2ab = H^{(e)}_{abκ}(û_i) v̂_(i)^κ is the curvature-direction component statistic in the curved exponential family M_{û_i}.
Given two sets of statistics (û_i, X̄_(i)2(û_i)), i = 1, 2, which summarize the original data, the problem is to obtain an estimator û which is third-order efficient for the 2N observations. Since the two statistics X̄(û_i) give points η_(i) = X̄(û_i) in the different E_{û_i}, in order to summarize them it is necessary to shift these points in parallel to a common E_{u'}. Then, we can average the two observed points in the common E_{u'} and get an estimator û in this E_{u'}. The parallel affine shift of a point in E_u to a different E_{u'} has already been given by (5.11) in the θ-coordinate system. This can be rewritten in the η-coordinate system. In particular, when du = u' − u is of order N^{-1/2} and η(u) is also of order N^{-1/2}, the parallel affine shift of η(u) ∈ E_u to E_{u'} is given in the following expanded form for η = (η_1, η_2), η_1 = (η_a) and η_2 = (η_{ab}):
η_a(u') = η_a(u) + g_{ab} du^b − η_{ab}(u) du^b + (1/2) T_{abc} du^b du^c + O(N^{-3/2}).

Now, we shift the two observed points X̄_(i)(û_i) to a common E_{u'}, where u' may be any point between û_1 and û_2, because the same estimator û is obtained up to the necessary order by using any such E_{u'}. Here, we simply put u' = (û_1 + û_2)/2, and let δ be
δ = (û_1 − û_2)/2.
Then, the point X̄_(1)(û_1) is shifted to X̄_(1)(u') of E_{u'}, and we get similar expressions for X̄_(2) by changing δ to −δ. Since û_i is the m.l.e., X̄_(i)a = 0. The average of X̄_(1) and X̄_(2) in the common E_{u'} gives the estimated observed point X̄(u') = (X̄_1, X̄_2) from the pooled statistics (û_i, X̄_(i)2):
X̄_{1a} = (1/2)(X̄_{(2)ab} − X̄_{(1)ab}) δ^b + (1/2) T_{abc} δ^b δ^c,
X̄_{2ab} = (1/2)(X̄_{(2)ab} + X̄_{(1)ab}).

By taking the m.l.e. in E , based on (X, , X 2 ) , we have the estimator


-a

=u

,a 1 ab,z x.c ^ 1 ab^fmkc.d


" 2 9 (2bc " X l b c ) + 2g cdb 6 6 '

which indeed coincides with that obtained by the equation e(u) = u up to the
third order. Therefore, the estimator u is third-order efficient, so that it
coincides with the m.l.e. based on all the 2N observations up to the necessary
order.
The above result can be generalized in the situation where k
asymptotically sufficient statistics ( u ^ X / ^ ^ ) a r e given in EQ , i = l,...,k,
u. being the m.l.e. from N independent observations. Let
u1 = N i u i /N i .
Moreover, we define the following matrices

72

Shun-ichi Amari

ab -I iab '

=K W "

Then, we have the following theorem.


Theorem 5.4. The bias corrected version of the estimator defined by

u" = G a b [ , G. b c u]
is third-order efficient.
This theorem shows that the best estimator is given by the weighted
average of the estimators from the partial samples, where the weights are
given by G. . . It is interesting that G. , is different from the observed
Fisher information matrix
J

Tab

They are related by

See Akahira and Takeuchi [1981] and Amari [1985].

6.

ESTIMATION OF STRUCTURAL PARAMETER IN THE PRESENCE


OF INFINITELY MANY NUISANCE PARAMETERS

Estimating function and asymptotic variance


Let M = {p(x;,)} be a family of probability density functions of
a (vector) random variable x specified by two scalar parameters and . Let
x,, Xp,...,xN be a sequence of independent observations such that the i-th
observation x. is a realization from the distribution p(x;,.), where both
and . are unknown. In other words, the distributions of x. are assumed to be
specified by the common fixed but unknown parameter and also by the unknown
parameter . whose value changes from observation to observation. We call
the structural parameter and the incidental or nuisance parameter. The problem is to find the asymptotic best estimator .. = N (x, ,Xp5 .. ,x N ) of the
structural parameter , when the number N of observations is large. The asymptotic variance of a consistent estimator is defined by
AV(,) = lim V [ / N ( M - )]
(6.1)
N
N-**>
where V denotes the variance and denotes an infinite sequence = ( . 2 )
of the nuisance parameter. An estimator is said to be best in a class C of
estimators, when its asymptotic variance satisfies, at any ,
AV[,] < A V [ \ ]

for all allowable and for any estimator ' C. Obviously, there does not
necessarily exist a best estimator in a given class C.
Now we restrict our attention to some classes of estimators. An
estimator is said to belong to class C Q , when it is given by the solution of
the equation
73

74

Shun-ichi Amari
N
. y ( ) = o ,

where y(x,) is a function of x and only, i.e., it does not depend on . The
function y is called the estimating function.

Let C-j be a subclass of C Q , con-

sisting of all the consistent estimators in C Q . The following theorem is well


known (see, e.g., Kumon and Amari [1984]).
Theorem 6.1. An estimator C Q is consistent if and only if its
estimating function y satisfies
E j [y(x,J] = 0 ,
where E Q

E j [8 y(x,)] + 0 ,

denotes the expectation with respect to p(x;,) and 3 n = 3/3. The

asymptotic variance of an estimator C, is given by


A V ( , ) = lim N V[y(x )] /{(3 0 y)} 2 ,
where 8oy(x.,)/N is assumed to converge to a constant depending on and .

i
Let H

(M) be the Hubert space attached to a point (,)M,

Hfl J M ) = a(x) I Efl [a] = 0 , E. [ a 2 ] < }.


The tangent space T

(M) <= H

(M) is spanned by u(x;,) = 3 Q (x;,) and

v(x;,) = 3 (x;,) . Let w be


w(x .) = u - < u > 2 > v ,
<v >
o

where <v > = <v,v>.

Then, the partial information g A is given by

g = g - v
where g

2 /

= <w2>

'

2
2
= <u >, g ^ = <v >, g ^ = <u,v> are the components of Fisher informa-

tion matrix.

The theorem shows that the estimating function y(x,) of a con-

sistent estimator belongs to H

for any . Hence, it can be decomposed as

" 9

y(x,) = a(,)u(x;,) + b(,)v(x;,) + n(x;,) ,

where n belongs to the orthogonal complement of T r in H n r , i.e.,


,
,
<u,n> = <v,n> = 0 .
The class C, is often too large to guarantee the existence of the
best estimator.

A consistent estimator is said to be uniformly informative

Differential Geometrical Theory of Statistics

75

(Kumon and Amari, 1984) when its estimating function y(x,) can be decomposed as
y(x,) = w(x;,) + n(x;,) .
The class of the uniformly informative estimators is denoted by C..,. A uniformly informative estimator satisfies
<y,w> f = < w 2 > j = g (, ) .
Let Cjy be the class of the information unbiased estimators introduced by
Lindsay [1982], which satisfy a similar relation,
<VW>
= <V >
y
y
' w ,
,

Note that <y,w> = <y,u> holds.


Let us define the two quantities
g() = lim 1 <n(x;,.)2> ,
N*
which depends on the estimating function y(x,) and
g() = limJi g (, .) ,
which latter is common to all the estimators. Then, the following theorem gives
a new bound for the asymptotic variance in the class C,.. (see Kumon and Amari
(1984)).
Theorem 6.2. For an information unbiased estimator
AV[;]

= g"1 + g" 2 g .

We go further beyond this theory by the use of the Hubert bundle theory.
6.2.

Information, nuisance and orthogonal subspaces


We have already defined the exponential and mixture covariant de(r

rivatives v ^ and v ) in the Hubert bundle H = U, r xH r ( M ) . A field


r(x;,)HA
,

(M) defined at all (,) is said to be e-invariant, when v ; e V = 0


d

holds. A field r(x;,) is said to be strongly e-invariant (se-invariant),


when r does not depend on . A se-invariant field is e-invariant. An estimating function y(x,) belonging to C-j is an se-invariant field, and conversely,
an se-invariant y(x,) gives a consistent estimator, provided <u,y> j 0.
Hence, the problem of the existence of a consistent estimator in C Q reduces to

*>

Shun-ichi Amari

the problem of the existence of an se-invariant field in the Hubert bundle


H(M).
We next define the subspace H r of H r by
>

<

= U , { ( m ^ , a ( x ) I a(x)T j l } ,

i.e., the subspace composed of all the m-parael shifts to (,) of the vectors
belonging to the tangent space T
H

, at all (,')'s with common . Then,

is decomposed into the direct sum

where H.

is the orthogonal complement of HT r. We call H^r the orthogonal

subspace at (,). We next define the nuisance subspace H^r at (,) spanned
by the m-parallel shifts

':!,v from (, ) to (,) of the -score vectors


1

v(x;,') = 3 for all .

It is a subspace of H_r > so that we have the

decomposition
I is the orthogonal complement of H
where H>
inNT
HQ,_. It is called the
,
information subspace at (,). Hence,

Any vector r(x;,)H

can uniquely be decomposed into the sum,


,

r(x;,) = r ^ x .) + rN(x;,) + r(x;,) ,


where r ^ H

, r H

N
r

(6.2)

and rH? r are called respectively the I-, N- and 0 ,

parts of r.
We now define some important vectors. Let us first decompose the
-score vector u = 3 A &T n

into the three components. Let u (x;,)H_

be the I-part of the -score uT A r. We next define the vector


x . 1 ) = (m) ^,u(x;, )
in H
,

, which is the m-shift of the -score vector uT

(6.3)
, from (,*) to
,

(,). Let 1 be its I-part. The vectors ^ x j 1 ) in H* where (,) is


fixed, form a curve parametrized by ' in the information subspace H

. When

Differential Geometrical Theory of Statistics


al o f

g l l U ^ x . ' M

77

lie in a hyperplane in lC for all 1 , we say

that u are coplanar. In this case, there exists a vector w H:

for which

<wI,GI(x;,;')> = g (')

(6.4)

holds for any 1 . The vector w (w;,)HQ r is called the information vector.
When it exists, it is unique.
6.3. Existence theorems and optimality theorems
It is easy to show that a field r(x;,) is se-invariant if its
identically Hence, any estimating function
nuisance partt rN vanishes identically.
y(x,)C, is decomposed into the sum
y(x>) = y ^ x .) + y(x;,) .
We can prove the following existence theorems.
Theorem 6.3. The class C, of the consistent estimators is nonempty
if the information subspace H A _ includes a non-zero vector.
Theorem 6.4. The class Cy, of the uniformly informative estimators
in C, is nonempty, if (x;,;') are coplanar. All the uniformly informative
estimators have the identical I-part y (x;,), which is equal to the information vector w (x;,).
Outline of proof of Theorem 6.3. When the class C, is nonempty,
there exist an estimating function y(x,) in C-.. It is decomposed as
y(x,) = y ^ x .) + y(x;,) .
Since y is orthogonal to the tangent space H

we have

<y,u> = 0 .
By differentiating <y(x,)> = 0 with respect to , we have
0 = <ay> + <y,u>

Since <dv> = 0, we see that y ^ x .) j 0, proving that H


includes a non-zero vector. Conversely, assume that there exists a non-zero
vector a(x,) in H*r for some . Then, we define a vector
,

y(x;,') = ( e ) f a(x,) = a(x,) - E ,[a]

Shun-ichi Amari

in each H.r, by shifting a(x,) in parallel in the sense of the exponential


connection.

By differentiating <a>

= E Ja"] with respect to , we have

3 <a> = <a a> + <a,v> = 0 ,


because a does not include and a is orthogonal to H

. This proves

Hence, the above y(x .') does not depend on ' so that it is an estimating
function belonging to C j.

Hence, C-j is nonempty, proving theorem 6.3.

Outline of proof of Theorem 6.4.

Assume that there exists an

estimating function y(x,) belonging to Cyj. Then, we have


<y,u(x;,)> j = g () ,
because of <y,v> = 0. Hence, when we shift y in exponential parallel and we
shift u in mixture parallel along the -axis, the duality yields

or
<y I (x;,), iiI(x;,;l)>= g Q (') .
This shows that are coplanar, and the information vector w is given by
projecting y to H

,. Conversely, when are coplanar, there exists the

information vector w H:r. We can extend it to any 1 by shifting it in exponential parallel,

y(,e) =( e ) ^ V ,
which yields an estimating function belonging to Cyj.
The classes C, and Cyj are sometimes empty.
example later.

We will give an

Even when they are nonempty, the best estimators do not neces-

sarily exist in C-. and in Cx... The following are the main theorems concerning
best estimators.

(See Lindsay (1982) and Begun et al. (1983) for other

approaches to this problem.)


Theorem 6.5.

A best estimator exists in C,, iff the vector field

u (x;,), which is the I-part of the -score u, is e-invariant.

The best

estimating function y(x,) is given by the e-invariant u , which in this case


is se-invariant.

Differential Geometrical Theory of Statistics

79

Theorem 6.6. A best estimator exists in C. j, iff the information


vector w (x;,) is e-invariant.

The best estimating function y is given by

the e-invariant w , which in this case is se-invariant.


Outline of proofs.
function is y(x,).

Let be an estimator in C-j whose estimating

It is decomposed into the following sum,

y(x,) = c(,) u 1 + a J (x;,) + y(x;,) ,


where u (x,) is the projection of u(x;,) to H, _., c(,) is a scalar, and
j

a H

_ is orthogonal to u

in H Q r .
,

The asymptotic variance of is calculated

as
A V [ ; ] = lim N { ( C2 A. + B . ) } / { ( C . A . ) 2 } ,


N^co

where = (^g,...), c. = c(,.), and


A = <u ,u > > ,

B, = <(a I (x)) 2 >

+ <(y)2>

From this, we can prove that, when and only when B. = 0, the estimator is
uniformly best for all sequences . The best estimating function is u
for = (,,, . . . ) . Hence it is required that u

is se-invariant.

(x;,)

This

proves Theorem 6.5. The proof of Theorem 6.6 is obtained in a similar manner
by using w
6.4.

instead of u .

Some typical examples:

nuisance exponential family

The following family of distributions,


p(x;,) = exp{s(x,) + r(x,) - (,)}
is used frequently in the literature treating the present problem.

(6.5)
When

is fixed, it is an exponential family with the natural parameter , admitting


a minimal sufficient statistic s(x,) for . We call this an n-exponential
family.

We can elucidate the geometrical structures of the present theory by

applying it to this family.

The tangent vectors are given by

u = afts + a r - a ,

v = s - 3 .

The m-parallel shift of a(x) from (,') to (,) is

80

Shun-ichi Amari

( m

^.

a(x) = a(x)exp{( - ')s - () + (')} .

From this follows a useful Lemma.


Lemma. The nuisance subspace H. _ is composed of random variables
of the following form,
H

, =

{f

C s ( ' ) " c ( >)]} .

where f is an arbitrary function and c(,) = E [f(s)]. The I-part a of


a(x) is explicitly given as
a ! (x) = a(x) - E. [a(x) | s(x,)] ,

(6.6)

by the use of the conditional expectation E[a|s]. The information subspace


H^

is given by

for any f, where h = 3 f + f.


We first show the existence of consistent estimators in C, by
applying Theorem 6.3.
Theorem 6.7. The class C-. of consistent estimators is nonempty in
an n-exponential family, unless both s and r are functionally dependent on s,
i.e., unless
1

(a^) = O ^ ) = 0 .
On the other hand, a consistent estimator does not necessarily exist
in general. We give a simple example: Let x = (x-^Xp) be a pair of random
variables taking on two values 0 and 1 with probabilities
P(x! = 0) = 1/(1 + exp + }) ,
P(x 2 = 0) = 1/(1 + exp{k()}) ,
where k is a known nonlinear function. The family M is of n-exponential type
only when k is a linear function. We can prove that H

= {0}, unless k is

linear. This proves that there are no consistent estimators in this problem.
Now we can obtain the best estimator when it exists for
n-exponential family. The I-part of the -score u is given by

Differential Geometrical Theory of Statistics

81

It is e-invariant, when and only when (3.s) = 0.


Theorem 6.8. The optimal estimator exists in C-j when and only when
(3 Q s) = 0, i.e., 30s(x,e) is functionally dependent on s. The optimal
estimating function is given in this case by the conditional score u 1 = (3 r ) 1 =

3 Q r - E[3 r I s], and moreover the optimal estimator is information unbiased in

this case.
According to Theorem 6.4, in order to guarantee the existence of
uniformly informative estimators, it is sufficient to show the coplanarity of
(x;,;'), which guarantees the existence of the information vector
w(x;,)H* r. By putting w = hsMs.s) 1 + f(s)(3 r ) 1 , this reduces to the
integral-differential equation in f,
<w,'(3 s) 1 + (3 r) I > I = g (') .

(6.7)

When the above equation has a solution f(s;,), are coplanar and the information vector w exists. Moreover, we can prove that when (3 Q r) = 0, the
information vector w is e-invariant.
Theorem 6.9. The best uniformly informative estimator exists when
(3 r) = 0. The best estimating function is given by solving

Efl .,[h(s)V[3 s I s]] = gflfl(')/' ,

(6.8)

where h(s ) does not depend on and V[a s | s] is the conditional covariance.
We give another example to help understanding.

Let x = (x-^Xp) be

a pair of independent normal random variables, x-,^N(,l), X2^N(,l). Then,


the logarithm of their joint density is
2
2
A(x .) = - \ [(x1 - ) + (x 2 - ) - log(2)]
= s(x,) + r(x,) - (,) ,
where s(x,) = x , + X2 2,, r(x,) = - (x2 + x2)/2, (,) = 2(l + 2)/2 +
log(2).

From 3s = x 0 , dr = 0, we have
= (x2 - x^/d + 2 ) ,

(Sg)1 = 0.

82

Shun-ichi Amari

Hence, from Theorems 6.7 and 6.8, the class C, is nonempty, but the best
estimator does not exist in C-.. Indeed, we have
u J (x;,) = (x 2 - X-, )/( + 2 ) ,
which depends on so that it is not e-invariant. Since any vector w in H
can be written as
w = h(s)(3 s) 1
for some h(s;,), the information vector w (x;,)H

can be obtained by

solving (6.4) or (6.7), which reduces in the present case to


E > [h(s)(x 2 - X l )] = (l + 2 ) .
Hence, we have
h(s) = s/(l + 2 ) ,
which does not depend on . Therefore, there exists a best uniformly informative estimator whose estimating function is given by
y(x,) = w J (x,) = h(s)(8 s) T = (x 2 - X 1 ) ( X 1 + X 2 ) / ( 1 + 2 ) 2
or equivalently by (x - x,)(x, + X 2 ) .
not information unbiased.

This is the m.l.e. estimator. This is

7.

PARAMETRIC MODELS OF STATIONARY GAUSSIAN TIME SERIES

-representation of spectrum
Let M be the set of all the power spectrum functions S() of
zero-mean discrete-time stationary regular Gaussian time series, S() satisfying the Paley-Wiener condition,
Nog S()d > - oo .
Stochastic properties of a stationary Gaussian time series x t h t = ..., -1, 0,
1, 2, ..., are indeed specified by its power spectrum S(), which is connected
with the autocovariance coefficients c. by
C. = 27 f S() COStd ,

(7.1)

S() = CQ + 2 tlQ C t COSt ,

(7.2)

where
c

t=

E[x

rVt]

for any r. A power spectrum S() specifies a probability measure on the


sample space X = {x t } of the stochastic processes. We study the geometrical
structure of the manifold M of the probability measures given by S(). A
specific parametric model, such as the AR model MAR of order n, is treated as a
submanifold imbedded in M.
Let us define the -representation jr '() of the power spectrum
S() by

r - {S()
{S() ,
(7.3)
log S() ,

83

= 0 .

84

Shun-ichi Amari

(Remark: It is better to define the -representation by - (l/)[S()"- 1].


However, calculations are easier in the former definition, although the following discussions are the same for both representations.) We impose the regularity condition on the members of M that J T 0 ^ can be expanded into the Fourier
series for any as
ft () =

2 . i COSt ,

(7.4)

where
.

= \SL

()cStd ,

t = 0 1, 2, ....

We may denote the ( ) () specified by = {[ ) } by , ( ) (; ( ^. An


infinite number of parameters U i } together specify a power function by
=|0
()

S(; ) = J

(7.5)
= 0.

Therefore, they are regarded as defining an infinite-dimensional coordinate


system in M. We call ' the -coordinate system of M. Obviously, the -1coordinates are given by the autocovariances, }~ ' = c... The negative of the
1-coordinates | , which are the Fourier coefficients of S" (), are denoted
by c. and are called the inverse autocovariances, }

- c+

7.2. Geometry of parametric and non-parametric tine-series models


Let M be a set of the power spectra S( u) which are smoothly
specified by an n-dimensional parameter u = (u a ), a = 1, 2, ..., n, such that
M becomes a submanifold of M., e.g., M could be an autoregressive process.
This M is called a parametric time-series model. However, any member of M can
be specified by an infinite-dimensional parameter u, e.g., by the -coordinates

() = {| )}, t = 0, 1, ... in the form S ( , ^ ) .

The following discussions

are hence common to both the parametric and non-parametric models, irrespective
of the dimension n of the parameter space.
We can introduce a geometrical structure in M or M in the same
manner as we introduced before in a family of probability distributions on

D i f f e r e n t i a l Geometrical Theory of S t a t i s t i c s

85

sample space X, except that X = { t > is infinite-dimensional in the present


time-series case (see Amari, 1983 c ) . Let p ( x , , . . . ,x , u) be the j o i n t proba b i l i t y density of the T consecutive observations x,,...,x_ of a time series
specified by u. Let
* ( x 1 , . . . , x ; u ) = log p ( x ] , . . . s x ; u ) .

Then, we can introduce in M or M the following geometrical structures as


before,

9ab (u)= l im TE [ 3 a W ]
-*

Ull- j E [ { W-WbVV ]
However, the limiting process is tedious, and we define the geometrical structure in terms of the spectral density S() in the following.
Let us consider the tangent space T at u of M or M , which is
spanned by a finite or infinite number of basis vectors 3 = a/3u associated
a
with the coordinate system u. The -representation of 9 is the following funca
tion in ,
Hence, in M, the basis d[a' associated with the -coordinates ^ is
1 ,

t= 0

2C0St ,

t =f 0 .

Let us introduce the inner product g . of 9 and 3. in T by

where E is the operator defined at u by


2

E [a( )] = |{S(w;u)} a()d .


The above inner product does not depend on , and is written as

We next define the -covariant derivative Vg 3 bo f a b nt h e

ye>

Shun-ichi Amari

direction of 3aa by the projection of 3a3,Djr ' to Tu . Then, the components of


the -connection are given by
auc

d u e

aD

If we use O-representation, it is given by


ia 3.log
S - 3alog S3.log
S)3c log S d .
u
o
From (7.4) and (7.7), we easily see that the -connection vanishes in M
identically, if the -coordinate system ^ ' is used. Hence, we have
Theorem 7.1. The non-parametric M is -flat for any . The
-affine coordinate system is given by ^'.

The two-coordinate systems ^ '

and ^~' are mutually dual.


Since M is -flat, we can define the -divergence from S.() to
S 2 () in M.

It is calculated as follows.
Theorem 7.2. The -divergence from S, to Sp is given by
/ 2 ) Jf {[SI9 ()/SI ()] - 1 - lg[S29 /S1 ]}d , j 0
(1/2) f [log S () - log S 9 ()] 2 d ,

= 0.

7.3. -flat models


An -model M of order n is a parametric model such that the
-representation of the power spectrum of a member in M| is specified by n + 1
parameters u = (u ), k = 0, 1,... ,n, as
r )(;u) = u n + 2 . , u. cos k .
U

K"~ I

Obviously, M is -flat (and hence --flat), and u is its -affine coordinate


system.

The AR-model MAR


n of order n consists of the stochastic processes
defined recursively by

Jo Vt-k =t
where {.} is a white noise Gaussian process with unit variance and a = (a Q ,
a.,...,a ) is the (n+1)-dimensional parameter specifying the members of M n .

Differential

Geometrical Theory of S t a t i s t i c s

87

Hence, i t is an (n+1)-dimensional submanifold of M. The power spectrum S( a)


of t u e process specified by a is given by
r/

IV

i ki-2

S( a) = | k ^ 0 ake

We can calculate the geometric quantities of MAR in terms of the AR-coordinate


system a

the above expression.


Similarly, the MA-model M

of order n is defined by the pro-

cesses
n
x

k=0 b k t-k

where b = (b n , b, ,...,b ) is the MA-parameter. The power spectrum S( b) of


the process specified by b is
FX P

The exponential model M

S( b) = | b k e

|.

of order n introduced by Bloomfield (1973) is com-

posed of the following power spectra S( e) parameterized by e = (e Q , e,,...,


n
S( e) = exp{e Q + 2 ^ Q e k cos k} .
It is easy to show that the 1-representation of S( a) in M is
given by

n
c. = . a.a. . ,

k = 0, l,...,n

c. = 0 ,

k > n

where

This shows that M n is a submanifold specified by c k = 0, (k > n) in M. Hence,


it coincides exactly with a one-model Nr ', although the coordinate system a is
not 1-affine but curved. Similar discussions hold for MMA.
Theorem 7.3. The AR-model M n coincides with M^', and hence is
l-flat. The MA-model M n coincides with M^~ ', and hence is also l-flat.
FXP
(0)
The exponential model M^

coincides with M^ , and is 0-flat. Since it is

self-dual, it is an (n+1)-dimensional Euclidean space with an orthogonal


Cartesian coordinate system e.

88

Shun-ichi Amari

7.4. -approximation and -projection


Given a parametric model M = {S( u)}, it is sometimes necessary
to approximate a spectrum S() by one belonging to M . For example, given a
finite observations x,, ..., x of {xt)> one tries to estimate u in the parametric model M by obtaining first a non-parametric estimate S() based on x,, ...,
X and then approximating it by S(;u)M . The -approximation of S is the one
that minimizes the -divergence D [S(), S(,u)], uM n

It is well known that

the -1-approximation is related to the maximum likelihood principle. As we


have shown in 2, the -approximation is given by the -projection of S() to
M . We now discuss the accuracy of the -approximation. To this end, we consider a family of nested models {M n } such that M Q I D M 1 ID M 2 => .. .M^ = M. The
{M

nR}'{MnA}

and { M

nXP}

are nested

models, in which M Q is composed of the white

noises of various powers.


Let {M } be a family of the -flat nested models, and let S ( u )
M be the --approximation of S(), where u is the (n+1)-dimensional parameter
given by
S

nMn

The error of the approximation by S M is measured by the --divergence


D_ (S,S n ). We define
= D. (S.S n ) .
S

(7.8)

nMn

I t i s an i n t e r e s t i n g problem to f i n d out how E (S) decreases as n increases.


We can prove the following Pythagorean r e l a t i o n (Fig. 10).

The following theorem is a direct consequence of this relation.


Theorem 7.4. The approximation error EL(S) of S is decomposed as
E

n ^ = Jn - A + A >

(7 9)

Differential Geometrical Theory of Statistics

Figure

89

10

Hence,

= JoD-c
The theorem is proved by the Pythagorean relation for the right
triangle S S S Q composed of the -geodesic S S Q included in MJ and --geodesic
SS intersecting at S perpendicularly. The theorem shows that the approximation error E (S) is decomposed into the sum of the --divergences of the
successive approximations S. , k = n+ ,...,>, where S^ = S is assumed. Moreover, we can prove that the --approximation of S. in M (n < k) is S . In
K
n
n
other words, the sequence {S } of the approximations of S has the following
property that S is the best approximation of S.K (k > n) and that the approximation error E (S) is decomposed into the sum of the --divergences between the
further successive approximations. This is proved from the fact that the 1

geodesic in M connecting two points S and S belonging to M" is completely included in MJ for an -model M .
Let us consider the family {M } of the AR-models. It coincides
with M . Let S be the -1-approximation of S. Let c.(S) and c.(S) be, respectively, the autocovariances and inverse autocovariances. Since c. and c^
are the mutually dual -1-affine and 1-affine coordinate systems, the -1-approx-

90

S h u n - i c h i Amari

imation S of S i s determined by the following r e l a t i o n s


1)

ct(Sn) = ct(S),

t = 0, 1 , . . . , n

2)

ct(Sn) = 0 ,

t = n+1, n+2, . . . .

This implies that the autocovariances of S are the same as those of S up to


t = n, and that the inverse autocovariances c. of S vanish for t > n. Similar
relations hold for any other -flat nested models, where c. and c. are replaced
FXP }
by the dual pair of - and --affine coordinates. Especially, since {M
are the nested Euclidean submanifolds with the self-dual coordinates ^ , their
properties are extremely simple.
We have derived some fundamental properties of -flat nested parametric models. These properties seem to be useful for constructing the theory
of estimation and approximation of time series. Although we have not discussed
about them here, the ARMA-modes, which are not -flat for any , also have interesting global and local geometrical properties.
Acknowledgements
The author would like to express his sincere gratitude to Dr. M.
Kumon and Mr. H. Nagaoka for their collaboration in developing differential
geometrical theory. Some results of the present paper are due to joint work
with them. The author would like to thank Professor K. Takeuchi for his
encouragement. He also appreciates valuable suggestions and comments from the
referees of the paper.

REFERENCES

Akahira, M. and Takeuchi, K. (1981). On asymptotic deficiency of estimators


in pooled samples. Tech. Rep. Limburgs Univ. Centr. Belgium.
Amari, S. (1968). Theory of information spaces

a geometrical foundation of

the analysis of communication systems. RAAG Memoirs 4_9 373-418.


Amari, S. (1980). Theory of information spaces

a differential geometrical

foundation of statistics. POST RAAG Report, No. 106.


Amari, S. (1982a). Differential geometry of curved exponential families

curvatures and information loss. Ann. Statist. J_0, 357-387.


Amari, S.

(1982b). Geometrical theory of asymptotic ancillarity and conditional inference. Biometrika 69, 1-17.

Amari, S.

(1983a). Comparisons of asymptotically efficient tests in terms of


geometry of statistical structures. Bull. Int. Statist. Inst.,
Proc. 44th Session, Book 2, 1190-1206.

Amari, S. (1983b). Differential geometry of statistical inference, Probability


Theory and Mathematical Statistics (ed. Ito, K. and Prokhorov,
J. V . ) , Springer Lecture Notes in Math 1021, 26-40.
Amari, S.

(1983c). A foundation of information geometry. Electronics and


Communication in Japan, 66-A, 1-10.

Amari, S.

(1985). Differential-Geometrical Methods in Statistics. Springer


Lecture Notes in Statistics, 28, Springer.

Amari, S. and Kumon, M.

(1983). Differential geometry of Edgeworth expansions

in curved exponential family, Ann. Inst. Statist. Math. 35A,


1-24.
91

92

Shun-ichi Amari

Atkinson, C. and Mitchell, A. F. (1981). Rao's distance measure, Sankya A43,


345-365.
Barndorff-Nielsen, 0. E. (1980). Conditionality resolutions. Biometrika 67,
293-310.
Barndorff-Nielsen, 0. E. (1987). Differential and integral geometry in
statistical inference. IMS Monograph, this volume.
Bates, D. M. and Watts, D. G. (1980). Relative curvature measures of nonlinearity, J. Roy. Statist. Soc. B40, 1-25.
Beale, E. M. L. (1960). Confidence regions in non-linear estimation. J. Roy.
Statist. Soc. B22, 41-88.
Begun, J. M., Hall, W. J., Huang, W.-M. and Wellner, J. A. (1983).

Informa-

tion and asymptotic efficiency in parametric-nonparametric models.


Ann. Statist. Jl_, 432-452.
Bhattacharya, R. N. and Ghosh, J. K. (1978). On the validity of the formal
Edgeworth expansion. Ann. Statist. J5, 434-451.
Bloomfield, P. (1973). An exponential model for the spectrum of a scalar time
series. Biometrika 60, 217-226.
Burbea, J. and Rao. C. R. (1982). Entropy differential metric, distance and
divergence measures in probability spaces: A unified approach.
J. Multi. Var. Analys. W, 575-596.
Chentsov, N. N. (1972). Statistical Decision Rules and Optimal Inference
(in Russian). Nauka, Moscow, translated in English (1982), AMS,
Rhode Island.
Chernoff, H. (1949). Asymptotic studentization in testing of hypotheses,
Ann. Math. Stat. 20, 268-278.
Cox, D. R. (1980). Local ancillarity. Biometrika 67, 279-286.
Csiszar, I. (1975).

I-divergence geometry of probability distributions and

minimization problems. Ann. Prob. 3_ 146-158.


Dawid, A. P. (1975). Discussions to Efron's paper. Ann. Statist. 3> 12311234.

Differential Geometrical Theory of Statistics

93

Dawid, A. P. (1977). Further comments on a paper by Bradley Efron. Ann.


Statist. 5, 1249.
Efron, B. (1975). Defining the curvature of a statistical problem (with
application to second order efficiency) (with Discussion). Ann.
Statist. 3.. 1189-1242.
Efron, B. (1978). The geometry of exponential families. Ann. Statist. 6^,
362-376.
Efron, B. and Hinkely, D. B. (1978).

Assessing the accuracy of the maximum

likelihood estimator: Observed versus expected Fisher information


(with Discussion). Biometrika 65, 457-487.
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in
a curved exponential family. Ann. Statist. 11, 793-803.
Hinkely, D. V. (1980). Likelihood as approximate pivotal distribution.
Biometrika 67, 287-292.
Hougaard, P. (1983). Parametrization of non-linear models. J. R. Statist.
Soc. B44, 244-252.
James, A. T. (1973). The variance information manifold and the function on it.
Multivariate Analysis (ed. Krishnaiah, P. K.), Academic Press,
157-169.
Kariya, T. (1983). An invariance approach in a curved model. Discussion paper
Ser. 88, Hitotsubashi Univ.
Kass, R. E. (1980). The Riemannian structure of model spaces: A geometrical
approach to inference. Ph.D. Thesis, Univ. of Chicago.
Kass, R. E. (1984). Canonical parametrization and zero parameter effects
curvature. J. Roy. Statist. Soc. B46, 86-92.
Kumon, M. and Amari, S. (1983). Geometrical theory of higher-order asymptotics
of test, interval estimator and conditional inference, Proc. Roy.
Soc. London A387, 429-458.
Kumon, M. and Amari, S. (1984). Estimation of structural parameter in the
presence of a large number of nuisance parameters. Biometrika 71,
445-459.

94

Shun-ichi Amari

Kumon, M. and Amari, S. (1985).

Differential geometry of testing hypothesis:

a higher order asymptotic theory in multiparameter curved exponential family, METR 85-2, Univ. Tokyo.
Lauritzen, S. L. (1987). Some differential geometrical notions and their use
in statistical theory.

IMS Monograph, this volume.

Lindsay, B. G. (1982). Conditional score functions: Some optimality results.


Biometrika 69, 503-512.
McCullagh, P. (1984). Tensor notation and cumulants of polynomials.
Biometrika 71, 461-476.
Madsen, L. T. (1979). The geometry of statistical model

a generalization

of curvature. Research Report, 79-1, Statist. Res. Unit., Danish


Medical Res. Council.
Nagaoka, H. and Amari, S. (1982). Differential geometry of smooth families of
probability distributions, METR 82-7, Univ. Tokyo.
Pfanzagl, J. (1982).

Contributions to General Asymptotic Statistical Theory.

Lecture Notes in Statistics J^3, Springer.


Rao, C. R. (1945).

Information and accuracy attainable in the estimation of

statistical parameters. Bull. Calcutta. Math. Soc. 37^ 81-91.


Reeds, J. (1975). Discussions to Efron's paper. Ann. Statist. _3, 1234-1238.
Skovgaard, Ib. (1985). A second-order investigation of asymptotic ancillarity,
Ann. Statist. 13., 534-551.
Skovgaard, L. T. (1984). A Riemannian geometry of the multivariate normal
model, Scand. J. Statist. U_9 211-223.
Yoshizawa, T. (1971). A geometrical interpretation of location and scale
parameters. Memo TYH-2, Harvard Univ.

DIFFERENTIAL AND INTEGRAL GEOMETRY IN STATISTICAL INFERENCE


0. E. Barndorff-Nielsen

1.

Introduction

97

2.

Review and Preliminaries

99

3.

Transformation Models

118

4.

Transformation Submodels

127

5.

Maximum Estimation and Transformation Models

130

6.

Observed Geometries

135

7.

Expansion of c | j |^L

8.

Exponential Transformation Models

152

9.

Appendix 1

154

10. Appendix 2

156

11. Appendix 3

157

12. References

159

147

^Department of Theoretical Statistics, Institute of Mathematics, University of


Aarhus, Aarhus, Denmark

95

1.

INTRODUCTION

This paper gives an account of some of the recent developments in


statistical inference in which concepts and results from integral and differential geometry have been instrumental.
A great many important contributions to the field of integral and
differential geometry in statistics are not discussed or even referred to here,
but a rather comprehensive overview of the field can be obtained from the material compiled in the present volume and from the survey paper by BarndorffNielsen, Cox and Reid (1986).
Section 2 reviews pertinent parts of statistics and of integral
and differential geometry, and introduces some of the terminology and notation
that will be used in the rest of the paper.
A considerable part of the material in sections 3, 4, 5 and 8 and
in the appendices, which are mainly concerned with the systematic theory of
transformation models and exponential transformation models, has not been published elsewhere.
Sections 6 and 7 describe a theory of "observed geometries" and its
relation to an asymptotic expansion of the formula c|j| C for the conditional
distribution of the maximum likelihood estimator; the results there are mostly
taken from Barndorff-Nielsen (1986a). Briefly speaking, the observed geometries on the parameter space of a statistical model consist of a Riemannian
metric and an associated one-parameter family of affine connections, constructed from the observed information matrix and from an auxiliary statistic a chosen such that (,a), where denotes the maximum likelihood estimator of the
97

98

0. E. Barndorff-Nielsen

parameter of the model, is minimal sufficient. The observed geometries and the
closely related expansion of c|j|^L form a parallel to the "expected geometries"
and the associated conditional Edgeworth expansions for curved exponential
families studied primarily by Amari (cf., in particular, Amari 1985, 1986), but
with some essential differences. In particular, the developments in sections 6
and 7 are, in a sense, closer to the actual data and they do not require integrations over the sample space; instead they employ "mixed derivatives of the
log model function." Furthermore, whereas the studies of expected geometries
have been largely concerned with curved exponential families the approach taken
here makes it equally natural to consider other parametric models, and in particular transformation models. The viewpoint of conditional inference has been
instrumental for the constructions in question. However, the observed geometrical calculus, as discussed in section 6, does not require the employment of
exact or approximate ancillaries.
The observed geometries provide examples of the concept of
statistical manifolds discussed by Lauritzen (1986).
Throughout the paper examples are given to illustrate the general
results.

2. REVIEW AND PRELIMINARIES

We shall consider parametrized statistical models M^ specified by


(X^p(x;),) where X. is the sample space, is the parameter space and p(x )
is the model function, i.e. p(x ) = dP /dy for some dominating measure y. The

dimension of the parameter will usually be denoted by d and we write on


coordinate form as ( ,..., ). Generic coordinates of will be indicated as
r s t .
, , , etc.
The present section is organized in a number of subsections and it
serves two purposes: to provide a survey of previous results and to set the
stage for the developments in the following sections.
Combinants.

It is useful to have a term for functions which depend

on both the observation x and the parameter and we shall call any such function a combinant.
Jacobians. Our vectors are row vectors and we denote transposition of a matrix by an asterix *.

If f is a differentiate transformation of

a space _Y then the Jacobian af/ay* of f at yY_ is also denoted by Jr (y), while
we write J f (y) for the Jacobian determinant, i.e. J f = |JJ . When appropriate
we interpret J f (y) as an absolute value, without explicitly stating this. We
shall repeatedly use the fact that for differentiate transformations f and g
we have
Jf

o g

( y ) = Jg(y)J f (g(y))

(2.1)

Jf

0 g

(y) = J f (g(y))J g (y).

(2.2)

and hence

99

100

0. E. Barndorff-Nielsen

Foliations. A partition of a manifold of dimension k into submanifolds all of dimension m<k is called a foliation and the submanifolds are said
to be the leaves of the foliation.
A dimension-reducing statistical hypothesis may often, in a natural
way, be viewed as a leaf of an associated foliation of the parameter space .
Likelihood. We let L = L() = L( x) denote an arbitrary version
of the likelihood function for and we set 1 = log L. Furthermore, we write
3 r = 3/3, and 1 = a 1, 1

= 3 9 1, etc. The observed information is the

matrix
j() = -[l r s ]

(2.3)

i() = E j().

(2.4)

and the expected information is

The inverse matrices of j and i are referred to as observed and expected formation, respectively.
Suppose the minimal sufficient statistic t for M^ is of dimension k.
We then speak of M as a (k,d)-model (d being the dimension of the parameter ).
Let (,a) be a one-to-one transformation of t, where is the maximum likelihood estimator of and a, of dimension k-d, is an auxiliary statistic.
In most applications it will be essential to choose a so as to be
distribution constant either exactly or to the relevant asymptotic order. Then
a is ancillary and according to the conditionality principle the conditional
model for given a is considered the appropriate basis for inference on .
However, unless explicitly stated, distribution constancy of a is
not assumed in the following.
There will be no loss of generality in viewing the log likelihood
1 = l() in its dependence on the observation x as being a function of the
minimal sufficient (,a) only. Henceforth we shall think of 1 in this manner
and we will indicate this by writing
1=1(,,a).

Differential and Integral Geometry in Statistical Inference

101

Similarly, in the case of observed information we write


j = j(;,a)
etc.

It turns out to be of interest to consider the function


*(<*>) =*(;a) = l(;,a),

(2.5)

obtained from l(;,a) by substituting for . Similarly we write


() = a-( a) = j( ;,a).

(2.6)

For a general parametric model p(x ) and for a general auxiliary a


a conditional probability function p*(;|a) for given a may be defined by
p*(;|a) = clJl^C

(2.7)

where L is the normed likelihood function, i.e.


C = p(x;)/p(x;),
and where c = c(,a) is a norming constant determined so as to make the integral
of (2.7) with respect to equal to 1.
Suppose now that a is approximately or exactly distribution constant. Then the probability function p*(;|a), given by (2.7), is to be
considered as an approximation to the conditional probability function p(;|a)
of the maximum likelihood estimator given a, cf. Barndorff-Nielsen (1980,
1983).

In general, p*(;|a) is simple to calculate since it only requires

knowledge of standard likelihood quantities plus an integration over the sample


space to determine the norming constant c. Moreover, to sufficient accuracy
this norming constant can often be approximated by (2)~ ' , where d is the
dimension of ; and a more refined approximation to c solely in terms of mixed
derivatives of the log model function is also available, cf. the next subsection
and section 7. In a great number of cases, including virtually all transformation models, p*(;|a) is, in fact, equal to p(;|a). Furthermore, outside
these exactness cases one often has an asymptotic relation of the form
p(;|a) = p*(;|a){l + 0(n" 3/2 )}

(2.8)

uniformly in for /(-) bounded, where n denotes sample size. This holds,
in particular, for (k,d) exponential models. For more details and further

102

0. E. Barndorff-Nielsen

discussion, see Barndorff-Nielsen (1980, 1983, 1984, 1985, 1986a,b) and


Barndorff-Nielsen and Blaesild (1984).
Expansion of c[j| L in the single-parameter case. Suppose is
one-dimensional. From formulas (4.2) and (4.5) of Barndorff-Nielsen and Cox
(1984) we have
cj^L = (-;

(2.9)
3/2

.{1 + 0(n" )}.


Here (w ) denotes the probability density function of the normal distribution
with mean 0 and variance " . Furthermore, C,, A,, and A are given by
C

l = 2?{-3U4

12U

3,1 " 5 U 3

+ 24U

2 , 1 U 3 " 2 4 U 2,1 " 1 2 U 2 , 2 }

(2

10

>

and
^
A 2 (u) = P 3 (u)U 2 > 2

1 2 j l

+ P 2 (u)U 3

P 4 (u)| f l + P 5 (u)U 4 + P 6 ( u ) U 3 J

P 7 (u)U 2

where P (u), i = 1,...,8, are polynomials, the explicit forms of which are
given in Barndorff-Nielsen (1985), and where U =
v
/ x
v
,U, /( )x _ 9 s {r v ; (;,a)}
v,s
.(v+s)/2
*
s

n and U c are defined as


v, u
v, s
= 1,2,3,...

= 0,1,2...

r ^ denoting the v -th order derivative of 1 = l( ;,a) with respect to and


8 S indicating differentiation s times with respect to .

Note that, in the

v+s

repeated sampling situation, U


is of order 0(n"^ " '' ). Hence the
v s
quantities C^, A-j and Ap are of order 0(n

), On"32) and 0(n

), respectively.

Integration of (2.7) yields an approximation to the conditional


distribution of the likelihood ratio statistic
w = 2{l() - l( Q )

(2.11)

Differential

and I n t e g r a l Geometry in S t a t i s t i c a l

Inference

103

for testing a dimension reducing hypothesis Q of . In particular, i f is


a p(r'nt hypothesis, Q = { Q }, we have
p*(w;Q|a) = c e " ^

as an app^ imation to p(w; Q |a).

/
|j|^2d
Iw , a

(2.12)

(The leading term of (2.9) together with

(2.12) yields the usual approximation for w. For a connection to Bartlett


adjustment factors see Barndorff-Nielsen and Cox (1984)).
Furthermore, (2.9) may be integrated termwise to obtain expansions
for the conditional distribution function for and, by inversion, for confi-3/2
dence limits for , correct to order 0(n

), conditionally as well as uncon-

ditionally, cf. Barndorff-Nielsen (1985). The resulting expressions allow one


to carry out "conditional inference without conditioning and without integration."
For extensions to the case of multidimensional parameters see
section 7.
Reparametrization. A basic form of invariance is parametrization
invariance of statistical procedures (though parametrization equivariance might
be a more proper term).

If we think of an inference frame as consisting of the

data in conjunction with the model and a particular parametrization of the


model, and of a statistical procedure as a method which leads from the
inference frame to a conclusion formulated in terms of the parametrization of
the inference frame then parametrization invariance may be formally specified
as commutativity of the diagram
inference
^parametrization
frame

frame

inference

procedure

procedure

conclusion

> conclusion
reparametrization

104

0. E. Barndorff-Nielsen

In words, the procedure is parametrization invariant if changing the inference


base by shifting to another parametrization and then applying yields the same
conclusion as first applying and then translating the conclusion so as to be
expressed in terms of the new parametrization.

(We might describe a parametri-

zation invariant procedure as a 0-th order generalized tensor.) Maximum


likelihood estimation and likelihood ratio testing are instances of parametrization invariant procedures.
Example 2.1. Consider any log-likelihood function l(), of a onedimensional parameter . Define the functions r ^ = r*-v^(), v = 1,2,...,
recursively by
r

Lv]

dr

[i]()

(D()/i()^

il/i) 3 5 ,

v = 2,3,...,

and set f^-* = r ^ ( ) . The derivatives f^-" are parametrization invariant,


LJ
i.e. rvl
takes the same value whatever the parametrization employed.
While parametrization invariance is clearly a desirable property,
there are a number of useful, and virtually indispensable, statistical methods
which do not have this property. Thus procecures which rely on the asymptotic
normality of the maximum likelihood estimator, such as the Wald test or standard ways of setting confidence intervals in non-linear regression problems,
are mostly not parametrization invariant. However, in cases of non parametrization invariance particular caution must be exercised, as demonstrated for
instance for the Wald test by Hauck and Donner (1977) and Vaeth (1985).
We shall be interested in how various quantities behave under
reparametrizations of the model M^. Let , of dimension d, be the parameter of
some parametrization of M, alternative to that indicated by . Coordinates of
will be denoted by p , , etc. and we write 3 for a/3p and
r _ r p
/P

2 r, p

/per

etc. Furthermore, we write l() for the log likelihood under the parametriza-

Differential and Integral Geometry in Statistical Inference

105

tion by , though formally this is in conflict with the notation l(), and
correspondingly we let 1 = 9 1 = 9 l(), etc.; similarly for other parameter
dependent quantities. Finally, the symbol

over such a quantity indicates that

the maximum likelihood estimate has been substituted for the parameter.
Using this notation and adopting the summation convention that if a
suffix occurs repeatedly in a single expression then summation over that suffix
is understood, we have
1 = 1 ,
P
r /p
1

= 1 J. S. + 1 J.

p
1

=1
p

rs /p /

(2.13)
v

r /p

,.4.0)/ 7 , + l^., , [3] + 1 ,


rst /p / /
rs /p / J
r /p

'
(2.14)
'

etc., where [3] signifies a sum of three similar terms determined by permutation
of the indices p,,. On substituting for in (2.13) we obtain the wellknown relation
j

"p = J V s < 7 p >

which, now by substitution of for , may be reexpressed as

or, written more explicitly,


j

( a) = 3-rs(;a) ^ ^
9

Equation (2.15) shows that j is a metric tensor on M, for any given value of the
auxiliary statistic a. Moreover, in wide generality 3- will be positive definite
on M^, and we assume henceforth that this is the case. In fact, for any we
have 3- = j, i.e. observed information at the maximum likelihood point, which is
generally positive definite (though counterexamples do exist).
p
Let A() = [A
()] be an array, depending on and where
s
s
l ' q
each of the p + q indices runs from 1 to d. Then A is said to be a (p,q)

tensor, or a tensor of contravariant rank p and covariant rank q, if under

106

0. E. Barndorff-Nielsen

reparametrization from to A obeys the transformation law


p

.p

s-,

p-j

r ,.. .r

Example 2.2. A covariant tensor of rank q is given by

F J

3i

K l

"

Low

/q

In particular, the expected information i is a (0,2) tensor.


The inverse [i S ] of i = [i ] is a contravariant second order
tensor.
The (outer) product of two tensors A

and B
S

is defined as the array C given by


rlV..
S S

1S2

tl

t 2 ...

This product is again a tensor, of rank (p1 + p", q' + q") if (p',qf) and
(p",q") are the ranks of A and B.
Lower rank tensors may be derived from higher rank tensors by contraction, i.e. by pairwise identification of upper and lower indices (which
implies a summation).
The parameter space as a manifold. The parameter space may be
viewed as a (pseudo-) Riemannian manifold with (pseudo-) metric determined by
a metric tensor , i.e. is a rank 2 covariant, regular and symmetric tensor.
o
The associated Riemannian connection v is determined by the Christoffel symbols
Of

where

tu

?* = ?
rs

rsu

and

If v is any affine connection with connection symbols r


these symbols satisfy

then

Differential and Integral Geometry in Statistical Inference

107

s 3 t

and the transformation law

On the other hand, any set of functions [r ] which satisfy the law (2.18)
constitute the connection symbols of an affine connection on . It follows that
all affine connections on are of the form

where the S

L = r + s
rs
rs rs
are characterized by the transformation law

(2 19)
\^-^)

S p T () = S s () / r p ^; t .

(2.20)

If, for a given metric tensor , we define r . and S . by


rsL

r st

rst = rs tu a n d S rst = S rs tu

then (2.18), (2.19) and (2.20) are equivalent to, respectively,


p

rst

/p / /

tu

/p /

'

rst rst

\t-")

S^^./ # , .
rst /p / /

(2.23)
'

rst
and
=

S
p

Thus, in particular, [S .] is a tensor.


Suppose : -> is a mapping of full rank from an open subset B of
a Euclidean space of dimension d^ < d into . Then is said to be an immersion of B in . We denote coordinates of 3 by , , etc. If is a metric
tensor on then the metric tensor on B induced from by is defined by

ab

(a) =

rs

()

/a /b '

(2

'

24)

If r () is a connection on and if r . = r u . then the induced connection

108

0. E.

Barndorff-Nielsen

on B is defined by r b (e) = a b d ( 3 ) C d ( 3 ) and by

a b c ( 3 ) = rst ( ) /a /b /c + tu /ab /c "

(2

'25)

Let G be a group acting smoothly on the parameter space. A metric


tensor is said to be (G-) invariant if
r'
s'
S () = ^ g ^ s ( g ) ^ 3 i f

G.

(2.26)

For a given g let a new parametrization be introduced by = g. From the


transformation law for tensors it follows that is invariant if and only if
r s () = S (g),

gG.

(2.27)

(On the left hand side the tensor is expressed in coordinates, on the right
hand side in coordinates.) Similarly, a connection r is said to be invariant
1f

rj s () = rj $ (g),

gG.

(2.28)

The pseudo-Riemannian connection derived from an invariant metric tensor is


invariant.
In generalization of (2.27) an arbitrary covariant tensor A
is said to be (G-) invariant if

VV
If r

)=

9 ) 9G>

VV '

is a G-invariant connection and if

invariant tensors, with

being a metric tensor, then r

and S . are Gdefined by

1
?*rs = rrs
+ t ursu
S

is a G-invariant connection.
Now, let be the information tensor i on . Then (2.16) takes the
form
?

rst E { V s V -

Obviously,
= E<y$lt}

(2.29)

D i f f e r e n t i a l and I n t e g r a l Geometry i n S t a t i s t i c a l Inference

109

satisfies (2.23) and hence, for any real an affine connection is defined by
?

rst=ElrsV+E{1V

<2 3 0 >

These are the -connections introduced and studied by Chentsov (1972) and
Amari (1982a,b, 1985, 1986).

However, we shall be mainly concerned with another type of connection, determined from observed information, more specifically from the metric
tensor 3-, see sections 6-8. We refer to i and 3- as expected and observed information metric on i, respectively.
Suppose, as above, that :$ -* is an immersion of B in . The
submodel NL of ^ obtained by restricting to lie in = (B) has expected
information
i(3) = ^ H ) ^

(2.31)

Thus 1(3) equals the Riemannian metric induced from the metric i() on to
the imbedded submanifold Q . Furthermore, the -connection of the model ML
equals the connection on Q induced from the -connection on , by the general
construction (2.25).
The measures on defined by
li^d

(2.32)

\3\

(Z.33)

and
are both geometric measures, relative to expected and observed information
metric, respectively.

Note that (2.33) depends on the value of the auxiliary

statistic a. We shall speak of (2.32) and (2.33) as expected and observed


information measure, respectively.

It is an important property of these mea-

sures that they are parametrization invariant. This property follows from
the fact that i and a" are covariant tensors of rank 2. As a consequence we
have that c|j| L (of (2.7)) is parametrization invariant.
Invariant measures. A measure on x is said to be invariant with
respect to a group G acting on X^ if gy = y for all gG.

110

0. E. Barndorff-Nielsen

Invariant measures, when they exist, may often be constructed from


a quasi-invariant measure, as follows.
A measure on X_ is called quasi-invariant with multiplier
x

x(9x) if 9y and are mutually absolutely continuous for ewery gG and if


d ( g ~ M ( x ) = (g,x)d(x).

Furthermore, define a function m on X to be a modulator with associated


multiplier (g,x) if m is positive and
m(gx) = x(g.x)m(x).
Then, if x is quasi-invariant with multiplier (g,x) and if m is a modulator
with the same multiplier we have that
= m" X

is an invariant measure on _X.


As quasi-invariance is clearly a yery weak property the problem in
constructing invariant measures lies mainly in finding appropriate modulators.
It is usually possible to specify the modulators in terms of Jacobians.
In particular, in applications it is often the case that X_ is an
open subset of a Euclidean space. By the standard theorem on transformation
of integrals, Lebesgue measure on X is then quasi-invariant with multiplier
J / x(x). Under mild conditions an invariant measure on X^ is then given by

d(x) = J ^ J O J W ) .

(2.34)

Here J / % denotes the Jacobian determinant of the mapping (g) of X. onto itself
determined by gG and (z,u) constitutes an orbital decomposition of x, i.e.
(z,u) is a one-to-one transformation of x such that uX^ and u is maximal
invariant while zG and x=zu. For a more detailed discussion see section 3
and appendix 1.
Transformation models. Let G be a group acting on the sample space
)L

If the class P^ of probability measures given by the statistical model is

invariant under the induced action of G on the set of all probability measures
on X then the model is called a composite transformation model and if IP

Differential and Integral Geometry in Statistical Inference

111

consists of a single orbit we use the term transformation model. For a


composite transformation model, G acts on P^ and we may, of course, equally
think of G as acting on the parameter space . A parameter (function) which
is maximal invariant under this action is said to be an index parameter.
Virtually all composite transformation models of interest have the property
that after minimal sufficient reduction (and possibly after deletion of a null
set from X) there exists a sub-group K of G such that K is the isotropy group
for a point on every one of the orbits of X_ and of . Each of these orbits is
then isomorphic to the homogeneous space G/K = {gK:gG} of left cosets of K.
For a transformation model the information measures (2.32) and
(2.33) are invariant measures relative to the action of G on induced from the
action of G on X via the maximum likelihood estimator , which is an equivariant
mapping from _X to . This action is the same as the above-mentioned action of
G on P and also the same as the natural action of G on G/K .
It follows that relative to information measure on the formula
(2.7) for the conditional distribution of is simply cL. From this it may be
shown that, with the auxiliary a as the maximal invariant statistic, p*(,|a)
is exactly equal to p(;|a).
These results are shown in outline in Barndorff-Nielsen (1983). A
more general statement will be derived in section 5.
Exponential models. A (k,d) exponential model has model function of
the form
p(x ) = exp{() t(x) - (()) - h(x)}.

(2.35)

Here k is the order of the model (2.35) and is equal to the common dimension
of the vectors () and t(x), while d denotes the dimension of the parameter .
The full exponential model generated by (2.35) has model function
p(x ) = exp t(x) - () - h(x)}

(2.36)

and () is the cumulant transform of the canonical statistic t = t(x). From


the viewpoint of inference on there is no restriction in assuming x = t,
since t is minimal sufficient, and we shall often do so. We set = () = E t,

112

0. E. Barndorff-Nielsen

i.e. is the mean value parameter of (2.36), and we write T for (int)
where denotes the canonical parameter domain of the full model (2.36).
Let f be a real differentiate function defined on an open subset
k

of R .

The Legendre transform f T of f is defined by


f (y) = y-f()

where

y = (Df)(x) =fj(x) .
The Legendre transform is a useful tool in studying various, dualistic aspects
of exponential models (cf. Barndorff-Nielsen (1978a), Barndorff-Nielsen and
Blaesild (1983a)).
In particular, we may use the Legendre transform to define the
dual likelihood function 1 of (2.35) by
-1
1 () = e () - l(()).

(2.37)

Here, and elsewhere, ' as top index indicates maximum likelihood estimation
under the full model. Further, in this connection we take 1 as the sup-loglikelihood function of (2.36) and then 1 is, in fact, the Legendre transform of
K.

Note that for = () T we have l() = - (). An inference

methodology, parallel to that of likelihood inference for exponential families,


may be developed from the dual likelihood (2.37). The estimates, tests and
confidence regions discussed by Amari and others under the name of = -1 (or
mixture) procedures are, essentially, part of the dual likelihood methodology.
More generally, based on Amari's concepts of -geometry and

divergence, one may for each [-l,l] introduce an "-likelihood" L by


L() = L( t) = exp{-D(,())}
where

Here p(x ) is given by (2.36) and the function f is defined as

(2.38)

Differential and Integral Geometry in Statistical Inference

x log x,
f (x) = - ^ { - x
1-
-log X,

113

= 1
(1+)/2

},

-l<<l .

(2.40)

= -1

Letting 1 = log L we have, in particular,


1
l() = l() = -I(,) = t - () - (t)
and

-1
l() = -I(,) = - f() - ()

(2.41)

(2.42)

where I denotes the discrimination information. Furthermore, for -l<<l,

-*l*) +2^() - ( ^ + ^ ) }
4
K) = -K [e 2
2
2
_>
-

Affine subsets of are simple from the likelihood viewpoint while,


correspondingly, affine subsets of T are simple in dual likelihood theory. Dual
affine foliations, of and T respectively, are therefore of some particular
interest. Such foliations have been studied in Barndorff-Nielsen and Blaesild
(1983a), see also Barndorff-Nielsen and Blaesild (1983b).
Suppose that the auxiliary component a of (,a) is approximately or
exactly distribution constant, i.e. a is ancillary. For instance, a may be the
affine ancillary or the directed log likelihood ratio statistic, as defined in
Barndorff-Nielsen (1980, 1986b). We may think of the partitions generated,
respectively, by a and as foliations of T, to be called the ancillary
foliation and the maximum likelihood foliation.

(Amari's ancillary subspaces

are then, in the present terminology and for = 1, leaves of the maximum likelihood foliation.)
Exponential transformation models. A model M^which is both transformational and exponential is called an exponential transformation model. For
such models we have the following structure theorem (Barndorff-Nielsen,
Blaesild, Jensen and Jorgensen (1982), Eriksen (1984b)).
Theorem 2.1. Let M be an exponential transformation model with

114

0. E. Barndorff-Nielsen

acting group G. Suppose X^ is locally compact and that t is continuous. Furthermore, suppose that G is locally compact and acts continuously on X_.
Then there exists, uniquely, a k-dimensional representation A(g) of
G and k-dimensional vectors B(g) and B(g) such that
t(gx) = t(x)A(g) + B(g)

(2.43)

(g) = ejAg" 1 )* + &(g)

(2.44)

where eG denotes the identity element. Furthermore, the full exponential model
generated by M_ is invariant under G, and &* = {[A(g~ )*,&(g)]: gG} is a group of
affine transformations of R leaving and into invariant in such a way that
(gP) = (P)A(g'V + B(g),

gG, PlP .

Dually, G = [A(g),B(g)]:gG} is a group of affine transformations leaving


C = cl conv t( X_ ) as well as T = (int) invariant. Finally, let 6 be the
function given by
(g) = a((e))a((g))~1exp(-(g).B(g)).

(2.45)

a((gP)) = a((P))(g)"1exp(-(gP).B(g)).

(2.46)

We then have

Exponential transformation models that are full are a rarity.


However, important examples of such models are provided by the family of Wishart
distributions and the transformational submodels of this.
In general, then, an exponential transformation model M is a curved
exponential model. It is seen from the above theorem that the full model M_
generated by f^ is a composite transformation model and that, correspondingly,
P4 (and, hence and T) is a foliated manifold with M as a leaf. It seems of
interest to study how the leaves of this foliation are related geometricstatistically. Exponential transformation models of type (k,d), and in particular those of type (2,1), have been studied in some detail by Eriksen (1984a,c).
In the first of these papers the Jordan normal form of a matrix is an important
tool.

Differential and Integral Geometry in Statistical Inference

115

Many of the classical differentiable manifolds with their associated


acting Lie groups are carriers of interesting exponential transformation models.
Instances of this are compiled in table 2.1.
Analogies between exponential models and transformation models.
There are some intriguing analogies between exponential models and transformation models.
Example 2.3. Under a d-dimensional location parameter model, with
as the location parameter and for a fixed value of the (ancillary) configuration statistic, the possible score functions are horizontal translates of each
other.
On the other hand, under a (k,d) exponential model, with as a
component of the canonical parameter and provided the complementary part of the
canonical statistic is a cut, the possible score functions are vertical translates of each other. (For details, see Barndorff-Nielsen (1982)).
Example 2.4. Suppose is one-dimensional. If is the location
parameter of a location model then the correction term C,, given by (2.10),
takes the simple form
i
(4) j ( 3 ) 2
}
{3
+5
l1 "24
2 V " ^ 3

Exactly the same expression is obtained for a (1,1) exponential


model with as the canonical parameter.
(This was noted in Barndorff-Nielsen and Cox (1984)).
Maximum estimation. Suppose that for a certain class of models we
have an estimation procedure according to which the estimate of is obtained
by maximizing a positive function M = M() = M( x) with respect to . Let
m = log M and suppose that
c = -[3 3 s m]()

(2.47)

is positive definite. We shall then say that we have a maximum estimation procedure. Maximum likelihood estimation and dual maximum likelihood estimation
-1
(where m() = l() = () - l(), cf. (2.37)) are examples of this. More

116

0. E. Barndorff-Nielsen

generally, minimum contrast estimation, as discussed by Eguchi (1983), is of


this type.
Suppose that M depends on x through the minimal sufficient statistic only and let a be an auxiliary statistic such that (,a) is minimal sufficient. In generalization of (2.7) we may consider
p*(2f;|a) = c\H\\/,
as a possible approximation to p(;|a).

(2.48)

Here t = L() and c is a norming

constant, determined so as to make the integral of the right hand side of


(2.48) with respect to equal to 1.
It will be shown in section 5 that (2.48) is exactly equal to
p(;|a) for a considerable range of cases.
Finally, it may be noted that by an argument of analogy it would
seem rather natural to consider the modification of (2.48) in which the function M is substituted for the likelihood function L. While this approach is
not without interest its general asymptotic degree of accuracy is only 0(n )
in comparison with 0(n~-1 ) or 0(n~-3/2
' ) for (2.48). Also, for transformation
models this modification is exact in exceptional cases only.

Differential and Integral Geometry in Statistical Inference

X
4

a i

CD

r-4 ^- O

0)

CO
H

H
O
-t

CQ
CD u
CO

0
>

id

CO
H

U
H

en

n
x

II

P -H

CO

CO

o
0 Q)

O
H

g -P 4J

o
en

O
in

O
CO

11 If
CD

ft

5
o

rH CO

H W H
M V4
+J TJ P

CQ U

<D
H
4J
CO

117

3. TRANSFORMATION MODELS

Transformation models were introduced in section 2. For any xX.


the set Gx = {gx:gG} of points traversed by x under the action of G is termed
the orbit of x. The sample space ) M s thus partitioned into disjoint orbits,
and if on each orbit we select a point u, to be called the orbit representative,
then any point x in X^can be determined by specifying the representative u of
Gx and an element zG such that x = zu. In this way x has, as it were, been
expressed in new coordinates (z,u) and we speak of (z,u) as an orbital decomposition of x.
The orbit representative, or any one-to-one transformation thereof,
is a maximal invariant - and hence ancillary - statistic, and inference under
the model proceeds by first conditioning on that statistic.
The action of G on a space Xjs said to be transitive if ^consists
of a single orbit and free if for any pair g and h of different elements of G
we have gx j hx for every xX^. Note that after conditioning on a maximal
invariant statistic u we have a transitive action of G on the conditional sample
space. For any xX^ the set G = {g:gx = x) is a subgroup, called the isotropy
group of x. The space Xjs said to be of constant orbit type if it is possible
to select the orbit representatives u so that G is the same for all u.
The situation is particularly transparent if the action of G on the
sample space _X is free. Then for given x and u there is only one choice of zG
such that x = zu, and _X is thus representable as a product space of the form
U x G where U is the subset of X, consisting of the orbit representatives u.
Note that u and z as functions of x are, respectively, invariant and equivariant
118

Differential and Integral Geometry in Statistical Inference

119

i.e.
u(gx) = u(x),

z(gx) = gz(x).

It is c "ten feasible to construct an orbital decomposition by first finding an


equivariant mapping z from _X onto G and then defining the orbit representative
u for x bv
u = z~ x.
In particular, the maximum likelihood estimate g of g is equivariant, and may be
used as z provided g(x) exists uniquely for eyery x_X and g(X) = G. In this
case, G's action on P^ must also be free.
However, we shall need to treat more general cases where the actions
of G on X_ and on P_ are not necessarily free.
Let H and K be subsets of G. We say that these constitute a
factorization of G if G is uniquely factorizable as
G = HK
in the sense that to each element gG there exists a unique pair (h,k)HK such
that g = hk. We speak of a left factorization if, in addition, K is a subgroup
of G, and similarly for right factorization. If a factorization is both left
and right then G is said to be the product of the groups H and K. An important
example of such a product is afforded by the well-known unique factorization of
a regular n x n matrix A into a product UT of an orthogonal matrix U and a
lower triangular matrix with positive diagonal elements, i.e., using standard
notations for matrix groups, GL(n) is the product of 0(n) and T + (n).
A relevant left factorization is often generated in the following
way. Let P be a member of the family P^ of probability measures for a transformation model M_, and let K be the isotropy group G p , i.e.
K = {gG gP = P}.
For each PP^ we may select an element h of G such that P = hP, and letting H be
the set consisting of these elements we have a (left) factorization G = HK.
(In a more technical wording, the elements h are representatives of the left
]

cosets of K.) Note that G p = hG p h" , and that the action of G on P is free if

120

0. E. Barndorff-Nielsen

and only if K consists of the identity element alone. The quantity h parametrizes _P.
Suppose G = HK is a factorization of this kind. For most transformation models of interest, if the action of G on X. is not free then there exists
an orbital decomposition (z,u) of x with zH and such that for every u the isotropy group G equals K and, furthermore, if z and z 1 are different elements of
H then zu f z'u.
Example 3.1. Hyperboloid model. This model (Barndorff-Nielsen
(1978b), Jensen (1981)) is analogous to the von Mises-Fisher model but pertains
to observations x on the unit hyperboloid H^"1 of R k , i.e.

k-1 = {x:x*x = 1, x >0}


Q

where x = (XQ,X,,... ,x. ,) and * denotes the non-definite scalar product of


vectors in R which is given by
*y = o y o - 1 y 1 - . . . - k _ 1 y k _

The analogue of the orthogonal group 0(k) is the so called pseudoorthogonal group 0(1,k-1), which is the subgroup of GL(k) with matrix representation
0(1,k-1) = {U:ll* 1 U = 1}
where t denotes the k x k diagonal matrix
1 0

0 -1

. -1

For k = 4 this is the Lorentz group of relativistic physics. Topologically,


the group 0(1,k-1) has four connected components, of which one is a subgroup of
0(1,k-1) and is defined by

Differential

and Integral Geometry in Statistical Inference

121

SO (l,k-l)

(the elements of U are denoted by u.., i and j = 0,1,...,k-l). This subgroup


J
is called the special pseudo-orthogonal group and it acts on Hk 1by (U,x)
k-1

(vector-matrix multiplication).

The points of H

can be expressed in hyper-

bolic-spherical coordinates as
XQ = cosh u
x. = sinh u cos v,
x 2 = sinh u sin v, cos v ?

x. 1 = sinh u sin v 1 ... sin v k _ 2 ,


k-1
and an invariant measure on H

, relative to the action of SO (l,k-l), is

specified by
d = sinh k " 2 u sin k " 3 v 1 ... sin v k _ 3 dudv ] ... dv k _ 2
(3.1)
The hyperboloid model function, relative to the invariant measure
(3.1) on H k ~ \ is
p(x;,) = a k ()e" * x

(3.2)

where the parameters and , called the mean direction and the precision,
satisfy Hk-1 and >0, and where
a k () =

k/2

- /{(2)

k/2

' 2K k / 2 _ 1 ()}

(3.3)

with Kk/2_- a Bessel function.


For any fixed , the hyperboloid distributions (3.2) constitute a
transformation model under the action of SO (l,k-l), and the induced action on
the parameter space is (U,) > U* (vector-matrix multiplication). The isotropy
group K of the element = (1,0,...,0) may be identified with SO(k-l). Furthermore, SO*(l,k-l) can be factored as
SO*(l,k-l) = HK = H SO(k-l)

122

0. E. Barndorff-Nielsen

where the matrix representation of hH is

Vl
X

1 + l+x

1X2

Vk-1
1+X

h=

1+

Vl

k-1 x 1
1+Xn

k-l x 2
1+x 0

for x = (xQ,x-|,...,xk->1) varying over H

2 X k-1

(3.4)

. 1 + k-1

1+Xn

. In relativity theory a Lorentz

transformation of the type (3.4) is termed a "pure Lorentz transformation" or


a "boost."

(It may be noted that SO f (l,k-l) can equally be factored as KH with

the same K and H as above.)


We have already mentioned the concept of equivariance of a mapping
from .X onto G. More generally, if s is a mapping of X_ onto a space S and if
s(x) = s(x') implies s(gx) = s(gx') for x,x'_X and all gG then s is said to be
equivariant.

In this case we may define an action of G on S by gs = s(gx)

for s = s(x) and for any x_X, and we speak of this as the action induced by s.
In the applications to be discussed later S is typically the parameter domain
under some parametrization of the model and s is the maximum likelihood estimator, which is automatically equivariant.
We are now ready to state the results which constitute the main
tools of the theory of transformation models.
Subject to mild topological regularity conditions (for details, see
Barndorff-Nielsen, Blaesild, Jensen and Jorgensen (1982)) we have
Lemma 3.1. Let u be an invariant statistic with range space U =
u U ) , let s be an equivariant statistic with range space S = s O O , and assume
that the induced action of G on S is transitive. Furthermore, let be

Differential and Integral Geometry in Statistical Inference

123

invariant measure on X_. Then, we have (s,u)QQ = S x U and


(s,u)y = v >< p

where v is an invariant measure on S and p is some measure on U.


Suppose r, s and t are statistics on X_ (in general vector-valued).
The symbol r i s|t is used to indicate that r and s are conditionally independent given t.
Theorem 3.1. Let the notations and assumptions be as in lemma 3.1,
and suppose that the transformation model has a model function p(x g) relative
to an invariant measure on X such that p(x) = p(x e) is of the form
p(x) = q(u)r(s,w)

(3.5)

for some functions q and r and some invariant statistic w which is a function
of u.
Then the following conclusions are valid.
(i) The model function p(x g) is of the form
p(x g) = q(u)r(g" s,w),

(3.6)

and hence the statistic (s,w) is sufficient.


(ii) We have
s i u|w.
(iii) The invariant statistic u has probability function
p(u) = q(u)/r(s,w)dv(s)

<p>

(3.7)

(where v is invariant measure on S).


(iv) The conditional probability function of s given w is
p(s;g|w) = c(w)r(g" s,w) <v>

(3.8)

where c(w) is a norming constant.


It should be noted that the theorem covers the case where no sufficient reduction is available (take q constant and w = u) as well as the case
where s - typically the maximum likelihood estimator - is sufficient (take w
degenerate). Note also that theorem 3.1 does not assume that the action of G
is free. If, however, the action is free and if (z,u) is an orbital decomposition of x then the theorem applies with s = z.

1 2 4

0. E. Barndorff-Nielsen

Example 3.2. Hyperboloid model (continued). Let x,,...,x be a


sample from the hyperboloid distribution (3.2) and let x = (x,,...,x ) and
x + = X-.+ ... +x . Considering as fixed, theorem 3.1 applies with u as the
maximal invariant statistic, s = x + // x + *x + and w = / x + *x + . In particular,
it turns out that the conditional distribution of s given w (or, equivalently,
given u) is again a hyperboloid distribution, with mean direction and precision w. This is in complete analogy with the von Mises-Fisher situation,
and accordingly s and w are termed the mean direction and the resultant length
of the sample. For details and further results see Jensen (1981) and BarndorffNielsen, Blaesild, Jensen and Jorgensen (1982).
Lemma 3.1 and theorem 3.1 are formulated in terms of invariant
dominating measures on X^ and S. In applications, however, the probability functions are ordinarily expressed relative to Lebesgue measure - or, more generally, relative to geometric measure when the underlying space is a differentiate
manifold.

It is therefore important to have a formula which gives the relation

between the two types of dominating measure.


Let be an action of G on a space Y_ and suppose _Y has constant
orbit type under this action. Then there exists a subgroup K of G, a subset H
of G and an orbital decomposition (z,u) of y ^ such that G u = K and zH for
every y. We assume that H can be chosen so that HK constitutes a (left)
factorization of G. If X is a differentiate manifold and if acts differentiably on X then an invariant measure o n Y can typically be constructed from
geometric measure on _Y, by means of Jacobians. In particular, if X is an
r

open subset of some Euclidean space R , so that is Lebesgue measure, then


defined by
O(z)(ud(y)

(3.9)

will be invariant; here J / \ denotes the Jacobian determinant of the mapping


(g) of X onto itself. A proof of this is sketched in appendix 1.
Example 3.3. Hyperboloid model (continued). We show here how the
invariant measure (3.1) on the unit hyperboloid H k " 1 may be derived from

Differential and Integral Geometry in Statistical Inference

125

For simplicity, suppose k = 3. The manifold H 2 is in one2


to-one smooth correspondence with R through the mapping
Lebesgue measure.

and we start by finding an invariant measure on R . The action of S0 f (l,2) on


2
?
H is given by (U,x) -> xU* and the induced action on R is therefore of the
form (U,y) -> (" (y)U*). These actions are transitive, and if we take
2
u = (0,0) as the orbit representative of R and let z be the boost
y0

yR

++

y2

1 +

z=

y1

y,

(3.10)

2
2
y-i + Yo then (u,z) constitutes an orbital decomposition of

of the type required for the use of formula (3.9).

Letting denote the


2
2
action of S0 (l,2) on R one finds that J'( z )(u) = l/l + y 2 + y 2 and hence the

measure
dy(y) =
p

is an invariant measure on R . Shifting to hyperbolic-spherical coordinates


(u,v) for (y , 5 y ? ) this measure is transformed to (3.1) with k = 3.
Below and in sections 4 and 5 we shall draw several important conclusions from lemma 3.1 and theorem 3.1. Various other applications may be
found in Barndorff-Nielsen, Blaesild, Jensen and Jorgensen (1982).
Corollary 3.1.

Let G = HK be a left factorization of G such that

K is the isotropy group of p. Thus the likelihood function depends on g through


h only.

Suppose theorem 3.1 applies with S = H and let L(h) = L(h x) be any

version of the likelihood function. Then, the conditional probability


function of s given w may be expressed in terms of the likelihood function as

12 6

0. E.

Barndorff-Nielsen

p(s;h|w) = c(w) {[]

<v> .

(3.11)

In formula (3.11) the likelihood function changes with the value of


s. However, an alternative expression for the conditional probability function
is available which employs only the single observed likelihood function. Suppose for simplicity that K consists of the identity element alone, so that
S = G. Further, let XQ denote the observed point in X^ and write Ln(g) for
L(g;x 0 ). Also, for specificity, let the action of G on S = G be the so called
left action of G on itself, i.e. a gG acts on a point sS simply by multiplying s on the left by g, in the group theoretic sense. (Thus, the two possible
interpretations of the symbol gs coincide). The situation here specified
occurs, in particular, if the action of G on X is free and if s is the group
component of an orbital decomposition of x. Setting s Q = s(x Q ) and w~ = w(x Q ),
we are interested in the conditional distribution of s given w = w Q and by
(3.6) and (3.11) this may be written as
P(s;g|w o ) = c(w 0 )

L Q (s s~ g)
L Q ( S Q )

<> ,

the invariant measure being denoted here by , as a standard notation for left
invariant measure on G. This formula, which generalizes a similar
expression for the location-scale model due to Fisher (1934), shows how the
"shape and position" of the conditional distribution of s is simply determined
by the observed likelihood function and the observed s Q , respectively.
Formula (3.11), however, besides being slightly more general, seems
more directly applicable in practice.

4.

TRANSFORMATIONAL SUBMODELS

Let M be a transformation model with acting group G.

If P~ is any

of the probability measures in M^ and if G Q is a subgroup of G then P~ =


{9Pg : 9G 0 } defines a transformation submodel M~ of M_. For a given G n the collection of such submodels typically constitutes a foliation of M^.
Suppose G is a Lie group, as is usually the case. The one-parameter
subgroups of G are then in one-to-one correspondence with TG , the tangent
space of G at the identity element e, and this in turn is in one-to-one correspondence with the Lie algebra of left invariant vector fields on G. More
generally, each subalgebra h^ of the Lie algebra of G determines a connected
subgroup H of G whose Lie algebra is ] (cf., for instance, Boothby (1975) chapter 4, theorem 8.7). If ATG , the one-parameter subgroup of G determined by
A is of the form {exp(tA):tR}. In general, the subgroup of G determined
by r linearly independent elements A,,...,A

of TG may be represented as

Example 4.1. Let M be a location-scale model,


p(x 1 5 ...,X n ;,) = "
1

f(" (x.-)).
]
i=l

(4.1)

Here G is the affine group with elements [,] which may be represented by
2 2 matrices
1

JJ

the group operation being then ordinary matrix multiplication. The Lie algebra
of G, or equivalently TG , is represented as the set of 2 x 2 matrices of the
127

128

0. E. Barndorff-Nielsen

form
A=

We have
e t A = I + tA + 2T t 2 A 2 +.

b/a(e t a -l) e t a
where the last expression is to be interpreted in the limiting sense if a = 0.
There are therefore four different types of submodels. Specifically, letting U Q ^ Q )

denote an

arbitrary value of (,) and taking P Q as the

corresponding measure (4.1) we have


(i) If a = 0 then P~ is a pure location model.
(ii) If a f 0, b = 0 and Q = 0 then P is a pure scale model.
(iii) If a f= 0, b = 0 and Q f 0 then M~ may be characterized as
the submodel of M_ for which the coefficient of variation / is constant and
equal to VQ/OQ
(iv) If both a and b are different from 0 then F> may be characterized as the submodel PL of M for which " (+b/a) is constant and equal to
C

' e

if we let c = b/a then NL is determined by


" (+c) = c 0 .

(4.2)

Letting F denote the distribution function of f we can express (4.2) as the


condition that (,) is such that -c is the F(-co)-quantile of the distribution
The above example is prototypical in the sense that G is generally
a subgroup of the general linear group GL(m) for some m and TG may be represented as a linear subset of the set M(m) of all m x m matrices.
Example 4.2. Hyperboioid model. The model function of the hyperboloid model with k = 3 and a known precision parameter may be written as

Differential and Integral Geometry in Statistical Inference

p(u,v; x ,) = (2)-1esinh u e " { c o s h *

c o s h

"sinh*

s i n h

129

The generating group G = S0 f (l;2)

where u > 0, v[0,2) and x > 0, [0,2).

may be represented as the subgroup of GL(3) whose elements are of the form

0
cos
-sin

where -><<->.

cosh

sinh

sin j I sinhx

cosh x

COS

, 2

(4.4)

This determines the so called Iwasa decomposition (cf., for

instance, Barut and Raczka (1980) chapter 3) of S0*(l;2) into the product of
three subgroups, the three factors in (4.4) being the generic elements of the
respective subgroups.

It follows that TG

is the linear subspace of M(3) gen-

erated by the linearly independent elements

I 0

' 0

1 , E2 = j 1
!

0 - 1 0

1
1 I

-1

Each of the three subgroups of the Iwasawa decomposition generates


a transformational foliation of the hyperboloid model given by (4.3), as discussed in general terms above.

In particular, the group determined by the

third factor in (4.4) yields, when applied to the distribution (4.3) with
X = = 0, the following one-parameter submodel of the hyperbolic model:
p(u,v;)
(2

~(

c o s h

^sinh u

"Snh

c o s

) ~ 2 ^s i n h

The general form o f t h e one-parameter subgroups o f SO ( 1 ; 2 ) i s


0
exp{t

aa

b ~i

I
where a, b, c are fixed real numbers.

-c

s i n

5. MAXIMUM ESTIMATION AND TRANSFORMATION MODELS

We shall be concerned with those situations in which there exists an


invariant measure y on X that dominates P_9 where P^ = {gP:gG} is transformational. Letting
^~M

= P( g)

and writing p(x) for p(x e) we have


p( g) = p(g" x) <u>.
In most cases of interest the model has the following additional structure (possibly after deletion of a null set from _X , cf. also section 3). There exists
a left factorization G = HK of G, a K-invariant function f on X_, and an orbital decomposition (fr,u) of x such that:
(i) G = K for all u and, furthermore, G p = K. Hence, in particular, H may be viewed as the parameter space of the model.
(ii) For eyery x_X the function m(h) = f(h" x) has a unique maximum
on H and the maximum point is fr.
(iii) H may be viewed as an open subset of some Euclidean space R
and for each fixed xX_ the function m is twice continuously differentiate on H
and the matrix * = ^(h) given by

is positive definite.
In these circumstances we have:
Proposition 5.1. The maximum estimator ft is an equivariant mapping
130

Differential and Integral Geometry in Statistical Inference

131

of _X onto H and the action of G on H induced by i coincides with the natural


action of G on H. Furthermore, if the mapping x > (F,u) is proper then there
exists an invariant measure v on H, and for any fixed u such a measure is given
by
dv(h) = j-fe I ^dh

(5.1)

where dh indicates the differential of Lebesgue measure on H.


Here H is considered as an open subset of R , in accordance with
(111).
Proof. The equivariance of h follows immediately from (ii). Obviously, there is a one-to-one correspondence between the family of left cosets
G/K = {gK:gG} and H. Let p be the mapping from G/K to H which establishes this
correspondence. The natural action of G on G/K is given by
G x G/K ^ G/K
:
(g.gK) -> ggK
and we have to show that when this action is transferred to H by p it coincides
with the action of G on H induced by ft. In other words, we must verify that
for any gG the diagram
G/K

(g)

I (g)

|
G/K

(5.2)

H
P

commutes. Let be the mapping from G to H that sends a gG into the uniquely
determined hH such that g = hk for some kK. For any fr = (x) in H we have
that (g)fr = (gx) is determined by
f({V(g)}" gx) > f(h - 1 gx),

hH.

(5.3)

Now, by the K-invariance of f,


f(h

-1

gx) = f U g ^ h ^ x ) = f((g" h)" x)

and here (g" h) ranges over all of H when h ranges over H. Hence (5.3) may be
rewritten as
f(h" ),

hH,

132

0. E. Barndorff-Nielsen

i . e . , by ( i i ) ,

= n(rt(gx))
or,

equivalently,

() =
and this, precisely, expresses the commutativity of (5.2), since p~ (h) = hK.
When the mapping x -> (f,u) is proper the subgroup K is compact
because K = G . Hence there exists an invariant measure on H, cf. appendix 1.
That |tfpdh is such a measure follows from (3.9) and formula (5.10) below.
In particular, then, there is only one action of G on H at play,
namely , and
(g)h = (gh).

(5.4)

Now, let h > be an arbitrary reparametrization of the model and


let m() = m(h()) and

,2
*() =*(;tl) = - J L j L ( ; h u ) .

(5.5)

This matrix is a (0,2) tensor on .


We shall now show that
-k(h) =*(h;u) = J.
(e)~**(e;u)J
(e)" 1 .
(h)
(h)

(5.6)

Here the unit element e is to be thought of as a point in H.


We have
m(h) = f(h" ] x) = f(h"u) = f ({ ( h) u)
where, again, we have used the K-invariance of f. Thus, with as the projection mapping defined above we obtain

Mjp(h) _ M i

((rh))

^ h ) * (h)

(5.7)

and
2

3 m(h;x) (h) . 3( h) (h) ( h u) /./ft-lh^ 3n( h)* fh .


(h) .

(5.8)

Differential and Integral Geometry in Statistical Inference

133

In these expressions we have, since ( h) = (i~)h, that


(5.9)
On inserting ft for h in (5.7), (5.8) and (5.9) (whereby (5.7) becomes 0) and
combining with (2.1) we obtain (5.6).
From (5.6) we may draw two important conclusions.
First, taking determinants we have
5

ne',u)\h

(5.10)

and this, by (3.9) and the tensorial nature of *, implies that j-RfJl^d is an
invariant measure on . In connection with formula (5.10) it may be noted that
J

'(h) ( e ) = J ( h ) ( e )

where 6 denotes left action of the group G on itself. A proof of this latter
formula is given in appendix 2.
Secondly, the tensor -K() is found to be G-invariant, whatever the
value of the ancillary.

In fact, by (5.4) we have, for any h Q H and gG,


((g)h)h0 = (g) o (h)h Q .

Consequently
- (g)
and this together with (5.6) and (2.26) establishes the invariance.
In particular, observed information ^determines a G-invariant
Riemannian metric on the parameter space. The expected information metric i
can also be shown to be G-invariant.
From proposition 5.1 and corollary 3.1 we find
Corollary 5.1. The model function p*(*,|u) = c|<|\/t' is exactly
equal to p(;|u).
By taking m of (ii) equal to the log likelihood function 1 this
corollary specializes to theorem 4.1 of Barndorff-Nielsen (1983).
Suppose, in particular, that the model is an exponential transform-

134

0. E. Barndorff-Nielsen

ation model. Then the above theory applies with m() = l(). The essential

-j

property to check is that l(;t(x)) is of the form f(h x). This follows simply

from the definition of 1 and theorem 2.1.

6.

OBSERVED GEOMETRIES

In section 2 we briefly reviewed how the parameter space of the


model f^ may be set up as a manifold with expected information i as Riemannian
metric tensor and with an associated family of affine connections, the -connections (2.30). We shall now discuss a similar type of geometries on the
parameter space, related to observed information and depending on the choice of
the auxiliary statistic a which together with the maximum likelihood estimator
constitutes a minimal sufficient statistic for fi. These latter geometries
are termed observed geometries (Barndorff-Neilsen, 1986a).

In applications to

statistical inference questions it will usually be appropriate to take a to


be ancillary but a great part of what we shall discuss does not require distribution constancy of a and, unless explicitly stated otherwise, the auxiliary a is considered arbitrary (except for the implicit smoothness properties).
Let an auxiliary a be chosen. We may now take partial derivatives
of 1 = l(;,a) with respect to the coordinates of as well as with respect
to . Letting a = a/3 we introduce the notation

r s

= 9

rr..rp,s..sq

\ h

(6J)

r r . . rp s r . . sq

and refer to these quantities as mixed derivatives of the log model function.
The function of and a obtained from (6.1) by substituting for will be
. Thus, for instance,
denoted by \
r..rp;s..s
=

*rs;t

(;a)

More generally, for any combinant g of the form g(;,a) we write


135

136

0. E. Barndorff-Nielsen

-? = -f(;a) = g(;,a).

This is in consistency with the notation # introduced by (2.6). The observed


geometries, to be discussed, are expressed in terms of the mixed derivatives
(6 2 )

\
r s
s
r..rpSs..sq
So are the terms of an asymptotic expansion of (2.7), cf. section 7.

Given the observed value of a the observed information tensor ^, of


(2.6), defines the parameter space of M^ as a Riemannian manifold. The Rieman-

nian connection determined by a- has connection symbols -t


F given byt
& =

rst * ''"At " Vrs * Vrt'

Employing the notation established above we have 9.6- = -*c + -Jc. +9 etc.
u rs
rsu rs,u
so that

As we shall now show, the quantity


p

= _(}

+ > . t [3])

(6.4)

is a covariant tensor of rank 3, i.e.


*p "T rst /p / /

(6

' 5)

First, from (2.14) we have

/p / /

rS /P /

[3]

'

Further, from (2.13) we obtain, on differentiating with respect to and then


substituting parameter for estimate,
V,

= +

rs;t /p / /

Finally, differentiating the likelihood equation

we find

V,t /P /'

(6

'

Differential and Integral Geometry in Statistical Inference

137

*rs
or

Combination of (6.4), (6.6), (6.7) and (6.9) yields (6.5).


It follows from the tensorial nature of ? and from (6.3) and (6.9)

that for any real an affine connection ? on M^may be defined by


f

rs ~ * *Vsu

with

In particular, we have
1

-1

V,rs

(6J1)

where to obtain the latter expression we have used

which follows on differentiation of (6.8).


1
t rs

It may also be noted that

- 1 1
rts str

- 1
str rts

and
a

l.l

"1

The connections -f, which we shall refer to as the observed -con

nections, are analogues of the expected -connections r given by (2.30). The

analogy between r and -F becomes more apparent by rewriting the skewness tensor
(2.29) as
T

rst = "

E{1

rst

the validity of which follows on differentiation of the formula


E{1

rs + V s

}=

'

(6J2)

which, in turn, may be compared to (6.8).


Under the specifications of a of primary statistical interest one

138

0. E. Barndorff-Nielsen

has that, in broad generality, the observed geometries converge to the corresponding expected geometries as the sample size tends to infinity.
For (k,k) exponential models
p(x ) = a()b(x)e # t ( x )

(6.13)

no auxiliary statistic is involved since is minimal sufficient, and we find

j- = i and = r, R.
Let i,j,k,... be indices for the coordinates of , t and , using
upper indices for and lower indices for t and .
In the case of a curved exponential model (2.35), we have
\ = (t-).}

(6.14)

and, letting denote the maximum likelihood estimator of under the full model
generated by (2 35), the relation +
= ? takes the form
r, s rs

V , s ( ) = ij ( ) jr*/s
" 1J ( > /r /s - (*-^i
Furthermore,
-<1jk()/r/s/t
rst

i j

;rs^t=irst

(6.17)

and
( 61 8 )

^ rs-j^/t^/rs-'rsf

I t is also to be noted that, under mild regularity conditions, the quantities


& and ^possess asymptotic expansions the f i r s t terms of which are given by

and

rst " { jk ;rs /t /\ [ 3 ]

Wx

} a

'

(6 20)

Differential and Integral Geometry in Statistical Inference

139

where a , = l,...,k-d, are the coordinates of the auxiliary statistic a. For


instance, in the repeated sampling situation and letting a denote the affine
ancillary, as defined in Barndorff-Nielsen (1980), we may take a = n a and
the expansions (6.19) and (6.20) are asymptotic in powers of n"*5. (For further
comparison with Amari (1982a) it may be noted that the coefficient in the first
e
e
order correction term of (6.19) may be written as /ir s / i- j - ; = n^ s where H
is Amari's notation for the exponential curvature, or -curvature with = 1, of
the curved exponential model viewed as a manifold imbedded in the full (k,k)
model.)
For a transformation model we find
l (h;x) = T r
(cf. the more general formula (5.7)) and hence

(6

21)

(6.22)
where, for a r = a/ah and a r = a/ahr,
A^ = a s n r (h" 1 l
so that
S

" ~(h)

while
B

st = 3 s V

s;t

st

140

0. E. Barndorff-Nielsen

Furthermore, to write the coefficients of 1 , c ,.,(e;u) in (6.21) and (6.22) as


r s K*

indicated we have used the relation


vh"h)L
= -3 (h" h)| .
s
h=h s
h=h

(6.24)

Formula (6.24) is proved in appendix 3.


We now briefly consider four examples. In the first three the
model is transformational and the auxiliary statistic a is taken to be the maximal invariant statistic, and thus a is exactly ancillary. In the fourth example a is only approximately ancillary. Examples 6.1, 6.3 and 6.4 concern
curved exponential models whereas the model in example 6.2 - the location-scale
model - is exponential only if the error distribution is normal.
Example 6.1. Constant normal fractile. For known (0,l) and
c(-oo5oo)5 let N

denote the class of normal distributions having the real

"~Dt , C

number c as -fractile, i.e.


N . = {N(,2):(c-)/ = u },
where u denotes the -fractile of the standard normal distribution, and let
x,...,x be a sample from a distribution in N . The model for x = (x,,... ,x,J
1
n
,c
I
n
thus defined is a (2,1) exponential model, except for u = 0 when it is a (1,1)
model. Henceforth we suppose that u =)= 0, i.e. f h. The model is also a
transformation model relative to the subgroup G of the group of one-dimensional
affine transformations given by
G = [c(l - ),]:>0},
the group operation being
[c(l - ),][c(l - '),'] = [c( - '),']
and the action of G on the sample space being
[c( - ),](x r ...,x n ) = (c( - ) + x r ...,c(l - ) + x n ).
(Note that G is isomorphic to the multiplicative group.)
Letting
a = (x - c)/s',

Differential and Integral Geometry in Statistical Inference

141

where x = (x, +...+ x n )/n and


s

^^x.

x; ,

we have that a is maximal invariant and, parametrizing the model by = log ,


that the maximum likelihood estimate is
= log(bs')
where
b = b(a) = (u /2)a + / l + {(u / 2 ) 2 + l}a 2 .
Furthermore, (,a) is a one-to-one transformation of the minimal sufficient
statistic (x,s') and a is exactly ancillary.
The log likelihood function may be written as
l() = l ( ; , a ) = n[ - - ^ { b - 2 e 2 ( ^ ) + (u + a t f V " 5 ) 2 } ]
from which it is evident that the model for given

a is a location model.

Indicating differentiation with respect to and by subscripts


and , respectively, we find
l = n{-l + b - 2 e 2 ( ^ ) + ab" 1 (u + a t f V ^ e ^ }
and hence
= n{2b" 2 + ab" ] (u + 2ab" )}
^

n { 4 b

"2

a b

"1(

4 a b

"])

r = -n{4b"2 + ab" (u + 4ab" ] )} = *

.- = n{4b"^ + ab"'(u
j

+ 4ab"')} = -p= -^
ot

and the observed skewness tensor is


Jc = n{8b" 2 + 2ab" 1 (u + 43b" 1 )}.
Note also that

We mention in passing that another normal submodel, that specified

142

0. E. Barndorff-Nielsen

by a known coefficient of variation /, has properties similar to those exhibited by example 6.1.
Example 6.2. Location-scale model. Let data x consist of a sample
x,,...,x from a location-scale model, i.e. the model function is
n
p(x;,) = "

x.-

for some known probability density function f. We assume that {x:f(x)>0} is an


open interval and that g = -log f has a positive and continuous second order
derivative on that interval. This ensures that the maximum likelihood estimate
(,) exists uniquely with probability 1 (cf., for instance, Burridge (1981)).
Taking as the auxiliary a Fisher's configuration statistic
a = (a r ...,a n ) =

X -

X -

which is an exact ancillary, we find


3-(,) =

-2

V(a

a g"(a

a g"(a ) n+a2g"(a
and, in an obvious notation,

f '(a,)

= -"3{2n + 4zal2 g"(a.)


+ za?g"'(a.)}
i
l
i

-3
yy

Differential and Integral Geometry in Statistical Inference

143

3{4n

Kao "
Furthermore,

*Wo

Example 6.3. Hyperbooid model. Let (u-. ,v,),... , (u ,v ) be a


sample from the hyperboioid distribution (4.3) and suppose the precision is
known. The resultant length is
a = { ( cosh u ) - ( sinh u. cos v..) - ( sinh u^ sin v^) 2 }^
and a is maximal invariant after minimal sufficient reduction.

Furthermore,

the maximum likelihood estimate (,) of (,) exists uniquely, with probability 1, (a,,) is minimal sufficient and the conditional distribution of (,)
given the ancillary a is again hyperboloidic, as in (4.3) but with u, v and
replaced by , and a. It follows that the log likelihood function is
l(x) = K 5 ;x, 9 a) = -a{cosh cosh - sinh x sinh cos(-)}
and hence

= -F

XXX

XX

=f

= -F

XX

=0

A = a cosh x sinh
x

-F A = -a cosh x sinh ,
x
whatever the value of . Thus, in this case, the -geometries are identical.
We note again that whereas the auxiliary statistic a is taken so
as to be ancillary in the various examples discussed here - exactly distribu-

144

0. E. Barndorff-Nielsen

tion constant in the three examples above and asymptotically distribution constant in the one to follow - ancillarity is no prerequisite for the general
theory of observed geometries.
Furthermore, let a be any statistic which depends on the minimal
sufficient statistic t, say, only and suppose that the mapping from t to (,a)
is defined and one-to-one on some subset T~ of the full range X of values of t
though not, perhaps, on all of ]_. We can then endow the model M^ with observed
geometries, in the manner described above, for values of t in T~. The
next example illustrates this point.
The above considerations allow us to deal with questions of nonuniqueness and nonexistence of maximum likelihood estimates and nonexistence of
exact ancillaries, especially in asymptotic considerations.
Example 6.4. Inverse Gaussian - Gaussian model. Let x( ) and y( )
2
be independent Brownian motions with a common diffusion coefficient = 1 and
drift coefficients >0 and , respectively. We observe the process x( ) till it
first hits a level x>0 and at the time u when this happens we record the value
v = y(u) of the second process. The joint distribution of u and v is then
given by
p(u,v;,)

Suppose that (u, s v^),... ,(u ,v ) is a sample from the distribution


(6.25) and let t = (,v) where and v are the arithmetic means of the observations. Then t is minimal sufficient and follows a distribution similar to
(6.25), specifically
p(,v;y,)
= (2)" x o ne G" 2 e

(6.26)

Now, assume equal to . The model (6.26) is then a (2,1) exponential model,
still with t as minimal sufficient statistic. The maximum likelihood estimate
of is undefined if t^T^ where

Differential and Integral Geometry in Statistical Inference

145

IQ = it = (,v):x0 + v > 0}

fhereas for tT^, exists uniquely and is given by


-1
y = ^(x 0 + v) .

(6.27)

he event t^T^ happens with a probability that decreases exponentially fast with
he sample size n and may therefore be ignored for most statistical purposes.
Defining, formally, to be given by (6.27) even for t^T^ and leting
a = "(;2nxQ,2 n 2 ) ,
here ( ;x) denotes the distribution function of the inverse Gaussian disribution with density function
-(x .) = ( 2 ) " ^ e ^ x " 3 / 2 e - ^ x " 1 + * x >

(6.28)

e have that the mapping t -> (,a) is one-to-one from X = {t = (,v):>0} onto
-oo,+) x (0,oo) and that a is asymptotically ancillary and has the property
hat p*(;|a) =c|j p L approximates the actual conditional density of given
to order 0 ( n ~ 3 / 2 ) , cf. Barndorff-Nielsen (1984).
Letting ( ;x) denote the inverse function of "( ;,) we may
rite the log likelihood function for as

{ ( X Q+

-2
V) - U }

= n_(a;2nx 2 ,2n 2 ) {2-2}


rom this we find
= -2n (a;2nx 2 2n 2 )
o that
Xg 92n )

+
yy

nd

= 0

(6.29)

14 6

0. E.

^ =

8n

Barndorff-Nielsen

U ;2nx2,2n2)

^(" W O
1
=s

-1
= -h $

where ~ denotes the derivative of "(x;,) with respect to . By the wellknown result (Shuster (1968))
"(x;,) = ( V - hx'h) +
where is the distribution function of the standard normal distribution, "

could be expressed in terms of and = 1 .

7.

EXPANSION OF c l j l ^ L

We shall derive an asymptotic expansion of (2.7), by Taylor expansion of cIjI L in around , for fixed value of the auxiliary a. The various
terms of this expansion are given by mixed derivatives (cf. (6.2)) of the log
model function. It should be noted that for arbitrary choice of the auxiliary
statistic a the quantity c|j|E constitutes a probability (density) function on
the domain of variation of and the expansions below are valid. However,
c|j|[ furnishes an approximation to the actual conditional distribution of
given a, as discussed in section 2, only for suitable ancillary specification
of a.
To expand c|j| L in around we first write E as exp{l-} and
expand 1 in around . By Taylor's formula,
1-1=

VX
(-) ...(-) v (8 ...3
V>>
v=2
ll vv

l)()

whence, expanding each of the terms (d ...a l)() around ,


r

1-1
oo

v=2
.

X
O

vv

izJJ_ (u- ) ...(-)

(-) Sl ...(-) Sp 3. . . . 3 . \
O

..._ .

Consequently, writing for - and 6 "' for (-) (-) ..., we have

147

(7.1)

148

0. E. Barndorff-Nielsen

Next, we wish to expand log{|j|/|j| in around . To do this we observe


that if A is a d x d matrix whose elements a

depend on then

atlog|A| = |A ] 3 t |A|

where a

denotes the (r,s)-element of the inverse of A. Furthermore, using

which is obtained by differentiating a a u s = S with respect to and solving


for a S , we find
Vu

log|A| = -aVraSV\avvars

asl\Vrs.

It follows that

-" t U { * S + r s t u + + r s t ; u + * r s u ; t + + r s ; t u )
(7.3)
By means of (7.2) and (7.3) we therefore find
/ 2
) dd/2
= (2)
c d (-;al + A ] + A 2 + ...}

(7.4)

where .( a-) denotes the density function of the d-dimensional normal distribution with mean 0 and precision (i.e. inverse variance-covariance matrix) a- and
where
A

l - " V ^ W

\^

^St(+rs;t+ I *rst)

and
A2 = [- 3
1

rs t

r s t M vw u

(7

"5)

Differential and Integral Geometry in Statistical Inference

s;t

149

*rst;u

|^st)(+uv;w+|+uvw)],

(7.6)

A^ and A 2 being of order On""15) and 0(n ), respectively, under ordinary repeated sampling.
By integration of (7.4) with respect to we obtain
(2) d / 2 c = 1 + C 1 + ... ,

(7.7)

where C-. is obtained from A by changing the sign of A and making the substitutions

rstu

the 3 and 15 terms in the two latter expressions being obtained by appropriate
permutations of the indices (thus, for example, <s r s t u -> j - r s ^ t u + > r t ^ s u +

Combination o f ( 7 . 4 ) and ( 7 . 7 ) f i n a l l y

yields

c | j | ^ L = ( - ; ) { l + A1 + ( A g + C ^ + . . . }

(7.8)

with an error term which in wide generality is of order 0(n-3/2 ) under repeated
sampling.

In comparison with an Edgeworth expansion it may be noted that the

expansion (7.8) is in terms of mixed derivatives of the log model function,


rather than in terms of cumulants, and that the error of (7.8) is relative,
rather than absolute.
In particular, under repeated sampling and if the auxiliary statistic is (approximately or exactly) ancillary such that
3/2

p(;|a) = p*(;|a){l + 0(n" )}


(cf. section 2) we generally have

150

0. E. Barndorff-Nielsen

p(;|a) = d(-;*){! + A1 + (A2 + C,) + 0(n" 3 / 2 )}.

(7.9)

For one-parameter models, i.e. for d = 1, the expansion (7.8) with


A-., A 2 and C-, as given above reduces to the expansion (2.9). In BarndorffNielsen and Cox (1984) a relation valid to order 0(n-3/2 ) was established, for
general d, between the norming constant c of (2.7) and the Bartlett adjustment
factors for likelihood ratio tests of hypotheses about . By means of this relation such adjustment factors may be simply calculated from the above expression
for C-j.
Example 7.1. Suppose M_ is a (k,k) exponential model with model
function (6.13). Then the expression for C . takes the form
_ 1 r0
rs tu
/o ru sv tw ,o rs tu vw 1
l " 24{ 3 rstu " rst uvw ( 2 + 3 ) }

Cr

where, for a r = a/8r and () = -log a(),

rs... = V s ( )

and where S is the inverse matrix of K .


From (7.8) we find the following expansion for the mean value of :
t

+ . + y^ +

where y? is of order 0(n" ), y is of order 0(n ), and

. .r.St,

,.r.St"

l " '** * + r;st = -1* *

/-,1 N

(7J0)

^str

Hence, from (7.8) and writing 1 for -,,


d( - -

= d( - - ;j) + ^ S t ( ' ;) (^$.t + | + r s t ) + ..->.


-1
where the error term is of order 0(n" ) and where h
tensorial Hermite polynomial
to the tensor

(7.)

T"rn
( ;3") denotes the

(as defined by Amari and Kumon ( 1 9 8 3 ) ) , r e l a t i v e

Using (6.10) we may rewrite


-1/3 the l a s t quantity in (7.11) as

(7.12)

Differential and Integral Geometry in Statistical Inference

151

where

Since
hSt(S';j) = ' W * - / V ^ ]

(7.14)

we find
h r S t ( ' ; ^ r $ t =0
and hence (7.11) reduces to
-1/3
,

c|j|
L = .(
- - y i;a ){l - hht (' j ) - rsL
P . + ...},
u

(7.15)

the error term being 0(n" ).


Suppose, in particular, that the model is an exponential (k,d)
model. We may then compare (7.15) with the Edgeworth expansion for an efficient, bias adjusted estimate of given an ancillary statistic, provided by
formulas (3.33) and (3.25) in Amari and Kumon (1983). It appears that h
"1/3
" 1 / 3 abc
(' j-) z t of (7.15) is the counterpart of Amari and Kumon's r . h u.

P
a

^ab ^ h + H xa^a'iK ' ^^us ^-^^ offers some simplification over the corresponding expression provided by the Amari and Kumon paper.
Note that, again by the symmetry of (7.14), if
-1/3
*rst[3] = 0

(7.16)

for all r,s,t then the first order correction term in (7.15) is 0. Furtherex

more, for any one-parameter model M^ the quantity % with = -1/3, can be made
to vanish by choosing that parametrization for which is the geodesic coordinate for the -1/3 observed conditional connection.

(Note that generally this

parametrization will depend on the value of the ancillary a.) An analogous


result holds for the Edgeworth expansion derived by Amari and Kumon (1983),
referred to above. The parametrization making the = -1/3 expected connection

r vanish has the interpretation of a skewness reducing parametrization, cf.


Kass (1984).

8.

EXPONENTIAL TRANSFORMATION MODELS

Suppose M^ is an exponential transformation model and that the full


exponential model M generated by M is regular. By theorem 2.1 the group G acts
affinely on T = (), and Lebesgue measure on T is quasi-invariant (in fact,
relatively invariant) with multiplier |A(g)|. Assuming, furthermore, that N[
and G have the structure discussed in section 3 with g:|A(g)| = 1} c K we find,
since the mapping g > A(g) is a representation of G, that
|A(h(gx))| = |A(g)||A(h(x))|.
Thus m(x) = |A(fi)| is a modulator and
dv(h) = |A(h)|"dh

(8.1)

is an invariant measure on H (cf. appendix 1).


Again by theorem 2.1 the log likelihood function is of the form
l(h) = {(e)A(h" ] h)* + Bf(h"h)>-w - ((e)A(h~ h)* + (h"h))

(8.2)

where w = t(u) = h" t.


Some interesting special cases are
(i) B( ) or &() or both are 0. Then ( ) of (2.45) is a multiplier (i.e. a homomorphism of G into (R + , ))

Furthermore, if &() = 0 and if

(2.35) is an exponential representation of M_ relative to an invariant dominating measure on X^ then b(x) is a modulator.
(ii) The norming constant a((g)) does not depend on g. If in
addition B(g) does not depend on g, which implies that B( ) = 0, then the conditional distribution of h given w is, on account of the exactness of (2.7),
152

Differential and Integral Geometry in Statistical Inference

p(h;h|w) = c ( w ) | j | * e ( h " l h ) w

153

(8.3)

where the norming constant does not depend on h.


Note that the form (8.3) is preserved under repeated sampling, i.e.
the conditional distribution of h is of the same "type" whatever the sample
size.
The von Mises-Fisher model for directional data with fixed precision
has this structure with w equal to the resultant length r, and as is wellknown the conditional model given r is also of this type, irrespective of
sample size. Other examples are provided by the hyperboloid model with fixed
precision and by the class or r-dimensional normal distributions with mean 0
and precision such that || = 1.
(iii) M is a (k,k-l) model.
For simplicity we now assume that M_ has all the above-mentioned
properties. There is then little further restriction in supposing that M^ is of
the form
p(x,) = b M e x p - a e ^ h ^ h ^ e ^ }

(8.4)

where is the index parameter, a is maximal invariant and e, and e , are


known nonrandom vectors. For (8.4) the log likelihood function is
l(h) = -ae A(h" h)e*

(8.5)

where we have written A for A" . Hence


(8.6)
where r is given by (6.23).

In this case, then, the conditional observed

geometries (^( ;,a),*( ;,a)) are all "proportional" for fixed , with a as
the proportionality factor. The geometric leaves of the foliation of M^, determined as the partition of M_ generated by the index parameter , are thus highly
similar.

In this connection see example 6.3.

APPENDIX 1

Construction of invariant measures


One may usefully generalize the concepts of invariant and relatively
invariant measures as follows. Let a measure on X_ be called quasi-invariant
with multiplier = (g,x) if gy and y are mutually absolutely continuous for
e\/ery gG and if
d ( g ) ( ) = (g 5 x)dy(x).
Furthermore, define a function m on X to be a modulator with associated multiplier (g,x) if m is positive and
m(g) = (g,x)m(x).

(Al.l)

Then, if y x is quasi-invariant with multiplier (g,x) and if m is a modulator


satisfying (Al.l) we have that
y = m" 1 y x

(A.2)

is an invariant measure on _X.


In particular, to verify that the measure y defined by (3.9) is
invariant one just has to show that m(y) = J (z\(u) is a modulator with associated multiplier J (q\(y) because, by the standard theorem on transformation of
integrals, Lebesgue measure is quasi-invariant with multiplier J ( q )(y)
Corresponding to the factorization G = HK there are unique factorizations g = hk
and gz = hk and, using repeatedly the assumption that K = G for every orbit
representative u, we find

154

Differential and Integral Geometry in Statistical Inference

155

In the last step we have used the fact that


J ( k )() = 1 for every kK.

(AT.3)

To see the validity of (A1.3) one needs only note that for fixed u the mapping
k -> J ,. x(u) is a multiplier on K and since K is compact this must be the
trivial multiplier 1. Actually, (A1.3) is a necessary and sufficient condition
for the existence of an invariant measure on _Y. This may be concluded from
Kurita (1959), cf. also Santal (1979), section 10.3.

APPENDIX 2

An equality of Jacobians under left factorizations


Lemma.

Let G = HK be a left factorization of G (as discussed in

sections 3 and 5 ) , let denote the natural action of G on H and let denote
left action of G on itself.
Proof.

Then J '(h)(e) = J ( h ) ^

for

a11

hH#

Let g = hk denote an arbitrary element of G.

Writing g

symbolically as (h,k) and employing the mappings and defined by


:g + h

:g -> k

we have, for any h'H,


(h')g = (h')(h,k) = ((h l h) (h'hk))
and hence the differential of (h')g is

ah

D(h')(g) =
9(h'hk)
ah

3(h'hk)
9k

from which we find, using (h'h) = (h')h and (h'k) = k,

(h')

(e)

(h')(e)

156

APPENDIX 3

An inversion result
The validity of formula (6.24) is established by the following
Lemma. Let G = HK be a left factorization of the group G with the
associated mapping :g = hk -> h (as discussed in sections 3 and 5). Furthermore, let h1 denote an arbitrary element of H. Then
3n(h~ ] h')*

a(h'"h)*
h=h'

h=h'

(A3.1)

Proof. The mapping h -> (h" h 1 ) may be composed of the three


mappings h -> h1

h, g -+ g" and , as indicated in the following diagram

H
where i indicates the inversion g -> g-1'. This diagram of mappings between differentiate manifolds induces a corresponding diagram for the associated differential mappings between the tangent spaces of the manifolds, namely

157

158

0. E. Barndorff-Nielsen

"Hi

>

TG .

Di
D

TH

n(hI-1h)

From this latter diagram and from the well-known relation


(Di)(e) = -I,
where I indicates the identity matrix, formula (A3.1) may be read off immediately.

Acknowledgements
I am much indebted to Poul Svante Eriksen, Peter Jupp, Steffen L.
Lauritzen, Hans Anton Salomonsen and Jorgen Tornehave for helpful discussions*
and to Lars Smedegaard Andersen for a careful checking of the manuscript.

REFERENCES

Amari, S.-I. (1982a). Differential geometry of curved exponential families curvatures and information loss. Ann. Statist. 10, 357-385.
Amari, S.-I. (1982b). Geometrical theory of asymptotic ancillarity and conditional inference. Biometrika 69, 1-17.
Amari, S.-I. (1985). Differential-Geometric Methods in Statistics. Lecture
Notes in Statistics 28, Springer, New York.
Amari, S.-I. (1986). Differential geometrical theory of statistics - towards
new developments. This volume.
Amari, S.-I. and Kumon, M. (1983). Differential geometry of Edgeworth expansion
in curved exponential family. Ann. Inst. Statist. Math. 35, 1-24.
Barndorff-Nielsen, 0. E. (1978a).

Information and Exponential Families.

Wiley, Chichester.
Barndorff-Nielsen, 0. E. (1978b). Hyperbolic distributions and distributions on
hyperbolae. Scand. J. Statist. 5_, 151-157.
Barndorff-Nielsen, 0. E. (1980). Conditionality resolutions. Biometrika 67,
293-310.
Barndorff-Nielsen, 0. E. (1982). Contribution to the discussion of R. J.
Buehler: Some ancillary statistics and their properties. J. Amer.
Statist. Assoc. 77, 590-591.
Barndorff-Nielsen, 0. E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika 70, 343-365.
Barndorff-Nielsen, 0. E. (1984). On conditionality resolution and the likelihood ratio for curved exponential families. Scand. J. Statist. 11, 157159

160

0. E. Barndorff-Nielsen

170. Amendment Scand. J. Statist. 12. (1985).


Barndorff-Nielsen, 0. E. (1985). Confidence limits from c|j|^C in the singleparameter case. Scand. J. Statist. ^ 2 , 83-87.
Barndorff-Nielsen, 0. E. (1986a). Likelihood and observed geometries. Ann.
Statist. 14, 856-873.
Barndorff-Nielsen, 0. E. (1986b).

Inference on full or partial parameters

based on the standardized signed log likelihood ratio. Biometrika 73,


307-322.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1983a). Exponential models with
affine dual foliations. Ann. Statist. 11, 753-769.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1983b). Reproductive exponential
families. Ann. Statist. 11, 770-732.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1984). Combination of reproductive
models. Research Report 107, Dept. Theor. Statist., Aarhus University.
Barndorff-Nielsen, 0. E., Blaesild, P., Jensen, J. L. and Jorgensen, B. (1982).
Exponential transformation models. Proc. R. Soc. A 379, 41-65.
Barndorff-Nielsen, 0. E. and Cox, D. R. (1984). Bartlett adjustments to the
likelihood ratio statistic and the distribution of the maximum likelihood
estimator. J. R. Statist. Soc. B 46, 483-495.
Barndorff-Nielsen, 0. E., Cox. D. R. and Reid, N. (1986). The role of differential geometry in statistical theory.

Int. Statist. Review 54, 83-96.

Barut, A. 0. and Raczka, R. (1980). Theory of Group Representations and Applications. Polish Scientific Publishers, Warszawa.
Boothby, W. M. (1975). An Introduction to Differentiate Manifolds and
Riemannian Geometry. Academic Press, New York.
Burridge, J. (1981). A note on maximum likelihood estimation for regression
models using grouped data. J. R. Statist. Soc. B 43, 41-45.
Chentsov, N. N. (1972). Statistical Decision Rules and Optimal Inference.
(In Russian.) Moscow, Nauka. English translation (1982). Translation of
Mathematical Monographs Vol. 53. American Mathematical Society, Providence,
Rhode Island.

Differential and Integral Geometry in Statistical Inference

1 6 1

Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a


curved exponential family. Ann. Statist. Y\_, 793-803.
Eriksen, P. S. (1984a).

(k,l) exponential transformation models. Scand. J.

Statist. 21, 129-145.


Eriksen, P. S. (1984b). A note on the structure theorem for exponential transformation models. Research Report 101, Dept. Theor. Statist., Aarhus
University.
Eriksen, P. S. (1984c). Existence and uniqueness of the maximum likelihood
estimator in exponential transformation models. Research Report 103,
Dept. Theor. Statist., Aarhus University.
Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proc.
Roy. Soc. A 144, 285-307.
Hauck, W. W. and Donner, A. (1977). Wald's test as applied to hypotheses in
logit analysis. J. Amer. Statist. Ass. 72, 851-853. Corrigendum:
J. Amer. Statist. Ass. 75^ (1980), 482.
Jensen, J. L. (1981). On the hyperboloid distribution. Scand. J. Statist. 8,
193-206.
Kurita, M. (1959). On the volume in homogeneous spaces. Nagoya Math. J. 15,
201-217.
Lauritzen, S. L. (1986). Statistical manifolds. This volume.
Santal, L. A. (1979).

Integral Geometry and Geometric Probability.

Encyclo-

pedia of Mathematics and Its Applications. Vol. 1, Addison-Wesley, London.


Shuster, J. J. (1968). A note on the inverse Gaussian distribution function.
J. Amer. Statist. Assoc. 63, 1514-1516.
Vaeth, M. (1985). On the use of Wald's test in exponential families. Int.
Statist. Review 53, 199-214.

STATISTICAL MANIFOLDS
Steffen L. Lauritzen

1.

Introduction

165

2.

Some Differential Geometric Background

167

3.

The Differential Geometry of Statistical Models

177

4.

Statistical Manifolds

179

5.

The Univariate Gaussian Manifold

190

6.

The Inverse Gaussian Manifold

198

7.

The Gamma Manifold

203

8.

Two Special Manifolds

206

9.

Discussion and Unsolved Problems

212
215

10. References

^Institute for Electronic Systems, Aalborg University Center, Aalborg, Denmark

163

1.

INTRODUCTION

Euclidean geometry has served as the major tool in clarifying the


structural problems in connection with statistical inference in linear normal
models. A similar elegant geometric theory for other statistical problems
does not exist yet.
One could hope that a more general geometric theory could get the
same fundamental role in discussing structural and other problems in more
general statistical models.
In the case of non linear regression it seems clear that the
geometric framework is that of a Riemannian manifold, whereas in more general
cases it seems as if a non-standard differential geometry has yet to be
developed.
The emphasis in the present paper is to clarify the abstract
nature of this differential geometric object.
In section 2 we give a brief introduction to the notions of modern
differential geometry that we need to carry out our study. It is an extract
from Boothby (1975) and Spivak (1970-75) and we are mainly using a coordinatefree setup.
Section 3 is an ultrashort summary of some previous developments.
The core of the paper is contained in section 4 where we abstract the notion
of a statistical manifold as a triple (M,g,D) where M is a manifold, g is a
metric and D is a symmetric trivalent tensor, called the skewness of the
manifold. Section 4 is fully devoted to a study of this abstract notion.
Sections 5, 6, 7, and 8 are detailed studies of some examples of
165

166

Steffen L. Lauritzen

statistical manifolds of which some (the Gaussian, the inverse Gaussian and
the Gamma) manifolds are of interest because of their leading role in statistical theory, whereas the examples in section 8 are mostly of interest because
they to a large extent produce counterexamples to many optimistic conjectures.
Through the examples we also try to indicate possibilities for discussing
geometric estimation procedures.
In section 9 we have tried to collect some of the questions that
naturally arise in connection with the developments here and in related pieces
of work.

2. SOME DIFFERENTIAL GEOMETRIC BACKGROUND

A topological manifold ji is a Hausdorff space with a countable


base such that each point pM has a neighborhood that is homeomorphic to an
open subset of IRm. m is the dimension of M. and is well-defined. A differentiate structure on JM is a family
U

- U * )A

where IL is an open subset of .M and are homeomorphisms from U. onto an open

subset of IR m , satisfying the following:


(1) UU, = M

(2) for any , 9 : . o" is a C(IRm) function wherever it is well


c.

-i

defined
(3) if V is open, : V -> IR m is a homeomorphism, and o " , o

are

C wherever they are well defined, then (V,)U.


The condition (2) is expressed as . and

being compatible.
2

In \ery simple cases M. is itself homeomorphic to an open subset


of IR m and the differentiate structure is just given by (M.,) and all sets
(U. ,.) where U. is an open subset of M. and o " is a diffeomorphism.

The sets U are called coordinate neighborhoods and coordinates.


A

The pair (U , ) is called a local coordinate system.


A

1, equipped with a differentiate structure is called a differentia t e manifold or a C^-manifold.


A different!able structure can be specified by any system satisfying (1) and (2). Then there is a unique structure L[ containing the specified
167

168

Steffen L. Lauritzen

local coordinate system.


The differentiate structure gives rise to a natural way of defining a different!able function. We say that f: fi > IR is in C(M.) if it is
a usual C-function when composed with the coordinates:
f C(M) -* f o " C(,(U)) for all .

Important is the notion of a regular submanifold N M of M. A subset N. of M.


is a regular submanifold if it is a topological manifold with the relative
topology and if it has preferred coordinate neighborhoods, i.e. to each point
pj^ there is a local coordinate system (U , ) with pU. such that
A

0 ) ; (U ) = ]-,[m

i)

(p) = (0

11)

(u nn) = {(x\...,x n ,o

o ) , |xi|<e}

t inherits then in a natural way the differentiate structure from M by


(V , ) where
A

where (U. ,. ) is a preferred coordinate system.


A

All C(N.)-functions can then be obtained by restriction to \ of


C(M)-functions.
For pM^, C(p) is the set of functions whose restriction to some
open neighborhood U of p is in C(U). We here identify f and g C(p) if their
restriction to some open neighborhood of p are identical.
The tangent space T (H) to M. at p is now defined as the set of all
maps X : C(p) -* IR satisfying
i)
X p (f+3g) = X p (f)+X p (g)

, e IR

f g c (P)

x p (fg) = x p (f)g(p)+f(p)x p (g)

TO

One should think of X as a directional derivative. X is called a tangent


a
P
P
vector.
T (M[) is in an obvious way a vector-space and one can show that
dim(T (M)) = m.

Statistical Manifolds

169

For each particular choice of a coordinate system, there corresponds a canonical basis for T (M), with basis vectors being

A vector field is a smooth family of tangent vectors X = (X ,p M) where


X T (Nl). To define "smooth" in the right way, we demand a vector field X to
be a map:
X: C(M) - C(M)
i)

X(f+g) = X(f)+X(g)

x(fg) = x(f)g+fx(g)

,IR

f,gC(M)

and now we w r i t e
X p (f) = X ( f ) (P)

The vector fields on IA_ are denoted as X_(MJ. _X(M_) is a module over C(Mj: if
f,gC(M), X,YX(M) then
(fX+gY) (h) = fX(h) + gY(h)
is also in X_(M_). X_(M_) is a Lie-algebra with the bracket operation defined as
[X,Y](f) = X(Y(f)) - Y(X(f))
The Lie-bracket [ ] satisfies
[X,[Y,Z]] + [Y,[Z,X]] + [Z,[X,Y]] = 0
[X,Y] = -[Y,X]
[X+3X9,Y] = [X ,Y] + 3[X9,Y]
1

'

, N IR

[X,Y 1 +3Y 2 ] = [X,Y ] + B[X,Y 2 ]

,BIR

(Jacobi identity)
(anticommutativity)
(binearity)

Further one can easily show t h a t


[X,fY] = f[X,Y] + (X(f))Y .

The locally defined vector fields E., representing differentiation w.r.t. local
coordinates, constitute a natural basis for the module XjU), where U is a
coordinate neighborhood.
A covariant tensor D of order k is a C-k-linear map
D:

X(M)...X(M) > C(M),

170

Steffen L. Lauritzen

i.e.
D(X r ...,X k ) C(M),
D(X r ...,fX i +gY i ,X i + 1 ,...,X k )
= fD(x r ...,x k ) + gD(x 1 ,...,Y i ,x. + 1 ,...,x k ).
A tensor is always pointwise defined in the sense that if X . = Y ., then
D(X r ...,X k )(p) = D(Y r ...,Y k )(p).
This means that any equations for tensors can be checked locally on a basis
e.g. of the form E . These satisfy [E ,E.] = 0 and all tensorial equations hold
if they hold for vector fields with mutual Lie-brackets equal to zero. This is
a convenient tool for proving tensorial equations and we shall make use of it
in section 3.
A Riemannian metric g is a positive symmetric tensor of order two:
g(x,x) > o

g(x,) = g(Y,x)

Since tensors are pointwise, it can be thought of as a metric g on each of the


tangent spaces T (M_).
A curve = ((t),t[a,b]) is a C-map of [a,b] into M.. Note that
a curve is more than the set of points on it. It involves effectively the
parametrization and is thus not a purely geometric object.
Let now denote any vector field such that
(f)((t)) = j f ((t))

for all t[a,b]sfC(M)

The length of the curve is now given as


a
Curve length can be shown to be geometric.
An important notion is that of an affine connection on a manifold.
We define an affine connection as an operator v
v: _X(M) x _X(M) +
satisfying (where we write v Y for the value)

Statistical Manifolds

171

V(Y+3Z) = VY + |3VZ, ,6 IR

v (fY) = X(f)Y + fv Y

iii)

v f + g Z = fv Z + gv Z .

An affine connection can be thought of as a directional derivation of vector


fields, i.e. v Y is the "change" of the vector field Y in X's direction.
An affine connection can be defined in many ways, the basic reason
being, that "change" of Y is not well defined without giving a rule for comparing vectors in T (M) with vectors in T (M), since they generally are different
p
P
l "
2~
spaces.
An affine connection is exactly defining such a rule via the notion
of parallel transport, to be explained in the following. We first say that a
vector field X is parallel along the curve if
v X = 0 on ,
where again is any vector field representing -rr-.
Now for any vector X / \ T (a\(_M) there is a unique curve of
vectors
X ( t ) ,t[a,b],

X(t) T(t)(H)

such that v X = 0 on , i.e. such that these are all parallel, and such that
/ %%is equal to the given one. We then write
XX /
X

Y(b)

and say that defines parallel transport along . is in general an affine


map.
Note that depends effectively on the curve in general.
An affine connection can be specified by choosing a local basis
for the vector-fields (E. ,i=l.... ,m) and defining the symbols (C-functions)

by

172

Steffen L. Lauritzen

where we adopt the summation convention that whenever an index appears in an


expression as upper and lower, we sum over that index. Using the properties of
an affine connection we thus have for an arbitrary pair of vector-fields
X = f^., Y = g E.

v^fVg^+fVr^.
A geodesic is a curve with a parallel tangent vector field, i.e. where
V = 0 On .
Associated with the notion of a geodesic is the exponential map induced by the
connection.
For all pM_, X T (M_) there is a unique geodesic Y , such that
P
P
p
Xy (0) = p
(0) = X n (**)
X
p
p
P
This is determined in coordinates by the differential equations below together
with the initial conditions (**)
x k (t) + x ^ t ^ U K ^ W t ) ) = 0
where X Y (t) = (x (t),...,xm(t)) in coordinates.
P
Defining now for X T (M_)
exp{X p } = (1)
we have exptX } = Y (t).
X
p
P
The exponential map is in general well defined at least in a neighborhood of zero in T (M.) and can only in special cases be defined globally.
In general, geodesies have no properties of "minimizing" curve
length. However, on any Riemannian manifold, (i.e. a manifold with a metric
tensor g ) , there is a unique affine connection v satisfying
i)
v Y - v X - [X,Y] 0
)

Xg(Y.Z) = g(v Y,Z) + g(Y,v Z).

This connection is called the Riemannian connection or the Levi-Civita connection.

S t a t i s t i c a l Manifolds

173

Property i) is called torsion-freeness and property ii) means that


the parallel transport is Isometric, which is seen by the argument.
g(Y.Z) = g(v^Y,z) + g(Y,v^z) = 0 if v^Y = v^z = 0.
We can then write g(ii Y, Z) ( b ) = g ( Y , Z ) ( a ) or just g( Y, Z) = g(Y,Z).
If v is Riemannian, its geodesies will locally minimize curve length.
To all connections v there is a torsion free connection v such that
this has the same geodesies. All connections in the present paper are torsion
free, whereas not all of them are Riemannian.
When the manifold is equipped with a Riemannian metric, it is often
convenient to specify the connection through the symbols (C-functions) r... ,
where

Defining the matrix of the metric tensor and its inverse as


.. = g(E.,E.)

(g i j ) = (g-jj 1 ,

the symbols are related to those previously defined as

The Riemannian connection is given by

A connection defines in a canonical way the covariant derivative of


a tensor D as
(v D)(X 1 ,...,X k ) = XD(X r ...,X k ) - D(X 1 ,...,v X.,...,X | < ).
(v Y D) is again a covariant tensor of order k and the map

S(X,X r ...,X k ) = (v D)(X r ...,X k )


becomes a tensor of order k+1. The fact that the Riemannian connection preserves inner product under parallel translation can then be written as
(vg)(Y,Z) 0.
S i m i l a r l y , i f D i s a m u l t i l i n e a r map from _X(h1).. . x X W i n t o ^ ( M ) i t s

174

Steffen

covariant d e r i v a t i v e

L. Lauritzen

i s defined as

(vD)(X...,Xk) = vD(X...,Xk)

D(X...,vX.,...,Xk).

Such multilinear maps are called tensor fields.


An important tensor field associated with a space with an affine
connection is the curvature field, R: _X(M) x J((M) * _X(M) + _X(M)
R(X,Y)Z = v v Z - v v Z - v [ X j Y ] Z .
A manifold with a connection satisfying R = 0 is said to be flat.

If the

connection is torsion free, the curvature satisfies the following identities:


a)

R(X,Y)Z = -R(Y,X)Z

b)

R(X,Y)Z + R(Y,Z)X + R(Z,X)Y = 0


(Bianchi's 1st identity)

c)

(v R)(Y,Z s W) + (vR)(Z,X,W) + (vzR)(X,Y,W) = 0

(Bianchi's 2nd identity).


Strictly speaking, a) does not need torsion freeness.
On a Riemannian manifold, we also define the curvature tensor R as
R(X,Y,Z,W) = g(R(X,Y)Z,W)
where R is used in two meanings, both referring to the Riemannian connection.
The Riemannian curvature tensor satisfies
i)

R(X,Y,Z,W) = -R(Y,X,Z,W)

ii)

R(X,Y,Z,W) + R(Y,Z,X,W) + R(Z,X,Y,W) = 0

lii)

R(X,Y,Z,W) = -R(X,Y,W,Z)

iv)

R(X,Y,Z,W) = R(Z,W,X,Y)'

We shall use the symbol R also for the curvature tensor


R(X,Y,Z,W) = g(R(X,Y)Z,W),
when _M has a Riemannian metric g and a torsion-free but not necessarily
Riemannian connection v. Then i) and ii) are satisfied, but not necessarily
iii) and iv).
If (E. ) is a local basis for T (M_), the curvature tensor can be
calculated as

Statistical Manifolds

(r

175

1nn jk "

The sectional curvature is given as

,= g(R(x,Y)Y,x)

(
X

'Y

g(x,x)g(Y,Y)-g(x,Y) 2

and determines in a Riemannian manifold also the curvature.

If the curvature

satisfies i) to iv) the sectional curvature also determines R.


Two other contractions of the curvature tensor are of interest:
The Ricci-curvature
Cl

R(X,X) = | g(R(u i ,X)X,u i )


= g(X,X)m K(
u )
i=l
> -j

where (X/g(X,X),u1,...,u

) is an orthonormal system for T (M).

Finally the scalar curvature is


S(p) =

c 1 R(u i ,u i )

where u,,...
We then have the identity
9 u m is an orthonormal system in Tp(M).
i

S(p) = K ( u
i.J

).
i 3

If N^ is a regular submanifold of M_, the tangent space of H_ can in a natural way


be identified with the subspace of XjM.) determined by
X X_(N) X(M) +-> [f=g on H + X(f) = X(g) on H],
In that way all tensors etc. can be inherited to N_ by restriction.

If M^ has a

Riemannian metric, N^ inherits it in an obvious way, and this preserves curve


length, in the sense that the length of a curve in j^w.r.t. the metric inherited, is equal to that when the curve is considered as a curve in M.
An affine connection is inherited in a more complicated way:
We define
( N v Y)(p) = P p (v Y)(p)
where P is the projection w.r.t. g onto the tangent space T (U)E=T W
vector (vY)n which is not necessarily in T p (Nj.

In fact we define the

of the

17 6

Steffen

L. Lauritzen

embedding curvature of H_ relative to M^ as the tensor f i e l d X(N) x XjN} -> XjMj

' ) =v x "
or,equivalently, as
H N (X,Y,Z) = g ( H N ( X , Y ) , Z )

Where X,Y X(N), 1 X(N)^ (or Z X(M)).


If H N 0 we say that N. is a totally geodesic submanifold of M_. A
totally geodesic submanifold has the property that any curve in H_ which is a
geodesic w.r.t. the connection on N_, also is a geodesic in M_.

3.

THE DIFFERENTIAL GEOMETRY OF STATISTICAL MODELS

A family of probability measures P on a topological space _X inherits its topological structure from the weak topology. Most statistical models
are parametrized at least locally by maps (homeomorphisms)
: U + c l R m

where U is an open subset of and an open subset of IRm.

From this para-

metrization we get P^ equipped with a differentiate structure, provided the


various local parametrizations are compatible. Considering for a while only
local aspects, we can think of P as {P ,}. We let now f(x,) denote the

density of P w . r . t . a dominating measure and assume these to be C-functions

of . Under suitable regularity assumptions we can now equip P^with a


Riemannian metric by defining l(x,) = log f (x,) and
g j() = g ( E E j ) p = ^ ( E . D E j d ) ) .
The metric is the Fisher information and different parametrizations define the
same metric on P_. Similarly we can define a family of affine connections (the
-connections) on P^ by the expressions
?

i j k = "ij - f i j k

IR

-where

r... is the Riemannian connection.


1 JK

The Fisher information as a metric was first studied by Rao (1945)


and the -connections in the case of finite and discrete sample spaces by
Chentsov (1972). Later the -connections were introduced and investigated
independently and in full generality by Amari (1982).
177

178

Steffen L. Lauritzen

For a more fair description of the history of the subject (the


above is indecently short), see e.g. the introduction by Kass in the present
monograph, Amari (1985) and/or Barndorff-Nielsen, Cox and Reid (1986).
Two of these connections play a special role:
The exponential connection

(for =l) and

the mixture connection

(for =-l).

The exponential connection has r...= 0 when expressed in the


canonical parameter in an exponential family, and similarly when we express
r.ijk
., (the
mixture connection) in the mean value coordinates of an exponential
^
family r.i JK
., 0. Further we have the formulae
)) and
T

ijk "

which often are useful for computations.


These structures are in a certain sense canonical on a statistical
manifold. Chentsov (1972) showed in the case of discrete sample spaces that
the -connections were the only invariant connections satisfying certain invariance properties related to a decision-theoretic approach. Similarly, the
Fisher information metric is the only invariant Riemannian metric. These results have recently been generalized to exponential families by Picard (1985).
On the other hand, similar geometric structures have recently
appeared such as minimum-contrast geometries (Eguchi, 1983) and the observed
geometries introduced by Barndorff-Nielsen in this monograph.
The common structure that seems to appear again and again in current statistical literature is not standard in modern geometry since it involves
study of the interplay between a Riemannian metric and a non-Riemannian connection or even a whole family of such connections.
It seems thus worthwhile to spend some time on studying this
structure from a purely mathematical point of view. This has already been done
to some extent by Amari (1985).
mathematical structures.

In the following section we shall outline the

4.

STATISTICAL MANIFOLDS

A statistical manifold is a Riemannian manifold with a symmetric


and covariant tensor D or order 3. In other words a triple (NUg,D) where M_ is
an m-dimensional C-manifold, g is a metric tensor and D: X.(MJ x X_{M) x ^{M) ->
C(MJ a trilinear map satisfying
D(X,Y,Z) = D(Y,X,Z) = D(Y,Z,X)
(=D(X,Z,Y) = D(Z,X,Y) = D(Z,Y,X))
D is going to play the role T... in the previous section. We use D to distinguish the tensor from the torsion field. D is called the skewness of the
manifold.
Instead of D we shall sometimes consider the tensor field B defined
as
g(Bf(X,Y),Z) = D(X,Y,Z).
We have here used that the value of a tensor field is fully determined when the inner product with an arbitrary vector field Z is known for all
Z.
The above defined notion could seem a bit more general than necessary, in the sense that some Riemannian manifolds with a symmetric trivalent
tensor D might not correspond to a particular statistical model.
On the other hand the notion is general enough to cover all known
examples, including the observed geometries studied by Barndorff-Nielsen and
the minimum contrast geometries studied by Eguchi (1983).
Further, all known results of geometric nature for statistical
manifolds as studied by Amari and others can be shown in this generality and
179

180

Steffen L. Lauritzen

it seems difficult to restrict the geometric structure further if all known


examples should be covered by the general notion.
Let now (N[,g,D) (or(NUg,D)) be a statistical manifold. We now
define a family of connections as follows:
v Y = v Y - |D(X,Y)

(3.1)

where v is the Riemannian connection. We then have


3.1

Proposition v as defined by (3.1) is a torsion free connection.

It is the

unique connection that is torsion free and satisfies


(vg)(Y,Z) = D(X,Y,Z)

(3.2)

Proof: That v is a connection: Linearity in X is obvious. Scalar linearity


in Y as well. We have
v (fY) = v (fY) - |D(X,fY) = X(f)Y + fv Y.
Torsion freeness follows from symmetry of D:
v Y - v X - [X,Y] = v Y - v X - [X,Y]

-f [D(X,Y) - D(Y,X)] = 0.

That v satisfies (3.2) follows from


(vg)(Y,Z) = Xg(Y,Z) - g(v Y,Z) - g(Y,v Z)
= (vg)(Y,Z) + D(X.Y.Z) = 0 + D(X,Y,Z).
If ^ is torsion free and satisfies (3.2) we obtain:
1)

Xg(Y.Z) = g(v Y,Z) + g(Y,v Z) + D(X,.Y,Z)

Zg(X,Y) = g(v Z,Y) + g(v Z,X) + oD(X,Y,Z)


+ g([Z,X],Y) + g([Z,Y],X)

111)

Yg(z.x) = g(^ Y z,x) + g(v Y,z) + D(x,,z)

- g([x,Y],z)
Calculating now i) - ii) + iii) we get
Xg(Y,Z) - Zg(X,Y) + Yg(Z,X) = D(X 5 Y,Z)
-g([Z,X],Y) - g([Z,Y],X) - g([X,Y],Z) + 2g(v Y,Z).

Statistical Manifolds

181

Since this equation also is fulfilled for v we get


g(v Y,Z) = g(v Y,Z), whereby v = v.
0 _
Obviously v = v, the Riemannian connection.
To check what happens when we make a parallel translation we first
consider the notion of a conjugate connection (Amari, 1983).
Let (MUg) be a Riemannian manifold and v an affine connection. The
conjugate connection v* is defined as
g(v* Y,Z) = Xg(Y,Z) - g(Y,v Z)
3.2 Lemma v* is a connection,

(3.3)

(v*)* = v.

Proof: Linearity in X is obvious. So is linearity in Y w.r.t. scalars. We


have
g(v*(fY),Z) = Xg(fY,Z) - g(fY,v Z)
= X(f)g(Y,Z) + fXg(Y.Z) - fg(Y,v Z)
= g(X(f)Y + fv* Y,Z).
And further
g((v*)*Y,Z) = Xg(Y.Z) - g(v* Z,Y)
= Xg(Y,Z) - {Xg(Z,Y) - g(vY,Z)} = g(v Y,Z).
If we now let ,n* denote parallel transport along the curve we obtain:
3.3 Proposition
g(X,*Y) = g(X,Y)
Proof: Let X be v-parallel along and Y v*-parallel. Then we have
g(x,Y) = g(vo<,Y) + g ( x , v M ) = o.
In words Proposition 3.3 says that parallel transport of pairs of vectors w.r.t.
a pair of conjugate connections is "isometric" in the sense that inner product
is preserved.
Finally we have for the -connections, defined by (3.1):
3.4 Proposition

(v)* = v .

182

Steffen L. Lauritzen

Proof:
g(v Y,Z) = g(v Y,Z) - I D(X,Y,Z)
g(Y,v Z) = g(Y,v Z) + | D(X,Z,Y)
Adding and using the symmetry of D together with the defining property of the
Riemannian connection we get
g(v Y> Z ) + 9(Y^v Z) = Xg(Y.Z)

(3.4)

The relation (3.4) is important and was also obtained by Amari (1983). If we
now consider the curvature tensors R and R* corresponding to v and v* we obtain
the following identity:
3.5

Proposition
R(X,Y,Z,W) = -R*(X,Y,W,Z)

(3.5)

Proof: Since we shall show a tensorial identity, we can assume [X,Y] = 0 as


discussed in section 1. Then we get
XYg(Z,W) = X(g(vZ,W) + g(Z,v* W))
= g(v v Z,W) + g(vZ,v*W)
+ g(vZ,v*W) + g(Z,v*v*W).
By alternation we obtain
0 = [X,Y]g(Z,W) = XYg(Z,W) - YXg(Z,W)
= R(X,Y,Z,W) + R*(X,Y,W,Z).
Note that the Riemannian connection is self-conjugate which gives the well
known identity for the Riemannian curvature tensor, see section 1.
Consequently we obtain
3.6 Corollary The following conditions are equivalent
i)

R = R*

11)

R(X,Y,Z,W) = -R(X,Y,W,Z)

Proof: It follows directly from (3.5).


And, also as a direct consequence:
3.7 Corollary v is flat if and only if v* is.

Statistical Manifolds

183

The identity ii) is not without interest and we shall shortly


investigate for which classes of statistical manifolds this is true. Before we
get to that point we shall investigate the relation between a statistical manifold and a manifold with a pair of conjugate connections.
We define the tensor field D-,, and the tensor D, in a manifold with
a pair (v,v*) of conjugate connections by
f^X.Y) = v* Y - v Y
D^X.Y.Z) = g(B (X,Y),Z).
We then have the following
3.8 Proposition
1

f v is torsion free, the following are equivalent

* is torsion free

ii)

D-. is symmetric

iii)

v=

Proof: That D, is symmetric in the last two arguments follows from the
calculation
D^X.Y.Z) = g(v*Y,Z) - g(v Y,Z)
= Xg(Y>Z) - g(Y,v Z) - [Xg(Y,Z)-g(Y,v*Z)]
= D (X,Z,Y)
The difference between two connections is always a tensor field,

i) -* ii)

follows from the calculation


g(v*Y-v*x-[x,],z) = g(v -v x-[x,],z)
+ D^X.Y.Z) - D^Y.X.Z).
That iii) * i) is obvious since then v*=2v-v.
To show that i) > iii) we use the uniqueness of the Riemannian connection. We define
v=
and see that this is torsion free, when v and v* both are. But

184

Steffen L.

Lauritzen

g(v Y,Z) + g(Y,v Z) = ^g(v Y,Z) + ^g(v*Y,Z)


+ hg(,)

+ ^g(Y,v Z) = Xg(Y,Z)

showing that v is Riemannian and thus equal to v.


Suppose now that v is given with v* being torsion free. We can then
define a family of connections as

and we obtain
3.9

Corollary

- 1
-1
v* = v, v = 7 , v = v*.

Proof: It is enough to show v = v. But


4

V ^ V ^ " ^'h^ V*
We have thus established a one-to-one correspondence between a statistical
manifold (NUg,D) and a Riemannian manifold with a connection v whose conjugate
v* is torsion free, the relation being given as
D(X,Y) = v*Y - v Y
v Y = v Y - ^ 2 D(X 9 Y).
In some ways it is natural to think of the statistical manifolds as
being induced by the metric (Fisher information) and one connection (v) (the
exponential), but the representation (M,g,D) is practical for mathematical
purposes, because D has simpler transformational properties than v.
By direct calculation we further obtain the following identity for
a statistical manifold and its -connections
3.10 Proposition
g(v Y,Z) - g(v Z,Y) = g(v Y,Z) - g(v Z,Y)
Proof: The result follows from
g(v Y,Z) - g(v Z,Y) = g(v Y,Z) - g(v Z 9 Y)
- |D(X,Y,Z) + |D(X,Z,Y)
and the symmetry of D.

(3.6)

Statistical Manifolds

185

We shall now return to studying the question of identities for the


curvature tensor of a statistical manifold. We define the tensor
F(X,Y,Z,W) = (vD)(Y,Z,W)
where D is the skewness of the manifold, and v is the Riemannian connection. We
then have
3.11

Proposition The following are equivalent

i)

R = R for all lR

ii)

F is symmetric

Proof: The proof reminds a bit of bookkeeping. We are simply going to establish the identity
R(X,Y,Z,W) - R(X,Y,Z,W) = {F(X,Y,Z,W) - F(Y,X,Z,W)}

(3.7)

by brute force.
Symmetry of F in the last three variables follows from the symmetry
of D. We have
2F(X,Y,Z,W) = 2XD(Y,Z,W)
-2(D(vY,Z,W) + D(Y,vZ,W) + D(Y,Z,vW))

Since v = ^(v + v) and oD(X,Y,Z) = g(v Y,Z) - g(v Y,Z) we further get
2D(vY,Z,W) = 2g(v z W,v Y) - 2g(v z W,v Y)
= g(v z W,v Y) + g(v z W,v Y)
- g(v z W,v Y) - g(v z W,v Y),
and similarly for the two other terms. Further we get
2XD(Y,Z,W) = 2X(g(vZ,W) - g(v Z,W))
= 2g(v v Z,W) - 2g(v v Z,W)
+ 2g(v Z,v W) - 2g(v Z,v W)
Collecting terms we get the following table of terms in 2F(X,Y,Z,W), where
lines 1-3 are from 2XD(Y,Z,W), 4 and 5 from 2D(vY,Z,W) 6 and 7 from
2D(Y,vZ,W) and 8 and 9 from 2D(Y,Z,v W).

186

Steffen L. Lauritzen

Table of terms of 2F(X,Y,Z,W)


with + sign

with - sign

- -

1.

2g(v vZ,W)

2g(v v Z,W)

2.

g(v ,v w)

g(v Z,v W)

3.

g(v Y,v w)

4.

g(v z w,v )

5.

g(v z w,v )

6.

g(v W,v Z)

8.

g(v Z,v W)

9.

g w v^- 9 ^ v " /

g(v W,v Z)
-

g(v W,v Z)
-

g(v z,v w)

g(v z w,v )

g(v z w,v )

g(v W,v Z)

g(v z,v w)

7.

g(v Z,v W)

Lines 4 and 5^ disappear by torsion freeness and alternation. Lines ^ + 9_ add up


to zero. Lines _3 + 7_ disappear by alternation. Lines 6^+8^ also. What is left
over are only terms from line ]_ whereby
2F(X,Y,Z,W) - 2F(Y,X,Z,W)
= 2R(X,Y,Z,W) - 2R(XSY,Z,W)
and the result and (3.7) follows.
A statistical manifold satisfying this kind of symmetry shall be
called conjugate symmetric. We get then immediately
3.12 Corollary The following is sufficient for a statistical manifold to be
conjugate symmetric
'

0
3 n + such that R = 0,
i.e. that the manifold is n-f

As shown e.g. in Amari (1985), exponential families are 1 - flat


and therefore always conjugate symmetric.
In a conjugate symmetric space, the curvature tensor thus satisfies
all the identities of the Riemannian curvature tensor, i.e. also

Statistical Manifolds

187

R(X,Y,Z,W) = -R(X,Y,W,Z)
R(X,Y,Z,W) = R(Z,W,X,Y)j
This implies as mentioned earlier that the sectional curvature determines the
curvature tensor.
We shall later see examples of statistical manifolds actually
generated by a statistical model that are not, conjugate symmetric.
It also follows that the condition

3 0 t 0 such that FP = R

(3.9)

is sufficient for conjugate symmetry.


Amari (1985) investigated the case when the statistical manifold was
CIQ (and thus - Q ) flat in detail, showing the existence of local conjugate coordinates ( ) and (n ) such that r. k = 0 in the -coordinates and its conjugate
i\ k = 0 in the -coordinates.
Further that potential functions () and () then exist such that
g ^ t) = E-Ej) g i j (n) = E Ej))
and the - and -coordinates then are related by the Legendre transform:
1 = E.(())

ni

= E.(())

() + () - i = 0.

In a sense g-flat families are geometrically equivalent to exponential families..


If N^ is a regular submanifold of (M_,g,D), the tensors g and D are
inherited in a simple way (by restriction). The -connections are inherited by
orthogonal projections on to the space of tangent vectors to _N, i.e. by the
equation
g(v Y,Z) = g(v Y,Z) for X,Y,Z X(N).

(3.10)

It follows from (3.10) that the -connections induced by the restriction of g


and D to 1(R) are equal to those obtained by projection (3.10). This consistency condition is rather important although it is so easily verified.
A submanifold is totally -geodesic (or just -geodesic) if

188

Steffen L. Lauritzen

v Y X(N) for all X,Y X ( N ) .


If the submanifold is -geodesic for all we say that it is geodesic. We then
note the following
3.12

Proposition A regular submanifold Jj is geodesic if and only if there

exist , j 2 such that R is ,-geodesic and ^-geodesic.


Proof: Let X,Y X(H) and Z T (N) 1 p N_.
Then N is .-geodesic, i=l, 2 iff
1

gfvj,z) p = g(v,z) = o

for all such X,Y,Z. But since

g(vY,Z) = g(vY,Z) - f D(X,Y,Z)


this happens if and only if D(X,Y,Z) = 0 for all such X,Y,Z, whereby N_ is geodesic iff it is .-geodesic, i=l,2.
In statistical language, geodesic (-geodesic) submanifolds will be
called geodesic (-geodesic) hypotheses. A central issue is the problem of
existence and construction of -geodesic and geodesic foliations of a statistical manifold.
A foliation of (M^g,D) is a partitioning
=

* &h

of the manifold into submanifolds N of fixed dimension n(<m). IL are called


the leaves of the foliation.
The foliation is said to be geodesic (or -geodesic) if the leaves
are all geodesic (or -geodesic).
It follows from Proposition 3.12 that geodesic foliations of full
exponential families (and of -flat families) are those that are affine both in
the canonical and in the mean value parameters, in other words precisely the
affine dual foliations studied by Barndorff-Nielsen and Blaesild (1983). In
the paper cited it is shown that existence of such foliations are intimately
tied to basic statistical properties related to independence of estimates and
ancillarity. Proposition 3.12 shows that the concept itself is entirely geo-

Statistical Manifolds

189

metric in its nature.


It seems reasonable to believe that the existence (locally as well
as globally) of foliations of statistical models could be quite informative. It
plays at least a role when discussing procedures to obtain estimates and ancillary statistics on a geometric basis.
Let N_ be a submanifold of M and suppose that pN[ is an estimate of
p, obtained assuming the model M. Amari (1982, 1985) discusses the -estimate
of p assuming N[ as follows.
To each point p of JV we associate an ancillary manifold A (p)
A (p) = exp (Tp(N)1)

where exp is the exponential map associated with the -connection and T (Nj~ is
the set of tangent vectors orthogonal to N at p. In general the exponential map
might not be defined on all T p (Nj~, but then let it be maximally defined,
p is then an -estimate of p, assuming N^ if
P A (p).
Amari (1985) shows that if M_ is -flat and N_ is --geodesic, then the -estimate
is uniquely determined and it minimizes a certain divergence function.
This suggest that it might be worthwhile studying procedures that
use the --estimate for -geodesic hypotheses N_, and call such a procedure
geometric estimation.

In general it seems that one should study the decomposi-

tion of the tangent spaces at pN[ as


T p (M) = T p ( N ) T p ( N
and especially the maps of these spaces onto itself induced by -parallel transport of vectors in T (N), - parallel transport of vectors in the complement,
both along closed curves in JY.
It should also be possible to define a teststatistic in geometric
terms by a suitable lifting of the manifold N_, see also Amari (1985). Things
are especially simple in the case where M^ has dimension 2 and N^ has dimension 1
and we shall try to play a bit with the above loose ideas in some of the
examples to come.

5.

THE UNIVARIATE GAUSSIAN MANIFOLD

Let us consider the family of normal distributions N(, ), i.e.


the family with densities
I p^l
-,
2
f(x;y,) = V2
exp{ * (x-) },IR,>0
w.r.t. Lebesgue measure on IR. This manifold has been studied as a Riemannian
manifold by Atkinson and Mitchell (1981), Skovgaard (1984) and, as a statistical
manifold in some detail by Amari (1982, 1985).

Working in the (,) parametri-

zation we obtain the following expressions for the metric, the -connections
and the D-tensor (skewness) expressed as T... (cf. Amari, 1985).
. 1 /I 0\

"7lo 2)
lll

122

-,

212

12

= -(1+cO/ 3

22

222
T

^ =
-<

= 2n

221

21

]21

1 1 2 = (l-)/3

= T

122 " T212

= T

121 = T 2

2/

221 "
T

222

The -curvature tensor is given by


R 1212 = (l- 2 )/ 4 ,
190

8 /

Statistical Manifolds

191

so the manifold is conjugate symmetric, and the scalar (sectional) curvature by

For = 0 (the Riemannian case) we have K(,p) = -1/2 and the manifold is the
space of constant negative curvature (Poincar's haifplane or hyperbolic space).
Note that it also has constant -curvature for all although nobody knows what
that implies, since such objects have r\e\/er been studied previously.
To find all -geodesic submanifolds of dimension 1 we proceed as
follows. Let (e,E) denote the tangent vector fields
e = -2-

E = -?L

8 *

If we have = Q constant on N_, XjNj is spanned by E. Since rl2 o = 0 we have

VpE = f E for all ,


and thus that the submanifolds
N

= {(,) |=Q},oIR

are geodesic submanifolds and the family


(N ,IR)

(4.1)

constitutes a geodesic foliation of the Gaussian manifold.


If is non-constant on NU we must be able to parametrize N[ locally
as
(t,(t)), tl S I R .
The tangent space to H is then spanned by
N = e + E

where we have let (t) = + (t) and extended to a function defined on all of
the manifold by (x,y):= (x).

ot

where we have used torsion freeness and the fact that e() = , E() = 0. Using

k we get
now the expressions for r..,
N

192

Steffen L. Lauritzen

If this again has to be in the direction of N, we must have


1+ 2
-

1- . l+2 2
= -x +

do

which by multiplication with 2 reduces to the differential equation


2 + 2 2 = (-1)
This is most conveniently solved by letting u = , whereby = 2 + 2 and
the equation becomes as simple as
= -1 +-> u(t) = ^(-l)t2 + Bt + C,

(4.3)

such that the -geodesic submanifolds are either straight lines ( = 1) or


p

parabolas in the (y, )-parametrisation.


The special case = 1, B = 0 corresponds to the manifolds
|( = {(y,) |= Q } 5 Q IR +
that give a 1-geodesic foliation.
Another special case is the submanifolds of constant variation
coefficient
V^ = {(y,)|=y},IR+
p
that we now see are -geodesic if and only if = l+2 by inserting into (4.3).
V are now connected submanifolds but is composed by two non-connected submani

folds V+ and V "


V + = {(,)|>0>nV

V " = {(y ,) |y>0}V .

The (V ,V ") manifolds do not represent -geodesic foliations since they are
not -geodesic for the same value of . For = 0 we see that the geodesic sub2
2
manifolds are parabola's in (y, ) with coefficient -k to y , a result also
obtained by Atkinson and Mitchell (1981) and Skovgaard (1984).
Consider now the hypothesis (y,) V , i.e. that of constant variation coefficient. We shall illustrate the idea of geodesic estimation in this
example as described at the end of section 3.
2
V is =l+2 geodesic. The ancillary manifolds to be considered
are then --geodesic manifolds orthogonal to V .
An arbitrary --submanifold is the "parabola"

Statistical Manifolds

193

= (-(l+2)y2+B+C)!5
p
which follows from (4.3) with = -(l+2 ). Its tangent vector is equal to
[-2(l+2)y+B]E+e.

e+E = ^

The tangent vector of the hypothesis is


e+E.
They are at right angles at d J o' n^ ^

anc

1+J: [-2(l+2)y0+B]=0 ~ B=(l+2 2 )y 0 .


The ancillary manifold intersects at (Q,Q) if and only if
-(l+ 2 )y 2 +(l+2 2 )y 2 +C= 2 2 *"*c = 0
p
The -(l+2 )-geodesic ancillary manifolds are thus given as
W^ = {(t,(t))|tl }, yIfK{0}
(WQ = {(0,)|IR+>)
where 2 (t) = -(l+ 2 )t 2 + (l+2 2 )yt and

=
y

i]it?_,O[
V

1f <0.

1+Y

2
The manifolds W , ylR actually constitute a -(l+2 ) -foliation of the Gaussian
manifold.

To see this, let (x,s) be an arbitrary point in M_. If we try to

solve the equation


(,s 2 ) = (t,-(l+ 2 )t 2 +(l+2 2 )t)
we obtain exactly one solution for x=f, given as

(l+ 2 )x 2 +S 2

i.e. a linear combination of x and

(l+ 2 )x+ 2

y, as determined by (4.4) is the geometric estimate of y, when x


and s denote the empirical mean and standard deviation of a sample x,,...,x .
It is by construction (see Amari (1982)) consistent and first-order efficient.

194

Steffen L. Lauritzen

A picture of the situation is given below in three different parametrizations:


2
-2
(>), (, ) , and (, ) :

-2

Fig. : Geometric estimation with constant coefficient of variation, (y,)param.

Fig. 2: Geometric estimation, (y, )-param.

Statistical Manifolds

195

Fig. 3: Geometric estimation, (y,-) param.


To obtain a geometric ancillary and test-statistic we proceed as follows:
We take a system of vectors on the hypotheses whose directions are
2

-(l+2 ) -parallel and whose lengths are equal to one. Further they are to be
orthogonal to the hypothesis (and thus tangent to the estimation manifolds).
The directions should thus be given as

v = (v v 2 ) = -e + j^ E
To obtain unit length, we get ||v|

V 2
when =, and our orthogonal field is thus
V(y) [v^J.

a[-,i]

where a = (2 /(2 +1)) . To find the exponential map


-(l+2 2 ))
tV.()} = (f(t,),(t,))

we shall solve the equations

196

Steffen L. Lauritzen

(t,) = -(H )f(t,)


df

+ (l+2 )f (t ,)

(4.5)

[ ' ) = -a and f(O,) =

f=2f I (i+) ++ f = -2/2f

(4.6)

since only the speed of the geodesic has to be determined.

(4.6) is easily seen

to be equivalent to
2
f = K"
Inserting (4.5) into this we obtain

for some KfO.

(4.7)
2

f = K(-(HV + (l+22)f2

and separation of variables yield


2
2
Z[-(1+Y
)u 2 + (l+2 2 )u] 2 du = Kt+C

Substituting v=u/ we get


2

G(*2-E-*-) = Kt+C

\4.o)

where G(x) = /J [-(l+ 2 )v 2 +(l+2 2 )v] 22 dv.


Using the initial condition f(0,)= we get
C=42+1G(D
and the condition f(0,) = -ay yields together with (4.7)
K = 42 (O,)(-a) = - a 4 Y V 2 + 1 ,
whereby

4 2 + 1 G{fl)

__ _ a 4 2 4 2 + l t + y +

Y
and dividing by 4v +1 yields thus

()

6(1)

and therefore f(t,) = h(t) where


2
h(t) = G ' V a ^ t + G(D).
Inserting this into (4.5) yields
(t.) = /-(l+ 2 )h(t) 2 +(H2 2 )h(t)
which is linear in . If we now interpret points of same "distance" from the

Statistical Manifolds

197

hypothesis as those where t is fixed and only y varying, we see that s/x is in
one-to-one correspondence with t. We shall therefore say that s/x is the
geometric ancillary and this it also is the geometric test statistic for the
hypothesis =.
It is of course interesting, although not surprising, that this
test statistic (ancillary) is obtained solely by geometric arguments but still
equal to the "natural" when considering the transformation structure of the
model.

6. THE INVERSE GAUSSIAN MANIFOLD

Consider the family of inverse Gaussian densities


X

f(;x.*>-

V 3 / 2 , ,>o

w.r.t. Lebesgue measure on IR + . We choose to study this manifold in the parametrization (n,), where
n =x "

, i.e.

f(x n.) =

The metric tensor and the skewness tensor can now be calculated either by using
their definition directly or by calculating these in the (,) coordinates and
using transformation rules of tensors. We get

1
g=

and T 1 1 2 = 0 ,

ni

2 2 2

The Riemannian connection is now determined by

in

= -O+)/(2n 3 ), Tm

2 2 ] = (l-o)/(2 2 ), ] 2 2 =
222

2 1 2

= (3-l)/(2 2 n)

Multiplying with the inverse metric we get


198

= -(l+)/(2 2 )

S t a t i s t i c a l Manifolds

T\ = -(l+)/

199

f, = j 2 - ^ = 0

2 2 = (3-l)/2.
To find all geodesic submanifolds of dimension one we first notice
2

that since r,, = 0, the manifolds


- Q

are -geodesic for all , i.e. geodesic and they constitute a geodesic foliation
of the inverse Gaussian manifold. Because
$ X = " 1
they correspond to hypotheses of constant expectation.
Consider now a submanifold of the form (n(t),t), i.e. with tangent
N given as
N = e + E, where e = ^ , E = .
We extend by letting (x,y): = n(y), i.e. such that e() = 0, E() = . Then

e +

2 V

+ V

/'*
1+
1+ 2
l-\
l - \ , / 1+
1+
,, 3 - l \
jE
= ( " + ) )
+ ((- +
We now have V..N = hN i f f

r
[

1+

, 3-l

ir

] = [

1+

which reduces to the differential equation

3-l

"1

-1

This is first solved for = i :

(t) = - I t log t + C,t + C2

, 1-

~r

200

Steffen L. Lauritzen

l-3

For j -j we get by letting u = t


equation

=(-l)t

that u satisfies the differential

.^2

_^

Whereby

n(t) - %
For =l (the exponential connections) we get the parabolas:
n(t) = Bt 2 + C
and for =-l (the mixture connection) we get the curves:
(t) = -t + B/t + C.
In the Riemannian case (=0) we get
n(t) = -2t + B / T + C
that are parabolas in (/~,).
The curvature tensor is given by

The manifold is thus conjugate symmetric (we already know, since it is an exponential family) and the sectional curvature is
( 12 ) = -R 1 2 1 2 /(g 1 1 g 2 2)

-O- 2 )/2

Note that the Riemannian curvature (=0) is again constant equal to -h9 as in
the Gaussian case. In fact the -curvature is exactly as in the Gaussian case.
We can map the inverse Gaussian manifold to the Gaussian by letting
V = J2Q

= /2

and this map is a Riemannian isometry. However, it does not preserve the skewness tensor and thus the Gaussian and inverse Gaussian manifolds do not seem to
be isomorphic as statistical manifolds, although they are as Riemannian manifolds.
Corresponding to the hypothesis of constant coefficient of variation, we shall investigate the submanifold corresponding to the exponential

Statistical Manifolds

201

transformation model /" = , fixed, i.e.

^y

>0

which in the (,)-parametrization is a straight line through the origin (as


const, coeff. of var.)
{ = } = V
This submanifold is -geodesic if and only if
=

2(-i:

1 ^ ~ - 2+37'

The tangent space to V is spanned by e + E, and the orthogonal --geodesic


submanifolds are given by solving the equations
l-3

(5.1)
to get the intersecting point and orthogonality at ( ^ ~ ' &,?) gives
3+l

l-9

'

Combining this with (5.1) we get C=0, i.e. the estimation manifolds are given as
3+
t+\

2(+) +

1-3
.

l-9
The manifolds W^, >0 again constitute a --foliation of the inverse Gaussian
D

manifold as is seen by solving the equations

which gives t= Q , and


3-l
=

"

"0 0
2

9 -1 Q
8 Q j

2
3+l

This again determines a geometric estimate 8 of from a sample x-,,...,x


the inverse Gaussian distribution, and this is obtained by letting

from

202

Steffen

^*

L. L a u r i t z e n

]/

0= n"x 1 " * '

and inserting = (2+)/(2+3) into the expression given above.

7.

THE GAMMA MANIFOLD

Consider the family of gamma densities


3

f(x;,e) = (3/y) x "Vr(3) exp{- ^ }

>0, 3>0

w.r.t. Lebesgue measure on IR + . The metric tensor is obtained by direct calculation in the (y,)-parametrization as

"T

g-

0 (3)

where (e) = D log r(&) - 1/3.


The Riemannian connection is now obtained by
^ijk = ^ 3 i 9 j k

+ 3

j 9 ik " 3 k g ij-' t 0

be

f 1 ] = -3/ 3 ; f 1 1 2 = - l/(2 2 ); f 1 2 1 = f 2 1 1 = l/(22)

222 = J 5 ' ( ) ' 221= 122= 212= '


1
Similarly we calculate r... by the formula
1

ijk = t ( E 1 E j ( ) E k ( ) )
rni

= -23/3

122

t0 be

r 1 2 1 = 1/2
1

212

222

221 =

1
and the skewness tensor T\ ... = 2 ( r ^ - k - r . . k )
T

2
T

221

121 = T 211 = " 1 / p

122 = T 212 =
203

222

'

204

Steffen L. Lauritzen

whereby the -connections are determined to be

lll

2= 7 T
<^y

i.

=r
= l^1
2
^11 9 2

r
= il2
*222
2

2y

122 212 221 = '


Multiplying by the inverse metric we get

!i= -
[

23

22

and all other symbols equal to zero.


The curvature is by direct calculation found to be
1212

4 ()

The space is conjugate symmetric and therefore the curvature tensor is fully
determined by the sectional (scalar) curvature which is
K

< ) = -R

g 1 g 2 2 = 1-2 [()+'()3

Note that this is even for =0 different from the two previous examples in that
the curvature is non-constant and truly dependent on the shape parameter 3.
To find all geodesic submanifolds we proceed as follows:
If =un is constant on
N,
X(N)
is spanned by the tangent vector E corresponding
to differentiation w.r.t. the second coordinate. Since

these submanifolds are geodesic for all values of and constitute a geodesic
foliation of the gamma manifold.
Considering the manifold given by 3=3Q> its tangent space is spanned by e and since

Statistical Manifolds

205

these are -geodesic if and only if =l.


In general let us consider a hypothesis (submanifold) of the type
(f(t),t).

Its tangent vector is


f e + E and e(f) = 0, E(f) = f

we have
vj e + E (f e + E) = f 2 v e e = 2fv e E + f e + v E E

. J+SL+ f U2L+ f ] e+ [f

= L

ll_+ J ^
2

( )

()

If we now let =t =f and multiply the coefficient to E by f we obtain the


equation

which unfortunately does not seem soluble in general. For =l the solutions are
f(t) = t/(At+B).

8.

TWO SPECIAL MANIFOLDS

In the present section we shall see that things are not always as
simple as the previous examples suggest, but even then we seem to be able to get
some understanding from geometric considerations.
First we should like to notice that when we combine two experiments
independently with the same parameter space, both the Fisher information metric
and the skewness tensors are additive. Let X^P Q Y^P and let A., B. denote the
derivative of the two log-likelihood functions
A

i= dr1og

f (x;)

B =

i drj" 1 o g g(y;)

Then the skewness tensor is to be calculated as

= EAjAjAk

EB.B.B,

since all terms containing both A's and B's vanish due to the independence and
the fact that EA = EB. = 0.

If we now let X^N(, ), Y^N(,l) and X and Y independent we get


by adding the information and skewness tensors that in the (,)-parametrization

and that, as in the Gaussian manifold, we have


T

lll T 122 " T 212 T22 =

2 = ^

222 =8 / 3 '

Since derivatives of the metric are as in the Gaussian case, so are the i\ ...symbols:
206

Statistical Manifolds

122

212

121

207

a
221

3
222
2 2 2 =-2(l+2)/ .

-(1+]

But the -connections are truly different which is seen by looking at the r..symbols:
^ = (l-)/(2+3)

]2 = ^

= -(l+)/

2 2 = -2(l+2)/(2+3)
and all others equal to zero. Considering now the curvature tensor we get
K

1212

4 ,(2+
9 + 2x )

2112

and this is clearly different from R-|212 w ^ e r e ^ t'S s P a c ^ is not conjugate


symmetric. The sectional curvature is not determining the curvature tensor be1
cause e.g. R-ioi?^ ^ u t t'e s P a c e 1 S n o t 1 -flat since
RK
= -R
= - M + ) [2(l-) + 2 (2 + )] _K
K
1221
1212 {l a>
4, 9 . 2
~ " 2121
a

From standard properties of the curvature tensor we have R...

= 0, but we

obtain by direct calculation that

1211

= R

2111

= R

1222 = R 2122 = s

such that the above components are the only ones that are not vanishing.
If we try to find the geodesic submanifolds we first observe that
l
because r 0 = 0 for all . the submanifolds

= {(,)|=0)

are totally geodesic for all , and thus constitute a geodesic foliation of the
manifold.

Following the remarks at the end of section 4, relating geodesic

foliations to the affine dual foliations of Barndorff-Nielsen and Blaesild


(1983), it is of interest to know that also in this example, the maximum likelip

hood estimates of and are independent as expected from the foliation. We


shall now proceed to find the remaining geodesic manifolds.
If we consider manifolds of the type (t,f(t)) with tangent vector

208

Steffen L. Laritzen

e + f E we get

,a

o+f
F ( e + fE )= v e e+ 2 f v ^ E +
+ fE
e+f E
e f VCE E
= _ 2 ( 1 + a ) e + ( -a _ 2(l+2a) 'fZ^
f
2+ 3 2+ 3

Multiplying the coefficient to e with f and inserting =f we get the equation

Multiplying on both sides with f(2+f ) and collecting terms gives


2f 2 f 2 (l+) + 2ff + ff 3 + 2f 2 = -1
and this does not seem to have a particularly nice solution.
Note that f(t) and t is not a solution since then f= f=0 and we
obtain the equation for :
2 4 t 2 (l+) + 2 2 = -1
which can only hold when = -1 and then we get
2 2 = -2
which is impossible.
In this example the "constant coefficient of variation" does also
not have any simple group transformational properties.
It seems then of interest to see what happens if we consider the
model with X^N(y, ), Y^N(log ,l) which is related to the example just considered but where the "constant coefficient of variation" js^ transformational. The
model is also transformational itself (the affine group). By the same argument
as before the skewness tensor becomes identical to that of the univariate Gaussian manifold. The metric, however, becomes

i 2

whereby we calculate the Riemannian connection to be


112

= 1/3

? 2 2 2 = -3/

f 2 1 1 = f 1 2 1 = -1/ 3
f

n l

= 1 2 2 = 2 1 2 = 221 = 0.

Statistical Manifolds

209

The -connections are


r 1 2 2 = d-j/ 3

??9

r121 = r211 =

^
=

("3-4)/

= 1 9 9 = 9 1 9 = 9 9 = 0,

or in the r. .-symbols:
^ = (l-)/3

^ = ^ =

\2 = -(3+4)/3.
The curvature tensor can be calculated to be
2
- (l-)(3+)
K
1212 "
4

S
_
1221

(U)(3-)
4

So we do indeed again have a manifold that is not conjugate symmetric. All


other components are again vanishing apart from R o n ? ' R 212V

^e s P a c e

S n o t

flat for any value of .


Considering the problem of finding all geodesic submanifolds we have
the same situation as earlier in that
N^

= {(,)|=0}

together constitute a foliation that is geodesic for all values of , again in


accordance with the independence of and .
Consider now a submanifold of the type [t,f(t)] with tangent
e + f E. We get

ve+fE
^r(e+fE) = vee + 2fveE + f v_E
E + fE
#

. _ (1+)e+ [^L . 34f2+ ]


Multiplying the coefficient to e by f and everything by 3f and reducing, we
obtain the following differential equation:
(3+2)f2 + 3ff = -1 .
For =0 (the Riemannian case), we get
2

3f + 3ff = -1.

210

Steffen L. Lauritzen

Letting f = f, u = f we obtain as in the Gaussian case the equation


2

= - I -* u = - -jt + At + B,
2
i.e. again parabolas in the (y, ) parametrization but with a different coefficient to t 2 .
Note that, in fact, considered as a Riemannian manifold there is no
essential difference between this and the univariate Gaussian manifold, since
we have constant scalar Riemannian curvature equal to
-

--i.

i.e. again a hyperbolic space.


If I {l,--^} the following special parabolas are solutions:

2 = Q2 is 1-geodesic.

0 9

9T^

'
+

B + B -^

,B

For = -3/2 no^ parabolas are geodesic. The equation

then reduces to
f --j_ -1 ,
ff
f

the general solution to which cannot be obtained in a closed form.


If we consider the transformation submodel of "constant coefficient
of variation" = corresponding to f(t)= t, we get the equation
(3+2)2 + 0 = -1.
Solving this for we find the following peculiarity:
= (3 2 +l)/(l-2 2 )

if 2 ^

but if = /2/2, the equation has no solution!! In other words, all "constant
variation coefficient submanifolds" of the manifold studies are -geodesic for
2
suitably chosen except one ( = h).
A reasonable explanation for this is at present beyond my imagination. Is there a missing connection (=~)? Have I made a mistake in the calculations? Or is it just due to the fact that the phenomenon is related to how
this model is a submodel of the strange two-dimensional model. In any case,
there is a remarkable disharmony between the group structure and the geometry.

Statistical Manifolds

211

To go a bit further we consider the three-dimensional manifold


((,,)-parametrized) obtained from considering X ^ N(, ), Y ^ N(,l). The
metric for this becomes
0
g=
0

and the skewness-tensor and the -connections are identical to the Gaussian
case when only indices 1 and 2 appear and all involving the third coordinate
are equal to zero. Letting (e,E,F) denote the basis vectors for the tangent
space determined by coordinatewise differentiation, we consider now the "constant coefficient of variation" submanifold:
(t,t, log t ) , t IR + }
with tangent-vector N = e + E + xF, and we get
V

NN

2V e E
Inserting the expressions for the -connections we obtain
+
N = . 2 ^ te - ( 2t

v N

F
t)E - .2
t

If this derivative shall be in N's direction we must have

but also
2

= g l + (l+2) - 2 = which is impossible. We conclude thereby that this transformational model is


not -geodesic for any , considered as a submodel of the full exponential
model.

9.

DISCUSSION AND UNSOLVED PROBLEMS

The present paper seems to raise more questions than it answers.


We want to conclude by pointing out some of these, thereby hoping to stimulate
research in the area.
1. How much structure of a statistical model is captured by its
"statistical manifold", the manifold being defined through expected geometries
as by Amari, minimum contrast geometries as by Eguchi or observed geometries as
by Barndorff-Nielsen? On the surface it looks as if only structures up to
third order are there and as if one should include symmetric tensors of higher
order to capture more.
2. Some statistical manifolds (ML ,g-. ,D-,) and (Mo,g 2 ,D 2 )

are

"alike", locally as well as globally. Various types of alikeness seems to be of


some interest. Of course the full isomorphism, i.e. maps from M., to NL that
preserves both the Riemannian metric and the skewness tensor. But also maps
that preserve some structure, but not all could be of interest, in analogy with
the notion of a conformal map in Riemannian geometry (maps that preserve angles,
i.e. the metric up to multiplication with a function). There are several possibilities here. Isometries that preserve the skewness tensor up to a scalar
or up to a function. Maps that preserve the metric up to scalars and/or functions and do and do not preserve skewness etc. etc.
3. In connection with the above there remains to be done a lot of
work on classification of statistical manifolds in a pure mathematical sense,
i.e. characterize manifolds up to various type of "conformal" equivalence,
"conformal" here taken in the senses described above. A classic result is that
212

Statistical Manifolds

213

two Riemannian manifolds are locally isomorphic if they have identical curvature
tensors. Do similar things hold for statistical manifolds and their -curvatures? Note that the inverse Gaussian and Gaussian manifolds seem to be alike
but not fully isomorphic. Results of Amari (1985) seem to indicate that -flat
families are yery similar to exponential families. Are they in some sense
equivalent? There might be many interesting things to be seen in this direction.
4. Some statistical manifolds seem to have special properties. As
mentioned above we have e.g. -flat families, but also manifolds that are
conjugate symmetric or manifolds with constant -curvatre both for a particular
and for all at the same time. Which maps preserve these properties? Can
they in some sense be classified?
5. How does the geometric structures behave when we form marginal
and conditional experiments? Some work has been done on this by BarndorffNielsen and Jupp (1984, 1985).
6. Is there a decomposition theory for statistical manifolds. We
have seen that there might be a connection between the existence of geodesic
foliations and independence of estimates. There might be a de Rham-like theory
to be discovered by studying parallel transports along closed curves in flat
manifolds?
7. Chentsov (1972) showed that the expected geometries were the
only ones that obeyed the axioms of a decision theoretic view of statistics, in
the case of finite sample spaces. It seems of interest to investigate generalizations of this result, both to more general spaces and to other foundational
frameworks. Picard (1935) has generalized the result to the case of exponential
families and has some results pertaining to the general case.
8. What insight can be gained by studying the difference between
observed and expected geometries?
9. How is the relation between the geometric structure of a Lietransformation group and the geometric structure of its transformational statis-

214

Steffen L. Lauritzen

tical models?
Other questions and problems are raised by Barndorff-Nielsen, Cox,
and Reid (1986) and in the book by Amari (1985).
Acknowledgements
The author is grateful to Ole Barndorff-Nielsen, Preben Blaesild,
and Erik Jtfrgensen for discussions relevant to this manuscript at various
stages.

REFERENCES

Amari, S.-I. (1982). Differential geometry of curved exponential families curvatures and information loss. Ann. Statist. ]_0, 357-385.
Amari, S.-I. (1985). Differential-Geometrical Methods in Statistics.

Lecture

Notes in Statistics Vol. 28, Springer Verlag. Berlin, Heidelberg.


Atkinson, C. and Mitchell, A. F. S. (1981). Rao's distance measure. Sankhya
A 41 345-365.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1983). Exponential models with
affine dual foliations. Ann. Statist. JJ_ 753-769.
Barndorff-Nielsen, 0. E., Cox, D. R.and Reid, N. (1986). The role of differential geometry in statistical theory.

Int. Statist. Rev, (to appear).

Barndorff-Nielsen, 0. E. and Jupp, P. E. (1984). Differential geometry, profile


likelihood and L-sufficiency.

Res. Rep. 113. Dept. Theor. Stat., Aarhus

University.
Barndorff-Nielsen, 0. E. and Jupp, P. E. (1985). Profile likelihood, marginal
likelihood and differential geometry of composite transformation models.
Res. Rep. 122. Dept. Theor. Stat., Aarhus University.
Boothby, W. S. (1975). An Introduction to Differentiate Manifolds and Riemannian Geometry, Academic Press.
Chentsov, N. N. (1972). Statistical Decision Rules and Optimal Conclusions (in
Russian) Nauka, Moscow. Translation in English (1982) by Amer. Math. Soc.
Rhode Island.
Efron, B. (1975). Defining the curvature of a statistical problem (with discussion). Ann. Statist. 3_ 1189-1242.
215

216

Steffen L. Lauritzen

Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a


curved exponential family. Ann. Statist. J 2 793-303.
Picard, D. (1985).

Invariance properties of the Fisher-Rao metric and Chentsov-

Amari connections using le Cam deficiency. Manuscript. Orsay, France.


Rao, C. R. (1945).

Information and the accuracy attainable in the estimation of

statistical parameters. Bull. Calcutta Math. Soc. 37 81-91.


Skovgaard, L. T. (1984). A Riemannian geometry of the multivariate normal
model. Scand. J. Statist. 11 211-223.
Spivak, M. (1970-75). Differential Geometry Vol. I-V.

Publish or Perish.

DIFFERENTIAL METRICS IN PROBABILITY SPACES


C. R. Rao*

1. Introduction

219

2. Jensen Difference and Entropy Differential Metric

222

3. The Quadratic Entropy

226

4. Metrics Based on Divergence Measures

228

5. Other Divergence Measures

231

6. Geodesic Distances

234

7. References

238

Department of Mathematics and Statistics, University of Pittsburgh,


Pittsburgh, PA

217

1.

INTRODUCTION

In an early paper (Rao, 1945), the author introduced a Riemannian


(quadratic differential) metric over the space of a parametric family of probability distributions and proposed the geodesic distance induced by the metric
as a measure of dissimilarity between probability distributions. The metric
was based on the Fisher information matrix and it arose in a natural way
through the concepts of statistical discrimination feee also Rao, 1949, 1954, 1973
pp. 329-332, 1982a). Such a choice of the quadratic differential metric, which
we will refer to as the information metric, has indeed some attractive properties such as invariance for transformation of the variables as well as the parameters. It also seems to provide an appropriate (informative) geometry on the
probability space for studying large sample properties of estimators of parameters in terms of simple loss functions as demonstrated by Amari (1982, 1983),
Cencov (1982), Efron (1975, 1982), Eguchi (1983, 1984), Kass (1981) and others.
Kass (1980, Ph.D. thesis) explores the possibility of using differential geometric ideas in statistical inference.
The geodesic distances based on the information metric have been
computed for a number of parametric family of distributions in recent papers by
Atkinson and Mitchell (1981), Burbea (1986), Kass (1981), Mitchell and
Krzanowski (1985), and Oiler and Cuadras (1985).
In two papers, Burbea and Rao (1982a, 1982b) gave some general
methods for constructing quadratic differential metrics on probability spaces,
of which the Fisher information metric belonged to a special class. In view of
the rich variety of possible metrics, it would be useful to lay down some
219

220

C. R.

Rao

criteria for the choice of an appropriate metric for a given problem. Amari has
stated that a metric should reflect the stochastic and statistical properties
of the family of probability distributions. In particular he emphasized the
invariance of the metric under transformations of the variables as well as the
v
parameters. Cencov (1972) shows that the Fisher information metric is unique
under some conditions including invariance. Burbea and Rao (1982a) showed that
the Fisher information metric is the only metric associated with invariant
divergence measures of the type introduced by Ciszar (1967). However, there
exist other types of invariant metrics as shown in Section 3 of this paper.
The choice of a metric naturally depends on a particular problem
under investigation, and invariance may or may not be relevant. For instance,
consider the space of multinomial distributions, = {(p,,...,p ): p. > 0,
p. = 1}, which is a submanifold of the positive orthant, X = {(x-.,...,x ):
x. > 0} of the Euclidean space R n .

A Riemannian metric on X automatically pro-

vides a metric on the submanifold .

In a study of linkage and selection of

gametes in a biological population, Shahshahani (1979) considered the metric


n X. 29
9
ds2 = I ^
d x
(1.1)
1 Xi
which induces the information metric on . This metric provided a convenient
framework for a discussion of certain biological problems. However, Nei (1978)
considered a distance measure associated with the Euclidean metric
ds 2 = dx 2

(1.2)

which he found to be more appropriate for evolutionary studies in biology. The


metric induced on by (1.2) is not the Fisher information metric. Rao (1982a,
1982b) has shown that a more general type of metric
a .dx.dx,

(1.3)

called the quadratic entropy is more meaningful in certain sociometric


and biometric studies.
The object of the present paper is to provide some general methods
of constructing Riemannian metrics on probability spaces, and discuss in

Differential Metrics in Probability Spaces

221

particular the metric generated by the quadratic entropy which is an ideal


measure of diversity (see Lau, 1985 and Rao, 1982b), and has properties similar
to the information metric, like invariance. We also give a list of geodesic
distances based on the information metric computed by various authors (Atkinson
and Mitchell, 1981; Burbea, 1986; Mitchell and Krzanowski, 1985; Oiler and
Cuadras, 1985 and Rao, 1945).
The basic approach adopted in the paper is first to define a measure
of divergence or dissimilarity between two probability measures, and then to use
it to derive a metric on M, the manifold of parameters, by considering two
distributions defined by two contiguous points in M. We thus provide a method
for the construction of an appropriate geometry or geometries on the parameter
space for discussion of practical problems. Some divergence measures may be
more appropriate for discussing properties of estimators using simple loss
functions while others may be appropriate in the study of population dynamics in
biology.

It is not unusual in practice to study a problem under different

models for observed data to examine consistency and robustness of results. The
variety of metrics reported in the paper would be of some use in this direction.

2. JENSEN DIFFERENCE AND ENTROPY DIFFERENTIAL METRIC

Let v be a -finite additive measure defined on a -algebra of


subsets of a measurable space X^, and P^ be the usual Lebesgue space of v measurable density functions,
P_= p(x): P(x) > 0, xeX,, | p(x)dv(x) = 1} .

(2.1)

We call H: P+R an entropy (functional) on P^ if


(i) H(p) = 0 when p is degenerate,
(ii) H(p) is concave on P_.
In such a case, with > 0 , > 0, + = 1, Rao (1982a) defined the Jensen
difference between p and qP^ as
J(,; p,q) = H(p + q) - H(p) - H(q) .

(2.2)

The function J: P^ * P+R is non-negative and vanishes if p = q (iff p = q when


H is strictly concave).

If the entropy function H is regarded as a measure of

diversity within a population, then the Jensen difference J can be interpreted


as a measure of diversity (or dissimilarity) between two populations. For the
use of Jensen difference in the measurement, apportionment and analysis of diversity between populations, the reader is referred to Rao (1982a, 1982b).
Let us now consider a subset of probability densities characterized
by a vector parameter
P = {p(x,): p(x,)P, M, a manifold in R n }
D

and assume that p(x,) is a smooth function admitting derivatives of a certain


order with respect to and differention under the integral sign. For convenience of notation, we write

222

Differential Metrics in Probability Spaces


P( ,) = pfi, H() = H(p ), H(,) = H(p

223

+ vp.)

J(,) = H(,) - H() - yH()


where ,M.

(2.3)

Putting = + d and denoting the i-th component of a vector

with a subscript i, we consider the formal expansion of J(,+d),

nn ^2,/^ ,_

nnn ,,3,/

,_%

= 2T g"j()d1dJ. + 37 c".jk()did..dk+...

(2.4)

In (2.4), the coefficients of the first order differentials vanish since J(,)
has a minimum at = , and the notation such as 3 J(,=)/3.3. is used for
replacing by after carrying out the indicated differentiations.
From the definition of the J function, it follows that the (g..) is
a non-negative definite matrix and obeys the tensorial law under transformation
of parameters. We define the matrix and the associated differential metric
|_j

II

(g..) and g.d d.

(2.5)

as the H-entropy information matrix and H-entropy differential metric respectively.

We prove the following theorem which provides an alternative computa-

tion of the H-information matrix directly from a given entropy H.


Theorem 2.1
(2.6)
1J

Proof:

r j

By definition
2

H.() = 9 J(,=)
=

3 2 H(,=) _ ^ 3 2 H(=)
3 3
3 3

^2 7)

Since J(,) attains a minimum at =


3H(,=) _
3H()

3^
^TJ
J
Differentiating both sides of (2.8) with respect to . we have
32H(,=)
3.3.
1

32H(,=) _
3.3.
i

3 2 H()
3.3.
1

,9 ftx
8
^

(2

(9
K

x
}

224

C. R. Rao

which gives (2.6), and the desired result is proved.


Let us consider a general entropy function of the type

H(P) = - j h(p)dv(x)

(2.10)

where h" 5 the second derivative of h, is a non-negative function. Then using


(2-6)

H , *_ h , _
gtJ

32H(,=)

9()

"

Ti \j
T

3 h(p +p )

3.3.
' J

dv(x)

= u [h"(pj ^ ^ p *
J
3. 3 j
If h(x) = x log x, leading to Shannon's entropy, then
(2 1 2 )

become the elements of Fisher's information matrix.

If h(x) = (-l)~ (x -x),

j 1, we have the -order entropy of Havrda and Charvat (1967) and


3 loq p

3 loq p

(2.13)

which provide the elements of -order entropy information matrix, and the
corresponding differential metric given in Burbea and Rao (1982a, 1982b).
We prove Theorem 2.2 which gives alternative expressions for the
coefficients of the third order differentials in the expansion of J(,).
Theorem 2.2.
C

r 33H(,=) + 33H(,=) + 33H(,=)j

(^

Proof: By definition
c

33J(,=)

() =

3 H(,=) _

From (2.9), writing i = j and j = k we have

3 H()

(2

Differential

Metrics in Probability

3 H(,=)
93

225

9 H(,=)
33

Spaces

3 H()
98

Differentiating with respect to .


3

3 H(,=) + 3 H(,=) + 3 H(,=) + 3 H(,=) =


3.3 3.
3. 3 -3.
3.3.3^
3.3 3.
which gives (2.14) as equivalent to (2.15).
Let H be Shannon's entropy.

3 H()
3.3-3.

This proves Theorem 2.2.

Then, an easy computation gives

c i j k - x[r{]j[+ (l-x)T ]+ i\l\+ (l. )T 1jk ]+ [r{J]+ d-.)T 1 j k ]}


(2.16)
where

2
m

ijk

a log pfl 3 log p


3.3.
3.J * ijk C V
1

3 log p 3 log p 3 log p


3.
3.
3, ; '

(2.17)
Adopting the notation of Amari for -connexion
()
= ( D+ I Z T
1

ij k

S'jk

2 'ijk

the expression (2.16) can be w r i t t e n

When = = 2*, (2.18) becomes


. 1 r (0)+
C

ijk

ijk

(0)+

jki

(0)-.

ikj

f 2 9

19

Remark 1. In the definition of the Jensen difference (2.2), we


used apriori probabilities and for the two probability distributions p and
q which have some relevance in population studies.

But in problems of statis-

tical inference, a symmetric version may be used by taking = = ^

3.

THE QUADRATIC ENTROPY

The quadratic entropy was introduced in Rao (1982a) as a general


measure of diversity of a probability distribution over any measurable space.
It is defined as a function Q:
Q(p) =

P+R
K(x,y)p(x)p(y)dv(x)dv(y)

(3.1)

where K(x,y) is symmetric, non-negative and conditionally negative definite,


i.e.,

nn

II K(x x )a a < 0
11

for any choice of (x, ,...,x ) and of (a-,,...,a ) such that a-.+...+a = 0, with
the further condition K(x,y) = 0 if x = y.

It was shown in Rao (1982b, 1984)

that the quadratic entropy is concave over P^ and its Jensen difference has
nice convexity properties which makes it an ideal measure of diversity.

In

view of its usefulness in statistical applications, we give explicit expressions


for the quadratic differential metric and the connection coefficients associated
with the quadratic entropy, in the case of the parametric family P_.
u

From Theorem 2.1, the (i,j)-th element of the Q-information matrix


is

ij^'

3 Q(p o + p
3i3J.

Observing that
Q(p + p j = I K(x,y)[p(x,)+p(x,)][p(y,)+p(y,)]dv(x)dv(y),

we find the explicit expression for (3.2) as

226

(3.2)

Differential Metrics in Probability Spaces

= -2 J K(x,y) M

dv(x)8v(y)

227

(3.3)

Using the expression (2.14), we find on carrying out the necessary computations

where
8p(x>]
X
ijk = IMK (x
' yy); 3,J
k

ij

(3.4)

It is of interest to note that the expressions (3.3) and (3.4) are invariant for
transformations of both the parameters and variables.
For further properties of quadratic entropies, the reader is referred to Lau (1984) and Rao (1984).

4.

METRICS BASED ON DIVERGENCE MEASURES

Burbea and Rao (1982a, 1982b), Burbea (1986) and Eguchi (1984)
have considered metrics arising out of a variety of divergence measures between
probability distributions.

A typical divergence measure is of the form

D F (p ,p ) = j F[p(x,),p(x,)]dv(x)

(4.1)

where F satisfies the following conditions:


(i)

F( , ) is a C -function of R + x R +

(ii)

F(x, ) is strictly convex on R + for every x R + ,

(iii) F(x,x) = 0 for every x R + 5


(iv)

3F(x

= x

^ = 0 for every x R+.

Let us consider the expansion


k

and obtain explicit expressions for g. and c...


Theorem 4.1.

Let

- 9 2 F(x,y) F
_ 3 2 F(x,y) rF
_ 32F(x,y)
11 ' 3 2
' 2 " 3x3y
' 22 " 3 y 2

3y

Then

'i
228

...

(4.2)

Differential Metrics in Probability Spaces

9P
F

3P

3P

3P

2 2 [ P ' P - " - 3 . 9 J . ^ + 3 i 8 | < T

9P
+

229

3P

3j.3k " 8 T ] d v ( x ) *

The results are established by straight forward computations.


Let us consider the directed divergence measure of Csiszar (1967),
which plays an important role in problems of statistical inference,
D ( p > P ) = j p(x,) f(H|^||) dv(x)
where f is a convex function.

(4.3)

In this case
g
g

ii j

aj

(4.4)
where g. are the elements of Fisher's information matrix. Thus a wide class
of invariant divergence measures provide the same informative geometry on the
parameter manifold.

However, the c. .. coefficients may depend on the particular

convex function f chosen as shown below.


f

I r, \ _

9D

where v).) and T.., are as defined in (2.17).


The results (4.4) and (4.5) have consequences in estimation theory,
specially in the study of second order efficiency. While a large number of
estimation procedures lead to first order efficient estimates (i.e., having the
same asymptotic variance based on the elements of Fisher information matrix),
they are distinguishable by different second order efficiencies of the derived
estimators (see Rao, 1962).
If f is a convex function, then

f*(u) = u f

230

C. R. Rao

is also convex, and the measure (4.3) associated with f+f* is

D*(P,P) = { [pf(j) + Pf(^)]dv(x)

(4.6)

which is symmetric in and . However, we may define (4.5) as a symmetric


divergence measure without requiring f to be a convex function but satisfying
the condition that xf(x" ) +f(x) is non-negative on R + . In such a case
) = 2f(i)gij(e)

4;() = 2f([r|J + r + rj(;j]+ 3f" ). j k


Remarks on Sections 2, 3 and 4. As pointed out by a referee, a unified treatment of the results in these three sections is possible by considering a general
dissimilarity measure D : p x p -> {0y} satisfying
(a) D(p Q ,p.) is a c function of ,,
u

(b) D(p,p) = 0 for every p x p,


Then putting
i

and differentiating D .I

jk

" 3.3j9k

e t c # 9

= 0 yields

jj -

giving expressions for g.. and c... for a general D. However, the approach
IJ

1 JK

adopted in the paper enabled a discussion of the construction of the distance


measures D through more basic functions like quadratic entropy, general entropy,
cross entropy, and divergence between probability measures. The results expressed in terms of the basic functions are of some interest.
It is also possible to regard the dissimilarity measures of
Section 3 and 4 as having the common form
D(p,q) = x F(p(x),q(x),p(y),q(y))dv(x,y)
where v i s a symmetric measure on X x X_. However, the expressions for g.. and
c... are not simple.

5.

OTHER DIVERGENCE MEASURES

In the last section, we considered the f-divergence measure which


led to the Fisher information metric. A special case of this measure is the
city block distance, or the overlap distance (see Rao, 1948, 1982a),
D Q (p ,p ) = j |p(x,)-p(x,)|dv(x)

(5.1)

obtained by choosing f(x) = l-min(x,l), which admits a direct interpretation in


terms of errors of classification in discrimination problems. However, this is
not a smooth function and no formula of the type (4.7) is available to determine the coefficients of the differential metric.

But in some cases, it may

turn out that


D

o ( f V V = Do(')

is a smooth function of and in which case


8 2 D n (,=))
(5

2)

In the case when p(x,) is a p-variate normal density with mean y and fixed
variance covariance matrix , the coefficient (5.2) can be easily computed to be
proportional to 1 J , the (isj)-th element of " , which is indeed the (i,j)-th
element of the Fisher information matrix.

The same result holds for any ellip-

tical family, as then D Q (,) is a function of the Mahalanobis distance between


and (see Mitchell and Krzanowski, 1985).
Let p(x,) be the density of a uniform distribution in the interval
[0,].

Then it is seen that

231

232

C. R.

Rao

D Q (,) = 2(1 - i) if <


= 2(1 - f) if > .

(5.3)

Although this is not a differentiate function, it is seen that


2

is the metric associated with (5.3).


Another general divergence measure which has some practical
applications is

which is indeed a smooth function if is so. In this case

Another measure of interest is the cross entropy introduced in Rao


and Nayak (1985).

If H is any entropy function, then the cross entropy of p

with respect to p. was defined as


H[p +(p -p )]-H(p )
D(P , P ) = H(p ) - H(p ) - urn 4
^
*- .

(5.4)

Let

H(p) = - j h(p)dv(x)
as chosen in (2.10). Then (5.4) reduces to

D(p ,p ) = - j h( P )dv(x) - J h'(p)(p-p)dv(x) + I h(p)dv(x) .


Then

which is the same as the h-entropy information matrix derived in (2.10), apart
from a constant.

Similarly

Differential Metrics in Probability Spaces


h
ijk

(1)
(1) + (1)
+

ijk ikj
jki
'ijk

where
m

3 log P A 9 log p

3 log p 8 log p 9 log p

233

6. GEODESIC DISTANCES

In Rao (1945) it was suggested that the information metric could be


used to obtain the geodesic distances between probability distributions. Given
any quadratic differential metric
ds 2 = g.,()d.d.
'j

(6.1)

where the matrix (g..) is positive definite, the geodesic curve = (t) can
J
in principle be determined from the Euler-Lagrange equations
n
..
nn
g 1 d - + Z i j k 1 j = 0, k = 1
n
(6.2)
and from the boundary conditions
(t-j) = , (t 2 ) = .
In (6.2), the quantity

ijk

i [ a 9 j k
I

air 9 ki " a|r g-j]


J

(6 3)

and is known as the "Christoffel symbol of the first kind."


By definition of the geodesic curve = (t), its tangent vector
2
= (t) is of constant length with respect to the metric ds . Thus
nn
Jl g. .. = constant .
(6.4)
11 J 3
The constant may be chosen to be of value 1 when the curve parameter t is the
arc length parameter s, 0 < s < S Q , with (0) = , (s Q ) = and s Q = g(,)
is the geodesic distance between and .
Aitkinson and Mitchell (1981) describe two other methods of deriving
geodesic distances starting from a given differential metric. The distances
234

Differential Metrics in Probability Spaces

obtained by these authors in various cases are given below.

235

In each case we

give the probability function p(x,) and the associated geodesic distance of
(,) based on the Fisher information metric.
() Poisson distribution
p(x,) = e" X/x!, x = 0,1,..., > 0
g(,) = 2\/Q - /" I
(2)

Binomial distribution (n fixed)


P(x,) = x ( l - ) n ~ \ x = 0 , 1 , . . . , n , 0<<l
g(,) = 2/n|sin"/e - sin" /|
= 2/n cos'^/" + /(1-)(1- ] .

(3) Exponential distribution


p(x,) = e ~ x , x > 0
g(,) = I log - og| .
(4) Gamma distribution (n fixed)
p(x,) = n [ r ( n ) x n " V x , x > 0
g(,) = /n I log - log |
(5) Normal distribution (fixed variance)
2
2
p(,, Q ) = N(, Q ;x), Q fixed
g(y-,9P2) = li " U 21 / c r 0
(6) Normal distribution (fixed mean)
2
2
p(x,y o )= N(Q, ;x),y0 fixed
2 2
g(-j,2) = /2 I log 1 - log ^
(7) Normal distribution
2
2
p(x5y; ) = N(y, x ) , and both variable.
The information metric in this case is
ds

dy!+2d^

(6.5)

and the geodesic distance is

tanh"](l,2)

(6.6)

236

C. R. Rao

where (l,2) is the positive square root of


/

\C.,r\f

\L.

The explicit form (6.6) is given in Burbea and Rao (1982a).

From (6.6)

g(y 5 1 ;, 2 ) = /2|log 1 - log ^\


2
2
which agrees with result (6). However, g(,, ;y2, ) does not reduce to result
(7) since = constant is not a geodesic curve with respect to the metric (6.5)
(8) Multivariate normal distribution
N (,;x), fixed
g ( y r y 2 ) = C(y 1 -y 2 ) l " 1 (y ] -y 2 )] i s
which is Mahalanobis distance.
(9) Multivariate normal distribution
N(,;x), fixed
g ( r ? ) - [ 2 " I (log . ) 2 ] ^
1
where 0 < , <-< are the roots of the determinantal equation | 2 - X , | = 0.
The above explicit form is due to S. T. Jensen as mentioned in Atkinson and
Mitchell (1981).
(10)

Negative binomial distribution


p(x,) = [x!r(r)]" 1 r(x+r) x (l-) r , r fixed
-1
1 - /(b
g(,) = 2/r cosh
*
= 2/r log

This computation is due to Oiler and Cuadras (1985).


(11) Multinomial distribution
n

n!
l
k
p ( n r . . . , n k ; .,,...,^) = ^ ^ . . n ^ - --\

n fixed.

Let ^ = ( ^ ,... ,^^) and 2 = ( i2 9 '' # ' k2^


k

g(

r 2

COS

1
The above computation was originally done by Rao (1945), but an easier method

Differential Metrics in Probability Spaces

237

of derivation is given by Atkinson and Mitchell (1981).


Recently Burbea (1984) obtained geodesic distances in the case of
independent Poisson and Normal distributions which are given below. These
results (12) and (13) follow directly from (1) and (7) respectively as the
squared geodesic distances behave additively under combination of independent
distributions.
(12) Independent Poisson distributions
-

n
P ( X

1'

>V 1

TT ~

Ol

n
g(1,...,n;1,...,n) = 2[(/i -

(13) Independent Normal distributions


N(x;u 2 )...N(x n ; n , 2 )

_ 1+6.(1,2) 1 1

k=l

l-k(l,2)

= rt [ I log2

where 6.(1,2) is the positive square root of


( k - k 2 ) + 2 ( k k 2 )
(

(14)

k k 2 } +2( kl + k2 )

Multivariate elliptic distributions

p(x|y,) = | 1 / 2 h[( X -)'" 1 (x-)],


for some function h, and is fixed

where c. is a constant, which is essentially Mahalanobis distance. This result


is due to Mitchell and Krzanowski (1985).
The use of the c... coefficients defined in (2.4) and (4.2) in the
1J K

discussion of statistical problems will be considered in a future communication.

REFERENCES

Amari, S. I. (1982). Differential geometry of curved exponential families curvature and information loss. Ann. Stat. 10, 357-385.
Amari, S. I. (1983). A foundation of information geometry. Electronics and
Communications in Japan 66-A, 1-10.
Atkinson, C. and Mitchell, A. F. S. (1981). Rao's distance measure. Sankhya
43, 345-365.
Burbea, J. (1986).

Informative geometry in probability spaces. Expo. Math.

1, 347-378.
Burbea, J. and Rao, C. Radhakrishna (1982a). Entropy differential metric,
distance and divergence measures in probability spaces: a unified
approach. J. Multivariate Anal. 12, 575-596.
Burbea, J. and Rao, C. Radhakrishna (1982b). Differential metrics in probability spaces. Probability Math. Statist. 3, 115-132.
Cencov, N. N. (1982). Statistical decision rules and optimal inference.
Transactions of Mathematical Monographs 53, Amer. Math. Soc.,
Providence.
Csiszar, I. (1967).

Information-type measures of difference of probability

distributions and indirect observations. Studia Scientiarum


Mathematicarum Hungrica 2, 299-318.
Efron, B. (1975). Defining the curvature of a statistical problem (with
applications to second order efficiency, with discussion). Ann.
Statist. 3, 1189-1217.

238

Differential Metrics in Probability Spaces

239

Efron, B. (1982). Maximum likelihood decision theory. Ann. Statist. K),


340-356.
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a
curved exponential family. Ann. Statist. 11, 793-803.
Eguchi, S. (1984). A differential geometric approach to statistical inference
on the basis of contrast functionals. Tech. Report No. 136,
Hiroshima University, Hiroshima, Japan.
Havrda, M. E. and Charvat, F. (1967). Quantification method of classification
processes: Concept of -entropy. Kybernetika 3, 30-35.
Kass, R. E. (1980). The Riemannian structure of model spaces: a geometrical
approach to inference. Ph.D. thesis, University of Chicago.
Kass, R. E. (1981). The geometry of asymptotic inference. Tech. Rept. 215.
Dept. of Statistics, Carnegie-Mellon University.
Lau, Ka-Sing

(1985). Characterization of Rao's quadratic entropy. Sankhya A


47, 295-309.

Mitchell, A. F. S. and Krzanowski, W. J. (1985). The Mahalanobis distance and


elliptic distributions.

(To appear in Biometrika).

Nei, M. (1978). The theory of genetic distance and evolution of human races.
Japan J. Human Genet. 2!3, 341-369.
Oiler, J. M. and Cuadras, C. M. (1985). Rao's distance for negative multinomial distributions. Sankhya 47, 75-83.
Rao, C. Radhakrishna (1945).

Information and accuracy attainable in the estima-

tion of statistical parameters. Bull. Calcutta Math. Soc. 37,


81-91.
Rao, C. Radhakrishna (1948). The utilization of multiple measurements in problems of biological classification (with discussion). J. Roy.
Statist. Soc. BIO, 159-203.
Rao, C. Radhakrishna (1949). On the distance between two populations. Sankhya
.9, 246-248.
Rao, C. Radhakrishna (1954). On the use and interpretation of distance

240

C. R.

Rao

functions in statistics. Bull. Inst. Inter. Statist. 34, 90-100.


Rao, C. Radhakrishna (1962). Efficient estimates and optimum inference procedures in large samples (with discussion). J. Roy. Statist. Soc.
13 24, 46-72.
Rao, C. Radhakrishna (1973). Linear Statistical Inference and its Applications.
(Second edition) Wiley, New York.
Rao, C. Radhakrishna (1982a). Diversity and dissimilarity coefficients: a
unified approach. J. Theoret. Pop. Biology 21, 24-43.
Rao, C. Radhakrishna (1982b). Diversity: its measurement, decomposition,
apportionment and analysis. Sankhya A 44, 1-22.
Rao, C. Radhakrishna (1984). Convexity properties of entropy functions and
analysis of diversity.

In Inequalities in Statistics and

Probability, IRS Lecture Notes, Vol. 5, 68-77.


Rao, C. Radhakrishna and Nayak, T. K. (1985). Cross entropy, dissimilarity
measures and characterizations of quadratic entropy.

IEEE Trans.

Information Theory IT 31, 589-593.


Shahshahani, S. (1979). A new mathematical framework for the study of linkage
and selection. Memoirs of the American Mathematical Society,
No. 211.

S-ar putea să vă placă și