Information geometry
From Wikipedia, the free encyclopedia
Information geometry is a branch of mathematics that applies the techniques of differential geometry to the field of probability theory. This is done by taking probability distributions
for a statistical model as the points of a Riemannian manifold, forming a statistical manifold. The Fisher information metric provides the Riemannian metric.
Information geometry reached maturity through the work of Shun'ichi Amari and other Japanese mathematicians in the 1980s. Amari and Nagaoka's book, Methods of Information
Geometry,[1] is cited by most works of the relatively young field due to its broad coverage of significant developments attained using the methods of information geometry up to the
year 2000. Many of these developments were previously only available in Japanese-language publications.
Contents
1 Introduction
1.1 Information and probability
1.2 Statistical model, Parameters
1.3 Differential geometry applied to probability
1.4 Tangent space
1.5 alpha representation
1.6 Inner product
1.7 Fisher metric as inner product
1.8 Affine connection
1.9 Alpha connection
1.10 Divergence
1.11 Canonical divergence
1.12 Properties of divergence
1.13 Canonical divergence for the exponential family
1.14 Canonical divergence for general alpha families
2 History
3 Applications
4 See also
5 References
6 Further reading
7 External links
Introduction
The following introduction is based on Methods of Information Geometry.[1]
Information and probability
Define an n-set to be a set $V$ with cardinality $|V| = n$. To choose an element $v$ (value, state, point, outcome) from an n-set $V$, one needs to specify $\log_b n$ b-sets, if one disregards all but the cardinality. That is, $I = \ln |V|$ nats of information are required to specify $v$; equivalently, $I = \log_2 |V|$ bits are needed.
By considering the occurrences $C$ of values from $V$, one has an alternate way to refer to $v \in V$, through $C$. First, one chooses an occurrence $c \in C$, which requires information of $\log_2 |C|$ bits. To specify $v$, one subtracts the excess information used to choose one $c$ from all those linked to $v$; this is $\log_2 |C_v|$. Then $|C| / |C_v|$ is the number of portions of size $|C_v|$ fitting into $|C|$. Thus, one needs $\log_2 (|C| / |C_v|)$ bits to choose one of them. So the information (variable size, code length, number of bits) needed to refer to $v$, considering its occurrences, is
$$I(v) = -\log_2 p(v), \qquad p(v) = \frac{|C_v|}{|C|}.$$
Finally, $p(v) I(v)$ is the normalized portion of information needed to code all occurrences of one $v$. The averaged code length over all values is $H(V) = -\sum_v p(v) \log_2 p(v)$. $H(V)$ is called the entropy of a random variable $V$.
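The counting argument can be tried on a short message. The following is a minimal sketch, assuming occurrences are simply the characters of a string; the function names and the sample message are hypothetical and only serve the illustration.

```python
import math
from collections import Counter


def self_information_bits(count_v, total):
    """Bits needed to refer to a value that accounts for count_v of total occurrences."""
    return math.log2(total / count_v)


def entropy_bits(occurrences):
    """Average code length in bits over all values seen in `occurrences`."""
    counts = Counter(occurrences)
    total = len(occurrences)
    return sum((c / total) * self_information_bits(c, total) for c in counts.values())


message = "aabbbbcd"  # hypothetical message: a x2, b x4, c x1, d x1
print(self_information_bits(4, 8))  # information needed to refer to "b"
print(entropy_bits(message))        # averaged code length over all values
```

Frequent values need fewer bits than rare ones, and the entropy weights each code length by how often the value occurs.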
Statistical model, Parameters
With a probability distribution $p$ one looks at a variable $V$ through an observation context like a message or an experimental setup.
The context can often be identified by a set of parameters $\xi$ through combinatorial reasoning. The parameters can have an arbitrary number of dimensions and can be very local or less so, as long as the context given by a certain $\xi$ produces every value of $V$, i.e. the support does not change as a function of $\xi$. Every $\xi$ determines one probability distribution $p(v;\xi)$ for $V$. Basically all distributions for which there exists an explicit analytical formula fall into this category (Binomial, Normal, Poisson, ...). The parameters in these cases have a concrete meaning in the underlying setup, which is a statistical model for the context of $V$.
The parameters are quite different in nature from $V$ itself, because they do not describe $V$, but the observation context for $V$.
A parameterization of the form
$$p(v;\theta) = \sum_i \theta^i p_i(v)$$
with
$$\sum_i \theta^i = 1 \quad \text{and} \quad \theta^i \ge 0,$$
that mixes different distributions $p_i(v)$, is called a mixture distribution, $m$-parameterization, or mixture for short. All such parameterizations are related through an affine transformation. A parameterization with such a transformation rule is called flat.
1 de 6 7/7/17 13:18
Information geometry - Wikipedia https://en.wikipedia.org/wiki/Information_geometry
A flat parameterization for $\log p$ is an exponential or $e$-parameterization, because the parameters are in the exponent of $p$. There are several important distributions, like Normal and Poisson, that fall into this category. These distributions are collectively referred to as the exponential family or $e$-family. The $p$-manifold for such distributions is not affine, but the $\log p$ manifold is. This is called $e$-affine. The parameterization $\log p(v;\theta) = C(v) + \theta^i F_i(v) - \psi(\theta)$ for the exponential family can be mapped to the one above by making $\psi$ another parameter and extending the functions $F_i$ accordingly.
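As an illustration of the two parameterizations, the sketch below writes the Poisson distribution in exponential form, assuming the usual choices $C(x) = -\log x!$, $F(x) = x$, $\theta = \log \lambda$ and $\psi(\theta) = e^\theta$, and builds a two-component mixture. The function names and the numerical values are hypothetical.

```python
import math


def poisson_pmf(x, lam):
    """Textbook Poisson probability mass function."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def poisson_e_form(x, theta):
    """Same distribution written as exp(C(x) + theta * F(x) - psi(theta))."""
    c = -math.lgamma(x + 1)  # C(x) = -log(x!)
    f = x                    # F(x) = x
    psi = math.exp(theta)    # psi(theta) = e^theta, so lambda = e^theta
    return math.exp(c + theta * f - psi)


def mixture_pmf(x, weights, lams):
    """m-parameterization: convex combination of fixed Poisson components."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-12
    return sum(w * poisson_pmf(x, lam) for w, lam in zip(weights, lams))


lam = 2.5
print(poisson_pmf(3, lam), poisson_e_form(3, math.log(lam)))
print(sum(mixture_pmf(x, [0.3, 0.7], [1.0, 4.0]) for x in range(60)))
```

The first line prints the same probability twice, once from each form; the second shows that the mixture weights keep the total probability at one (up to the truncated tail).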
Differential geometry applied to probability
In information geometry, the methods of differential geometry are applied to describe the space of probability distributions for one variable $V$. This is done by using a coordinate system or atlas $\xi \in \mathbb{R}^n$. Furthermore, the probability $p(v;\xi)$ must be a differentiable and invertible function of $\xi$. In this case, the $\xi$ are coordinates of the $p(v;\xi)$-space, and the latter is a differential manifold $M$.
Given a function $f$ on $M$, one may "geometrize" it by taking it to define a new manifold: the points $f(p)$ of the new manifold are described by the same coordinates $\xi$ as the points $p$. In this way one "geometrizes" a function $f$, by encoding it into the coordinates used to describe the system.
For $f = \log$ the inverse is $f^{-1} = \exp$ and the resulting manifold of $\log p$ points is called the $e$-representation. The $p$ manifold itself is called the $m$-representation. The $e$- or $m$-representations, in the sense used here, do not refer to the parameterization families of the distribution.
Tangent space
In standard differential geometry, the tangent space on a manifold $M$ at a point $q$ is given by:
$$T_q M = \Big\{ X^i \partial_i \;\Big|\; X \in \mathbb{R}^n,\ \partial_i = \frac{\partial}{\partial \xi^i} \Big\}$$
In ordinary differential geometry, there is no canonical coordinate system on the manifold; thus, typically, all discussion must be with regard to an atlas, that is, with regard to functions on the manifold. As a result, tangent spaces and vectors are defined as operators acting on this space of functions. So, for example, in ordinary differential geometry, the basis vectors of the tangent space are the operators $\partial_i$.
However, with probability distributions $p(v;\xi)$, one can calculate value-wise. So it is possible to express a tangent space vector directly as $X^i \partial_i p$ ($m$-representation) or $X^i \partial_i \log p$ ($e$-representation), and not as operators.
alpha representation
Important functions $f$ of $p$ are coded by a parameter $\alpha$ with the important values $1$, $0$ and $-1$:
mixed or $m$-representation ($\alpha = -1$): $f(p) = p$
exponential or $e$-representation ($\alpha = 1$): $f(p) = \log p$
$0$-representation ($\alpha = 0$): $f(p) = 2\sqrt{p}$
Distributions that allow a flat parameterization are collectively called the $\alpha$-family ($m$-, $e$- or $0$-family) of distributions, and the corresponding manifold is called $\alpha$-affine.
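The three representations are special cases of one formula. The sketch below assumes the $\alpha$-representation in the form used by Amari and Nagaoka, $\frac{2}{1-\alpha} p^{(1-\alpha)/2}$ for $\alpha \neq 1$ and $\log p$ for $\alpha = 1$; the function name is hypothetical.

```python
import math


def alpha_representation(p, alpha):
    """Map a probability value p > 0 to its alpha-representation.

    Assumed form: 2 / (1 - alpha) * p ** ((1 - alpha) / 2) for alpha != 1,
    and log(p) for alpha == 1.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if alpha == 1:
        return math.log(p)
    return 2.0 / (1.0 - alpha) * p ** ((1.0 - alpha) / 2.0)


p = 0.25
print(alpha_representation(p, -1))  # m-representation: p itself
print(alpha_representation(p, 0))   # 0-representation: 2 * sqrt(p)
print(alpha_representation(p, 1))   # e-representation: log(p)
```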
Inner product
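One may introduce an inner product on the tangent space of manifold $M$ at point $q$ as a bilinear, symmetric and positive definite map
$$\langle \cdot, \cdot \rangle_q : T_q M \times T_q M \to \mathbb{R}.$$
This allows a Riemannian metric to be defined; the resulting manifold is a Riemannian manifold. All of the usual concepts of ordinary differential geometry carry over, including the norm
$$\|X\| = \sqrt{\langle X, X \rangle},$$
the line element $ds = \sqrt{g_{ij}\, d\xi^i d\xi^j}$, the volume element $dV = \sqrt{\det g}\; d\xi^1 \cdots d\xi^n$, and the cotangent space
$$T_q^* M = \{ \omega : T_q M \to \mathbb{R} \mid \omega \text{ linear} \},$$
that is, the dual space to the tangent space $T_q M$. From these, one may construct tensors, as usual.
Fisher metric as inner product
For probability manifolds such an inner product is given by the Fisher information metric.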
The Hellinger distance is applicable to the $0$-family; its infinitesimal form also evaluates, up to a constant factor fixed by the normalization of the distance, to the Fisher metric.
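Several standard expressions for the Fisher metric can be compared numerically. The following is a minimal sketch, assuming a Bernoulli family with $P(1) = \xi$ (chosen here only for illustration) and central finite differences; all function names are hypothetical. It evaluates the expectation of squared scores, the pairing of the $m$- and $e$-representations, and the squared derivative of the $0$-representation $2\sqrt{p}$.

```python
import math


def bernoulli(xi):
    """Probabilities of the outcomes 0 and 1 for parameter xi."""
    return [1.0 - xi, xi]


def d_dxi(f, xi, h=1e-6):
    """Central finite difference of f(p(x; xi)) with respect to xi, per outcome x."""
    lo, hi = bernoulli(xi - h), bernoulli(xi + h)
    return [(f(b) - f(a)) / (2.0 * h) for a, b in zip(lo, hi)]


def fisher_score_form(xi):
    """E[(d log p)^2]: expectation of the squared score."""
    dlog = d_dxi(math.log, xi)
    return sum(px * s * s for px, s in zip(bernoulli(xi), dlog))


def fisher_mixed_form(xi):
    """sum_x (d p)(d log p): pairing of m- and e-representation."""
    dp = d_dxi(lambda t: t, xi)
    dlog = d_dxi(math.log, xi)
    return sum(a * b for a, b in zip(dp, dlog))


def fisher_sqrt_form(xi):
    """sum_x (d 2 sqrt(p))^2: squared derivative of the 0-representation."""
    dsqrt = d_dxi(lambda t: 2.0 * math.sqrt(t), xi)
    return sum(a * a for a in dsqrt)


xi = 0.3
print(fisher_score_form(xi), fisher_mixed_form(xi), fisher_sqrt_form(xi))
print(1.0 / (xi * (1.0 - xi)))  # closed form for the Bernoulli family
```

All three expressions agree with the closed form $1 / (\xi (1 - \xi))$ up to the finite-difference error.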
Affine connection
As is commonly done on Riemannian manifolds, one may define an affine connection (or covariant derivative)
$$\nabla : TM \times TM \to TM, \qquad (X, Y) \mapsto \nabla_X Y.$$
Given vector fields $X$ and $Y$ lying in the tangent bundle $TM$, the affine connection $\nabla_X Y$ describes how to differentiate the vector field $Y$ along the direction $X$. It is itself a vector field; it is the sum of the infinitesimal change in the vector field $Y$, as one moves along the direction $X$, plus the infinitesimal change of the vector due to its parallel transport along the direction $X$. That is, it takes into account the changing nature of what it means to move a coordinate system in a "parallel" fashion, as one moves about in the manifold. In terms of the basis vectors $\partial_i$, one has the components:
$$\nabla_{\partial_i} \partial_j = \Gamma^k_{ij}\, \partial_k$$
The $\Gamma^k_{ij}$ are Christoffel symbols. The affine connection may be used for defining curvature and torsion, as is usual in Riemannian geometry.
Alpha connection
A non-metric connection is not determined by a metric tensor $g_{ij}$; instead, it is restricted by the requirement that the parallel transport $\Pi_{q,q'}$ between points $q$ and $q'$ must be a linear combination of the base vectors in $T_{q'}M$. Here,
$$\Pi_{q,q'}(\partial_j) = \partial'_j - d\xi^i\, \Gamma^k_{ij}\, \partial'_k$$
expresses the parallel transport of $\partial_j$ as a linear combination of the base vectors in $T_{q'}M$, i.e. the new $\partial'_j$ minus the change. Note that $\Gamma^k_{ij}$ is not a tensor (it does not transform as a tensor).
For the mentioned $\alpha$-families the affine connection is called the $\alpha$-connection and can also be expressed in more ways.
For :
,
i.e. 0-affine, and hence , i.e. 1-affine.
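For orientation, the coefficients of the $\alpha$-connection can be written out explicitly. The block below follows the notation of Amari and Nagaoka, with $\ell = \log p$ and $E[\cdot]$ the expectation under $p$; it is given as a reference form under that convention, not as a reconstruction of the formulas lost from this section.

```latex
\Gamma^{(\alpha)}_{ij,k}
  = E\!\left[\Big(\partial_i \partial_j \ell
      + \tfrac{1-\alpha}{2}\, \partial_i \ell\, \partial_j \ell\Big)\,
      \partial_k \ell\right],
\qquad \ell = \log p(v;\xi)

% Special cases and interpolation:
\Gamma^{(1)}_{ij,k}  = E\!\left[\partial_i \partial_j \ell\; \partial_k \ell\right]
  \quad \text{(e-connection)}
\Gamma^{(-1)}_{ij,k} = E\!\left[\big(\partial_i \partial_j \ell
      + \partial_i \ell\, \partial_j \ell\big)\, \partial_k \ell\right]
  \quad \text{(m-connection)}
\Gamma^{(\alpha)}_{ij,k}
  = \tfrac{1+\alpha}{2}\, \Gamma^{(1)}_{ij,k}
  + \tfrac{1-\alpha}{2}\, \Gamma^{(-1)}_{ij,k}
```

For $\alpha = 0$ the expression reduces to the Levi-Civita connection of the Fisher metric.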
Divergence
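A function of two distributions (points) $D(p\|q)$ with minimum $0$ for $p = q$ entails $\partial_i D = 0$ and $\partial'_i D = 0$ at $p = q$. Here $\partial_i$ is applied only to the first parameter, and $\partial'_i$ only to the second. $\partial_i$ is the direction which brought the two points to be equal, when applied to the first parameter, and to diverge again, when applied to the second parameter, i.e. $\partial_i = -\partial'_i$. The sign cancels in $\partial_i \partial_j D = \partial'_i \partial'_j D = -\partial_i \partial'_j D$, which we can define to be a metric $g_{ij}$, if always positive.
The absolute derivative of $g_{ij}$ along $\partial_k$ yields candidates for dual connections, $\Gamma_{ij,k} = -\partial_i \partial_j \partial'_k D$ and $\Gamma^*_{ij,k} = -\partial'_i \partial'_j \partial_k D$, evaluated at $p = q$. This metric and the connections relate to the Taylor series expansion for the first parameter or second parameter. Here for the first parameter, to lowest order:
$$D(p + dp \,\|\, p) = \tfrac{1}{2}\, g_{ij}\, d\xi^i d\xi^j + O(|d\xi|^3),$$
with the connection coefficients entering the third-order term.
The term $D(p\|q)$ is called the divergence or contrast function. A good choice is $D(p\|q) = \sum_v p(v)\, f\!\left(\frac{q(v)}{p(v)}\right)$ with $f$ convex for $u > 0$. From Jensen's inequality it follows that $D(p\|q) \ge f(1)$, and for $f(u) = -\ln u$ one obtains
$$D(p\|q) = \sum_v p(v) \ln \frac{p(v)}{q(v)} \ge 0,$$
which is the Kullback-Leibler divergence or relative entropy applicable to the $\pm 1$-families.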
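That a divergence carries a metric in its second-order term can be checked numerically. The sketch below assumes the Kullback-Leibler divergence on a Bernoulli family with $P(1) = \xi$ and takes a second-order finite difference in the second argument at $p = q$; the helper names are hypothetical. It also evaluates the same divergence in the $f$-divergence form $\sum p\, f(q/p)$ with $f(u) = -\ln u$.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence sum_v p(v) * ln(p(v) / q(v))."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def f_divergence(p, q, f):
    """D_f(p || q) = sum_v p(v) * f(q(v) / p(v)) for a convex function f."""
    return sum(a * f(b / a) for a, b in zip(p, q))


def bernoulli(xi):
    """Probabilities of the outcomes 0 and 1 for parameter xi."""
    return [1.0 - xi, xi]


def metric_from_divergence(xi, h=1e-4):
    """Second derivative of KL in its second argument, taken at p = q."""
    p = bernoulli(xi)
    d_plus = kl(p, bernoulli(xi + h))
    d_minus = kl(p, bernoulli(xi - h))
    return (d_plus - 2.0 * kl(p, p) + d_minus) / h ** 2


p, q = bernoulli(0.3), bernoulli(0.6)
print(kl(p, q), f_divergence(p, q, lambda u: -math.log(u)))
print(metric_from_divergence(0.3), 1.0 / (0.3 * 0.7))
```

The second line compares the metric obtained from the divergence with the closed-form Fisher information of the Bernoulli family.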
Canonical divergence
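We now consider two manifolds $S$ and $S^*$, represented by two sets of coordinate functions $\theta^i$ and $\eta_j$. The corresponding tangent space basis vectors will be denoted by $\partial_i = \partial / \partial \theta^i$ and $\partial^j = \partial / \partial \eta_j$. The bilinear map $\langle \cdot, \cdot \rangle : TS \times TS^* \to \mathbb{R}$ associates a quantity to the dual base vectors. This defines an affine connection $\nabla$ for $S$ and an affine connection $\nabla^*$ for $S^*$ that keep $\langle \partial_i, \partial^j \rangle$ constant for parallel transport of $\partial_i$ and $\partial^j$, defined through $\nabla$ and $\nabla^*$.
If $\nabla$ is flat, then there exists a coordinate system $\theta^i$ whose basis vectors $\partial_i$ do not change over $S$. In order to keep $\langle \partial_i, \partial^j \rangle$ constant, $\partial^j$ must not change either, i.e. $\nabla^*$ is also flat. Furthermore, in this case, we can choose coordinate systems such that
$$\langle \partial_i, \partial^j \rangle = \delta_i^j.$$
If $S^*$ results as a function on $S$, then both sets of coordinate functions describe $S$. The connections are such, though, that $\nabla$ makes $\theta$ flat and $\nabla^*$ makes $\eta$ flat. This dual space is denoted as $(S, g, \nabla, \nabla^*)$.
Because of the linear transform between the flat coordinate systems, we have $\partial_i \eta_j = g_{ij}$ and $\partial^i \theta^j = g^{ij}$.
Because $g_{ij} = g_{ji}$, and likewise for $g^{ij}$, it is possible to define two potentials $\psi$ and $\varphi$ through $\eta_i = \partial_i \psi$ and $\theta^i = \partial^i \varphi$ (Legendre transform). These are
$$\psi(\theta) = \max_\eta \{ \theta^i \eta_i - \varphi(\eta) \} \quad \text{and} \quad \varphi(\eta) = \max_\theta \{ \theta^i \eta_i - \psi(\theta) \}.$$
Then
$$g_{ij} = \partial_i \partial_j \psi \quad \text{and} \quad g^{ij} = \partial^i \partial^j \varphi.$$
This leads to the canonical divergence
$$D(p\|q) = \psi(p) + \varphi(q) - \theta^i(p)\, \eta_i(q).$$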
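The dual potentials can be made concrete on the simplest exponential family. The sketch below assumes a Bernoulli distribution with $P(1) = \xi$, natural coordinate $\theta = \log(\xi / (1 - \xi))$, potential $\psi(\theta) = \log(1 + e^\theta)$, expectation coordinate $\eta = \xi$ and dual potential $\varphi(\eta)$ equal to the negative entropy. It uses the canonical divergence in the form $\psi(p) + \varphi(q) - \theta(p)\,\eta(q)$ given by Amari and Nagaoka; all function names are hypothetical.

```python
import math


def theta_of(xi):
    """Natural (e-affine) coordinate of a Bernoulli distribution with P(1) = xi."""
    return math.log(xi / (1.0 - xi))


def psi(theta):
    """Potential psi(theta) = log(1 + e^theta); its derivative is eta."""
    return math.log1p(math.exp(theta))


def eta_of(theta):
    """Expectation (m-affine) coordinate eta = d psi / d theta."""
    return 1.0 / (1.0 + math.exp(-theta))


def phi(eta):
    """Dual potential: negative entropy, the Legendre transform of psi."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)


def canonical_divergence(xi_p, xi_q):
    """D(p || q) = psi(theta_p) + phi(eta_q) - theta_p * eta_q."""
    theta_p = theta_of(xi_p)
    return psi(theta_p) + phi(xi_q) - theta_p * xi_q


def kl(xi_a, xi_b):
    """Kullback-Leibler divergence between Bernoulli(xi_a) and Bernoulli(xi_b)."""
    return (xi_a * math.log(xi_a / xi_b)
            + (1.0 - xi_a) * math.log((1.0 - xi_a) / (1.0 - xi_b)))


t = theta_of(0.3)
print(eta_of(t))                                      # recovers xi = 0.3
print(psi(t) + phi(0.3) - t * 0.3)                    # Legendre identity, ~0
print(canonical_divergence(0.3, 0.6), kl(0.6, 0.3))   # the two values agree
```

With this convention the canonical divergence $D(p\|q)$ coincides with the Kullback-Leibler divergence taken with its arguments in the opposite order.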
Properties of divergence
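The meaning of the canonical divergence depends on the meaning of the metric $g$ and vice versa ($g_{ij} = -\partial_i \partial'_j D$). For the metric $g$ (Fisher metric) with the dual connections $\nabla^{(1)}$ and $\nabla^{(-1)}$ this is the relative entropy. For the self-dual Euclidean space, the canonical divergence is $D(p\|q) = \tfrac{1}{2} d(p,q)^2$, which leads to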
The last part drops in case of dual flatness. is the exponential map.
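Pythagorean theorem: if the $\nabla$-geodesic connecting $p$ and $q$ and the $\nabla^*$-geodesic connecting $q$ and $r$ meet orthogonally at $q$, then $D(p\|r) = D(p\|q) + D(q\|r)$.
For $p \in S$ and $q \in M$ with $M$ a $\nabla^*$-autoparallel sub-manifold, $D(p\|q) = \min_{r \in M} D(p\|r)$ implies that the $\nabla$-geodesic connecting $p$ and $q$ is orthogonal to $M$.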
By projecting onto of a curve one can calculate
and with .
For an autoparallel sub-manifold, parallel transport in it can be expressed with the sub-manifold's base vectors, i.e. $\nabla_{\partial_a} \partial_b = \Gamma^c_{ab}\, \partial_c$. A one-dimensional autoparallel sub-manifold is a geodesic.
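The Pythagorean relation can be verified on distributions over three outcomes. The sketch below assumes the Kullback-Leibler divergence, a straight line in the probabilities ($m$-geodesic) from $q$ to $p$, and a straight line in the log-probabilities ($e$-geodesic) from $q$ to $r$; the directions are chosen so that the two curves are orthogonal at $q$ with respect to the Fisher metric. All names and numerical values are hypothetical.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence sum_v p(v) * ln(p(v) / q(v))."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def m_geodesic_point(q, u, s):
    """Move from q along a straight line in the probabilities (u sums to zero)."""
    return [qi + s * ui for qi, ui in zip(q, u)]


def e_geodesic_point(q, v, t):
    """Move from q along a straight line in the log-probabilities, then renormalize."""
    w = [qi * math.exp(t * vi) for qi, vi in zip(q, v)]
    z = sum(w)
    return [wi / z for wi in w]


def triangle_defect(p, q, r):
    """KL(p||q) + KL(q||r) - KL(p||r), which equals sum_v (p - q)(ln r - ln q)."""
    return kl(p, q) + kl(q, r) - kl(p, r)


q = [0.2, 0.3, 0.5]
u = [1.0, 1.0, -2.0]   # m-direction, components sum to zero
v = [1.0, -1.0, 0.0]   # e-direction, chosen so that sum(u_i * v_i) == 0
p = m_geodesic_point(q, u, 0.05)
r = e_geodesic_point(q, v, 0.7)
print(triangle_defect(p, q, r))  # vanishes up to rounding: KL(p||r) = KL(p||q) + KL(q||r)
```

For directions that are not orthogonal the defect is non-zero and equals $\sum_v (p(v) - q(v)) (\ln r(v) - \ln q(v))$, the analogue of the cosine term in the Euclidean law of cosines.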
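Canonical divergence for the exponential family
For the exponential family one has $\sum_v \exp\big(C(v) + \theta^i F_i(v) - \psi(\theta)\big) = 1$. Applying $\partial_i$ on both sides yields $\partial_i \psi = E[F_i] = \eta_i$. The other potential is $\varphi = \theta^i \eta_i - \psi = E[\log p - C] = -H - E[C]$ ($H$ is entropy, and $\eta_i = E[F_i]$ was used). $g_{ij} = \partial_i \partial_j \psi$ is the covariance of $F_i$ and $F_j$, which attains the Cramér-Rao bound, i.e. an efficient estimator must be exponential.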
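The canonical divergence is given by the Kullback-Leibler divergence, $D(p\|q) = \sum_v q(v) \ln \frac{q(v)}{p(v)}$, and the triangulation is
$$D(p\|q) + D(q\|r) - D(p\|r) = \big(\theta^i(p) - \theta^i(q)\big)\big(\eta_i(r) - \eta_i(q)\big).$$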
The minimal divergence to a sub-manifold given by a restriction like some constant means maximizing . With this corresponds to the maximum entropy
principle.
The connection induced by the divergence is not flat unless . Then the Pythagorean theorem for two curves intersecting orthogonally at is:
History
The history of information geometry is associated with the discoveries of at least the following people, and many others:
Solomon Kullback
Jean-Louis Koszul
Richard Leibler
Claude Shannon
Imre Csiszár
N. N. Cencov (also written as Chentsov)
Bradley Efron
Paul Vos
Shun'ichi Amari
Hiroshi Nagaoka
Robert Kass
Shinto Eguchi
Ole Barndorff-Nielsen
Frank Nielsen
Giovanni Pistone
Bernard Hanzon
Damiano Brigo
Applications
Information geometry can be applied where parametrized distributions play a role.
statistical inference
time series and linear systems
quantum systems
neural networks
machine learning
statistical mechanics
biology
statistics
mathematical finance
See also
Ruppeiner geometry
References
1. Shun'ichi Amari, Hiroshi Nagaoka - Methods of information geometry, Translations of mathematical monographs; v. 191, American Mathematical Society, 2000 (ISBN 978-0821805312)
Further reading
Shun'ichi Amari, Hiroshi Nagaoka (2000) Methods of Information Geometry, Translations of Mathematical Monographs; v. 191, American Mathematical Society, (ISBN
978-0821805312)
Shun'ichi Amari (1985) Differential-geometrical methods in statistics, Lecture notes in statistics, Springer-Verlag, Berlin.
M. Murray and J. Rice (1993) Differential geometry and statistics, Monographs on Statistics and Applied Probability 48, Chapman and Hall.
R. E. Kass and P. W. Vos (1997) Geometrical Foundations of Asymptotic Inference, Series in Probability and Statistics, Wiley.
N. N. Cencov (1982) Statistical Decision Rules and Optimal Inference, Translations of Mathematical Monographs; v. 53, American Mathematical Society
Giovanni Pistone, and Sempi, C. (1995). "An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one", Annals of Statistics. 23(5): 1543-1561.
Brigo, D, Hanzon, B, Le Gland, F. (1999) "Approximate nonlinear filtering by projection on exponential manifolds of densities", Bernoulli 5: 495 - 534, ISSN 1350-7265
(https://www.worldcat.org/search?fq=x0:jrnl&q=n2:1350-7265)
Brigo, D, (1999) "Diffusion Processes, Manifolds of Exponential Densities, and Nonlinear Filtering", in Ole E. Barndorff-Nielsen and Eva B. Vedel Jensen, editors, Geometry in
Present Day Science (http://www.worldscientific.com/worldscibooks/10.1142/3958), World Scientific
Arwini, Khadiga, Dodson, C. T. J. (2008) Information Geometry - Near Randomness and Near Independence (http://www.springer.com/mathematics/geometry/book/978-3-540-69391-8), Lecture Notes in Mathematics # 1953, Springer ISBN 978-3-540-69391-8
Th. Friedrich (1991) "Die Fisher-Information und symplektische Strukturen", Mathematische Nachrichten 153: 273-296.
External links
Information Geometry (http://bactra.org/notebooks/info-geo.html) overview by Cosma Rohilla Shalizi, July 2010
Information Geometry (http://math.ucr.edu/home/baez/information/) notes by John Baez, November 2012
Information geometry for neural networks (pdf) (http://www.its.caltech.edu/~daw/papers/98-Wage2.pdf), by Daniel Wagenaar