One of the most profound shifts in modern machine learning theory is the recognition that optimization is, at its heart, a geometric problem. When we train a neural network, we are not simply descending a hill in some abstract parameter space — we are navigating a curved manifold whose local geometry is dictated by the statistical structure of the model itself. This geometric perspective, the province of information geometry, transforms how we think about learning, generalization, and the nature of neural computation.
Information geometry, developed systematically by Shun-ichi Amari and his collaborators from the 1980s onward, studies the differential-geometric structure of families of probability distributions. The central insight is deceptively simple: given a statistical model — a parametric family of distributions p(x∣θ) — the parameter space Θ is not a flat Euclidean space but a curved Riemannian manifold. The metric that endows it with this curvature is the Fisher information matrix, a quantity of fundamental importance in both statistics and physics.
For neural networks, this viewpoint is particularly illuminating. Every neural network with a probabilistic output layer — a classifier, a language model, a variational autoencoder — defines a parametric family of distributions over its outputs. The billions of parameters specifying the network's weights are coordinates on a high-dimensional statistical manifold. The process of training is the process of moving along this manifold toward a region where the model's distribution closely matches the data-generating distribution.
Standard gradient descent ignores this geometry. It treats all directions in parameter space as equally significant, implicitly assuming a flat Euclidean metric. But the true metric — the Fisher metric — stretches and compresses the space in ways that depend on the curvature of the model. When we correct for this curvature using the natural gradient, optimization becomes faster, more principled, and more aligned with the intrinsic statistical structure of the problem.
This article develops these ideas in detail. We begin with the definition of statistical manifolds and the Fisher information metric, establish their foundational properties (including Chentsov's remarkable uniqueness theorem), and connect them to information-theoretic quantities such as the Kullback–Leibler divergence. We then apply this machinery to neural networks — examining the geometry of their parameter spaces, the natural gradient algorithm, and the role of curvature in learning dynamics. We conclude with a survey of modern research directions and the challenges and opportunities that remain.
Statistical Manifolds
Parametric Families as Geometric Objects
A statistical model is a parametric family of probability distributions:
$$\mathcal{M} = \{\, p(x \mid \theta) \mid \theta \in \Theta \subseteq \mathbb{R}^n \,\}$$
where x ranges over some sample space X and θ=(θ1,θ2,…,θn) are the parameters. The classical examples include the family of Gaussian distributions parameterized by mean μ and variance σ², the family of Bernoulli distributions parameterized by p∈(0,1), and exponential families of all kinds.
The transition from statistics to geometry occurs when we recognize M as a differentiable manifold. Each point θ∈Θ corresponds to a distinct probability distribution p(⋅∣θ), and the parameter coordinates (θ1,…,θn) serve as a local chart on the manifold. Under mild regularity conditions — that p(x∣θ) is smooth in θ, that the support of p does not depend on θ, and that the model is identifiable (distinct θ yield distinct distributions) — the family M has the structure of an n-dimensional manifold.
The tangent space at a point θ is spanned by the score functions:
$$\partial_i \log p(x \mid \theta) = \frac{\partial \log p(x \mid \theta)}{\partial \theta_i}, \qquad i = 1, \ldots, n.$$
These score functions have a beautiful statistical interpretation: they measure how sensitively the log-likelihood changes as we move in the i-th parameter direction. Under the regularity conditions above, they satisfy:
$$\mathbb{E}_\theta\big[\partial_i \log p(X \mid \theta)\big] = 0$$
for all i, a fact that follows immediately from differentiating ∫p(x∣θ)dx=1 under the integral sign.
From Euclidean to Riemannian
In ordinary Euclidean geometry, the squared distance between two nearby points θ and θ+dθ is given by $\|d\theta\|^2 = \sum_i (d\theta_i)^2$. But this flat metric is geometrically arbitrary — it has no special relationship to the statistical content of the model. Two parameter configurations that are numerically close may correspond to very different probability distributions, or vice versa. What we want is a metric that captures statistical distinguishability: two distributions should be close if and only if they are difficult to tell apart based on data.
This desire motivates a Riemannian structure on M, where the metric tensor at each point θ provides a quadratic form on the tangent space:
$$ds^2 = \sum_{i,j} g_{ij}(\theta)\, d\theta_i\, d\theta_j.$$
The question is: what should gij(θ) be? The answer — essentially unique, as Chentsov's theorem will confirm — is the Fisher information matrix.
Fisher Information and the Riemannian Metric
The Fisher Information Matrix
The Fisher information matrix (FIM) is defined as:
$$I_{ij}(\theta) = \mathbb{E}_\theta\!\left[ \frac{\partial \log p(X \mid \theta)}{\partial \theta_i} \cdot \frac{\partial \log p(X \mid \theta)}{\partial \theta_j} \right],$$
or equivalently, under regularity:
$$I_{ij}(\theta) = -\,\mathbb{E}_\theta\!\left[ \frac{\partial^2 \log p(X \mid \theta)}{\partial \theta_i\, \partial \theta_j} \right].$$
The equivalence of these two forms — the outer-product form and the Hessian form — is itself a non-trivial identity, following from differentiating the normalization condition twice. Both forms appear throughout the literature and each has computational and conceptual advantages.
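To make the two forms concrete, here is a minimal NumPy check for the Bernoulli family (a toy choice purely for illustration); both computations recover the textbook value $1/(p(1-p))$:

```python
import numpy as np

def bernoulli_fisher_outer(p):
    """Outer-product form: E[(d/dp log p(X|p))^2] over X in {0, 1}."""
    scores = np.array([-1.0 / (1.0 - p), 1.0 / p])       # score at x=0 and x=1
    probs = np.array([1.0 - p, p])
    return np.sum(probs * scores**2)

def bernoulli_fisher_hessian(p):
    """Hessian form: -E[d^2/dp^2 log p(X|p)]."""
    second_derivs = np.array([-1.0 / (1.0 - p)**2, -1.0 / p**2])
    probs = np.array([1.0 - p, p])
    return -np.sum(probs * second_derivs)

p = 0.3
print(bernoulli_fisher_outer(p))    # ~4.7619
print(bernoulli_fisher_hessian(p))  # same value
print(1.0 / (p * (1.0 - p)))        # analytic reference: 1 / (p (1 - p))
```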
The FIM is always positive semi-definite: for any vector $v \in \mathbb{R}^n$,
$$v^\top I(\theta)\, v = \mathbb{E}_\theta\!\left[ \left( \sum_i v_i\, \frac{\partial \log p(X \mid \theta)}{\partial \theta_i} \right)^{\!2}\, \right] \ge 0,$$
and under identifiability it is positive definite, making it a valid Riemannian metric tensor:
$$g_{ij}(\theta) = I_{ij}(\theta).$$
The Cramér–Rao Bound
The Fisher information metric has a direct statistical consequence: the Cramér–Rao inequality. For any unbiased estimator $\hat{\theta}(X_1, \ldots, X_m)$ of θ from m i.i.d. samples, the covariance matrix of the estimator satisfies:
$$\mathrm{Cov}_\theta(\hat{\theta}) \succeq \frac{1}{m}\, I(\theta)^{-1},$$
where ⪰ denotes the Loewner partial order on positive semidefinite matrices. The inverse Fisher matrix thus gives a lower bound on statistical uncertainty: the geometry of the manifold limits how precisely we can estimate parameters. Distributions that are close in Fisher metric are hard to distinguish; estimating which one generated the data requires many samples.
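A quick Monte Carlo sanity check, using estimation of a Gaussian mean with known variance (chosen because the sample mean is an efficient estimator, so the bound is attained):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m, trials = 2.0, 1.5, 50, 200_000

# Repeatedly estimate mu by the sample mean of m i.i.d. N(mu, sigma^2) draws.
estimates = rng.normal(mu, sigma, size=(trials, m)).mean(axis=1)

fisher_per_sample = 1.0 / sigma**2        # Fisher information for the mean parameter
cramer_rao = 1.0 / (m * fisher_per_sample)

print(estimates.var())   # empirical variance of the estimator
print(cramer_rao)        # sigma^2 / m; the sample mean attains the bound
```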
Geometric Interpretation
The Fisher metric endows M with a Riemannian geometry whose invariants — distances, geodesics, curvatures — all have statistical interpretations.
Distances between distributions measure their statistical dissimilarity: the Fisher–Rao distance between p(⋅∣θ) and p(⋅∣θ′) is the length of the shortest geodesic connecting θ and θ′ on the manifold. It is a true intrinsic distance, invariant under reparameterization.
Geodesics are the straightest paths on the manifold — the analogue of straight lines in Euclidean space. Their behavior determines how parameter trajectories under gradient-based optimization relate to the intrinsic geometry of the model.
Curvature captures how the manifold deviates from flatness and controls the extent to which linear approximations fail. High curvature near a particular θ indicates that the local statistical behavior is strongly nonlinear.
The Fisher–Rao Geometry
Natural Distance Between Distributions
The Fisher–Rao metric refers specifically to the Riemannian structure on a statistical manifold induced by the Fisher information. Given a smooth curve θ(t), t∈[0,1], connecting two distributions, its length under the Fisher–Rao metric is:
$$\ell[\theta] = \int_0^1 \sqrt{\dot{\theta}(t)^\top I(\theta(t))\, \dot{\theta}(t)}\; dt.$$
The geodesic distance between θ0 and θ1 is the infimum of this functional over all connecting curves:
$$d_{\mathrm{FR}}(\theta_0, \theta_1) = \inf_{\theta(\cdot)} \ell[\theta].$$
For many classical families, the Fisher–Rao geodesic distance has a closed form. For univariate Gaussians parameterized by (μ,σ) with σ>0, the parameter space becomes the Poincaré upper half-plane with the hyperbolic metric — one of the most celebrated spaces in geometry — and the Fisher–Rao distance is exactly the hyperbolic distance:
$$d_{\mathrm{FR}}\big((\mu_1, \sigma_1), (\mu_2, \sigma_2)\big) = \sqrt{2}\, \operatorname{arcosh}\!\left( 1 + \frac{(\mu_1 - \mu_2)^2 + 2(\sigma_1 - \sigma_2)^2}{4\, \sigma_1 \sigma_2} \right).$$
This is a striking result: the space of Gaussian distributions is not flat but negatively curved, with the curvature of the hyperbolic plane. Negative curvature means that triangles are "thinner" than in Euclidean space — a geometric signature with deep consequences for statistical inference.
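The closed form quoted above is easy to sanity-check numerically. The sketch below verifies the pure change-of-scale case, where the distance reduces to $\sqrt{2}\,|\log(\sigma_2/\sigma_1)|$, and the small-displacement limit, where it must agree with the Fisher metric $ds^2 = (d\mu^2 + 2\,d\sigma^2)/\sigma^2$:

```python
import numpy as np

def fisher_rao_gaussian(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2),
    using the hyperbolic closed form quoted above."""
    arg = 1.0 + ((mu1 - mu2)**2 + 2.0 * (sigma1 - sigma2)**2) / (4.0 * sigma1 * sigma2)
    return np.sqrt(2.0) * np.arccosh(arg)

# Pure change of scale: distance should equal sqrt(2) * |log(sigma2 / sigma1)|.
print(fisher_rao_gaussian(0.0, 1.0, 0.0, np.e))      # ~1.4142 = sqrt(2)

# Small displacement: should match ds = sqrt(dmu^2 + 2 dsigma^2) / sigma.
dmu, dsig, sigma = 1e-4, 2e-4, 0.7
print(fisher_rao_gaussian(0.0, sigma, dmu, sigma + dsig))
print(np.sqrt(dmu**2 + 2.0 * dsig**2) / sigma)        # metric prediction
```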
Exponential Families and Dual Coordinates
For exponential families — the most tractable and important class of statistical models — the Fisher–Rao geometry has an exceptionally clean structure. An exponential family takes the form:
$$p(x \mid \theta) = h(x) \exp\!\left( \sum_i \theta_i T_i(x) - A(\theta) \right),$$
where θ are the natural parameters, $T_i(x)$ are the sufficient statistics, and $A(\theta) = \log \int h(x)\, e^{\theta \cdot T(x)}\, dx$ is the log-partition function (cumulant-generating function).
For exponential families, the Fisher information matrix takes the form:
$$I_{ij}(\theta) = \frac{\partial^2 A(\theta)}{\partial \theta_i\, \partial \theta_j} = \mathrm{Cov}_\theta\big(T_i(X),\, T_j(X)\big).$$
The manifold admits two dual coordinate systems: the natural parameters θ and the expectation parameters $\eta_i = \mathbb{E}_\theta[T_i(X)] = \partial_i A(\theta)$. These two coordinate systems are related by a Legendre transform, and they define two dual flat connections on the manifold — the e-connection (exponential) and the m-connection (mixture) — a structure central to Amari's framework of dually flat manifolds.
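For the Bernoulli family written in natural-parameter form, with $T(x) = x$ and $A(\theta) = \log(1 + e^{\theta})$ (a deliberately tiny example), the identity $I(\theta) = \partial^2 A / \partial \theta^2 = \mathrm{Var}_\theta(T)$ can be checked directly:

```python
import numpy as np

def log_partition(theta):
    """A(theta) = log(1 + e^theta) for the Bernoulli family in natural parameters."""
    return np.logaddexp(0.0, theta)

theta = 0.8
p = 1.0 / (1.0 + np.exp(-theta))     # mean parameter eta = E[T(X)] = A'(theta)

# Fisher information two ways: second derivative of A (finite differences)
# versus the variance of the sufficient statistic T(x) = x.
eps = 1e-5
hessian_A = (log_partition(theta + eps) - 2.0 * log_partition(theta)
             + log_partition(theta - eps)) / eps**2
var_T = p * (1.0 - p)

print(hessian_A, var_T)   # both ~0.2139
```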
Chentsov's Theorem
The Uniqueness of the Fisher Metric
A natural question arises: is the Fisher information metric merely a convenient choice, or is it the canonical metric on statistical manifolds? The answer is provided by a landmark result.
Theorem (Chentsov, 1972). Let M be a statistical manifold over a finite sample space, and let g be any Riemannian metric on M that is invariant under all sufficient statistics (i.e., under all Markov morphisms / congruent embeddings). Then g is proportional to the Fisher information metric.
More precisely: Chentsov showed that if we require the metric to be invariant under sufficient statistics — transformations of the data that preserve all statistical information about θ — then up to a positive constant, the Fisher metric is the unique such metric.
This theorem establishes the Fisher metric's canonical status. It is not an ad hoc choice but a mathematical necessity: any statistically sensible notion of distance on a family of probability distributions must be the Fisher–Rao distance. The geometry of inference is fixed by the logic of statistical sufficiency.
Amari and Nagaoka later extended this uniqueness result to the full family of α-connections — a one-parameter family of affine connections on the statistical manifold — showing that the Fisher metric paired with the ±α connections (of which the e- and m-connections are the ±1 cases) are the unique objects invariant under Markov morphisms.
Implications for Statistical Theory
Chentsov's theorem has several deep implications:
Intrinsic inference: The geometry of statistical inference is intrinsic to the model — it does not depend on how we parameterize the family. Any sufficient reparameterization preserves the Fisher metric structure.
Natural divergences: The divergences consistent with the Fisher geometry — the family of f-divergences and, more generally, Bregman divergences on dually flat manifolds — are natural objects of statistical theory, not mere conventions.
Universality: The Fisher metric appears across probability theory, statistical physics, quantum mechanics, and information theory for precisely this reason: it is the canonical measure of statistical dissimilarity.
Connections to Information Theory
The Kullback–Leibler Divergence
The Kullback–Leibler divergence (or relative entropy) from distribution P to distribution Q is:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx.$$
The KL divergence is not a true distance — it is asymmetric and does not satisfy the triangle inequality — but it is a divergence: it satisfies DKL(P∥Q)≥0 with equality if and only if P=Q.
The connection to the Fisher metric emerges from its Taylor expansion. Consider two nearby distributions p(x∣θ) and p(x∣θ+dθ) on a statistical manifold. A direct computation yields:

$$D_{\mathrm{KL}}\big(p(\cdot \mid \theta)\,\|\,p(\cdot \mid \theta + d\theta)\big) = \frac{1}{2} \sum_{i,j} I_{ij}(\theta)\, d\theta_i\, d\theta_j + O(\|d\theta\|^3).$$
This is the fundamental link between information theory and Riemannian geometry: the Fisher metric is exactly the second-order approximation of the KL divergence. Locally, the cost of "moving" from one distribution to a nearby one — measured in information — is quadratic in the parameter displacement, with the Fisher matrix as the quadratic form.
This formula explains why the KL divergence is so natural in machine learning: minimizing KL divergence (equivalently, maximum likelihood) is equivalent to finding the distribution in M that minimizes the Fisher-metric distance to the empirical distribution. The loss function of deep learning has a geometric interpretation built into its very definition.
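This local equivalence is easy to verify numerically. The sketch below uses the univariate Gaussian family as an illustrative case, whose Fisher matrix in $(\mu, \sigma)$ coordinates is $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$, and compares the exact KL divergence between two nearby Gaussians with the Fisher quadratic form:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2.0 * s2**2) - 0.5

mu, sigma = 0.5, 1.3
d_theta = np.array([1e-3, -2e-3])          # small displacement (d mu, d sigma)

# Fisher matrix of N(mu, sigma^2) in (mu, sigma) coordinates.
I = np.diag([1.0 / sigma**2, 2.0 / sigma**2])

exact_kl = kl_gauss(mu, sigma, mu + d_theta[0], sigma + d_theta[1])
quadratic = 0.5 * d_theta @ I @ d_theta

print(exact_kl, quadratic)   # agree up to third-order terms in the displacement
```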
The α-Divergences and Dual Geometry
The KL divergence is recovered as a limiting case (α → ±1, depending on the argument order) of the α-divergences:

$$D^{(\alpha)}(P \,\|\, Q) = \frac{4}{1 - \alpha^2} \left( 1 - \int p(x)^{\frac{1 - \alpha}{2}}\, q(x)^{\frac{1 + \alpha}{2}}\, dx \right), \qquad \alpha \neq \pm 1.$$
All α-divergences share the same second-order approximation — the Fisher metric — but differ in their higher-order terms and hence in their curvature properties. Each α-divergence induces a pair of dual connections (∇(α),∇(−α)) on the manifold, with the Levi-Civita connection (of the Fisher metric) arising as their midpoint at α=0.
This dual structure — a manifold equipped with a metric and two mutually dual connections — is the central object of information geometry. On dually flat manifolds (exponential families with their e- and m-connections), the dual structure gives rise to a generalized Pythagorean theorem for KL divergences:
$$D_{\mathrm{KL}}(P \,\|\, R) = D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, R)$$
when Q is the m-projection of P onto an e-flat submanifold containing R. This identity underlies the EM algorithm, belief propagation, and many other inference algorithms.
Information Geometry of Neural Networks
Neural Networks as Statistical Manifolds
A neural network with a probabilistic output defines a conditional distribution p(y∣x,θ), where x is the input, y is the output, and $\theta \in \mathbb{R}^d$ is the vector of all weights and biases (d may be in the billions for modern architectures). The space of all such networks — indexed by their parameter vectors — constitutes a high-dimensional statistical manifold $\mathcal{M} = \{\, p(\cdot \mid \cdot, \theta) : \theta \in \mathbb{R}^d \,\}$.
For a classifier with softmax output, the distribution over classes is:
$$p(y = k \mid x, \theta) = \frac{e^{f_k(x, \theta)}}{\sum_{j=1}^{K} e^{f_j(x, \theta)}},$$
where fk(x,θ) is the k-th logit. For a regression model with Gaussian noise:
$$p(y \mid x, \theta) = \mathcal{N}\big(y;\, f(x, \theta),\, \sigma^2\big).$$
In both cases, the Fisher information matrix at θ is the expected outer product of score functions, with the expectation taken over inputs from the data distribution and outputs drawn from the model itself:

$$I(\theta) = \mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{y \sim p(y \mid x, \theta)}\!\left[ \nabla_\theta \log p(y \mid x, \theta)\, \nabla_\theta \log p(y \mid x, \theta)^\top \right].$$
This matrix — of size d×d — is the metric tensor of the neural network manifold. Its eigenvalues and eigenvectors describe the principal directions of statistical sensitivity in parameter space. Directions of high curvature (large eigenvalues) correspond to parameter changes that strongly alter the model's output distribution; directions of low curvature (small eigenvalues) correspond to near-degeneracies where many different parameter configurations yield essentially identical distributions.
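Because y ranges over only K classes, the expectation over the model's own outputs can be carried out exactly for a softmax classifier. The sketch below builds the single-input FIM of a linear softmax model (all sizes are illustrative) and shows how many exactly flat directions appear even in this tiny example:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 4                              # number of classes, input dimension
W = rng.normal(size=(K, n))              # weights of a linear softmax classifier
x = rng.normal(size=n)

logits = W @ x
p = np.exp(logits - logits.max())
p /= p.sum()

# FIM over theta = vec(W) for this single input:
# I = sum_k p_k grad_k grad_k^T, with grad_k = d log p(y=k|x) / d theta = vec((e_k - p) x^T).
fisher = np.zeros((K * n, K * n))
for k in range(K):
    grad_k = np.outer(np.eye(K)[k] - p, x).ravel()
    fisher += p[k] * np.outer(grad_k, grad_k)

eigvals = np.linalg.eigvalsh(fisher)
print(np.sum(eigvals > 1e-12))   # rank is at most K - 1 = 2; the other 10 directions are exactly flat
```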
Symmetries and Degeneracies
Neural network manifolds are characterized by extensive symmetries that render the FIM degenerate or near-degenerate along large subspaces. For feedforward networks, permutation symmetry (reordering neurons in a layer leaves the function unchanged) and sign symmetry (flipping the signs of all weights entering and leaving a neuron) create combinatorially large equivalence classes of parameters representing the same distribution.
Moreover, scale symmetry in ReLU networks — scaling a layer's incoming weights by λ>0 and outgoing weights by 1/λ preserves the function — generates entire one-dimensional orbits of equivalent parameters. These symmetries correspond to zero or near-zero eigenvalues of the FIM, creating flat directions in the manifold along which the likelihood is constant.
The structure of these flat directions is not merely a mathematical curiosity. It has profound implications for optimization: gradient descent can drift along flat directions without any change in loss, effective parameter counts are far lower than nominal counts, and the geometry of the loss landscape is radically different from a generic smooth function in Rd.
Natural Gradient Descent
The Problem with Standard Gradient Descent
Standard gradient descent updates parameters as:
$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t),$$
where $L(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid x_n, \theta)$ is the negative log-likelihood loss. This update is a step in the direction of steepest descent in the Euclidean metric on parameter space.
But the Euclidean metric has no intrinsic relationship to the statistical model. It treats all parameters as equally important and all parameter directions as equally significant — which is generically false. Consider a softmax classifier: increasing one weight by 0.01 may dramatically alter the output distribution while increasing another by 0.01 may barely affect it, depending on the local geometry.
The steepest descent direction depends on which metric we use. In a general Riemannian manifold with metric tensor gij(θ), the steepest descent direction in the local geometry is:
$$\tilde{\nabla} L(\theta) = G(\theta)^{-1} \nabla L(\theta),$$
where G(θ)=[gij(θ)] is the metric tensor matrix. With the Fisher metric G=I(θ), this gives the natural gradient:
$$\tilde{\nabla}_\theta L(\theta) = I(\theta)^{-1} \nabla_\theta L(\theta).$$
The Natural Gradient Algorithm
Amari's natural gradient descent algorithm is:
$$\theta_{t+1} = \theta_t - \eta\, I(\theta_t)^{-1} \nabla_\theta L(\theta_t).$$
This update performs steepest descent in the Fisher–Rao metric rather than the Euclidean metric. It is invariant under reparameterization: if we change coordinates θ↦ϕ(θ), the natural gradient algorithm takes the same step in distribution space, whereas ordinary gradient descent gives a different trajectory in each parameterization.
The natural gradient can be derived from a variational principle. The standard gradient step solves:
$$\theta_{t+1} = \arg\min_\theta\; \nabla L(\theta_t)^\top (\theta - \theta_t) + \frac{1}{2\eta}\, \|\theta - \theta_t\|^2.$$
The natural gradient step replaces the squared Euclidean distance with the KL divergence between the corresponding distributions:

$$\theta_{t+1} = \arg\min_\theta\; \nabla L(\theta_t)^\top (\theta - \theta_t) + \frac{1}{\eta}\, D_{\mathrm{KL}}\big(p(\cdot \mid \theta_t)\,\|\,p(\cdot \mid \theta)\big),$$

where we use the second-order approximation $D_{\mathrm{KL}} \approx \frac{1}{2} (\theta - \theta_t)^\top I(\theta_t)\, (\theta - \theta_t)$. The natural gradient is therefore the information-geometrically optimal parameter update: it takes the steepest step in distribution space, not in parameter space.
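As a small concrete illustration, the sketch below runs damped natural gradient descent on a logistic-regression model, for which the Fisher matrix can be formed exactly: the expectation over labels is taken under the model, and the expectation over inputs is the empirical one. The problem size, damping constant, and step size are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 500, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(d)
eta, damping = 1.0, 1e-3
for step in range(20):
    mu = sigmoid(X @ theta)
    grad = X.T @ (mu - y) / N                              # gradient of the average NLL
    # Exact Fisher: expectation over the model's y|x, empirical average over inputs.
    fisher = (X * (mu * (1.0 - mu))[:, None]).T @ X / N
    nat_grad = np.linalg.solve(fisher + damping * np.eye(d), grad)
    theta -= eta * nat_grad                                # damped natural gradient step

print(theta)         # close to the maximum-likelihood solution after a handful of steps
print(theta_true)    # the generating parameters, for rough comparison
```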
Convergence Properties
The natural gradient has well-studied convergence properties for exponential families. For a model in M and data generated by a distribution p∗∈M, the natural gradient descent on the KL divergence converges at a rate independent of the parameterization — precisely because it is intrinsic to the manifold.
More concretely, in a neighborhood of the true parameter θ∗, the expected loss surface is well approximated by a paraboloid whose quadratic form is the Fisher matrix (the standard local quadratic expansion of the expected log-likelihood). Natural gradient descent on a paraboloid is equivalent to Newton's method: the update θt+1=θt−ηI−1∇L is the Newton step when I equals the Hessian of L. For maximum likelihood estimation, the information matrix equality identifies the Fisher matrix with the expected Hessian of the negative log-likelihood, making the natural gradient a stochastic approximation to Newton's method.
This is the fundamental advantage of the natural gradient: near the optimum, it achieves quadratic convergence while standard gradient descent achieves only linear convergence. In practice this translates to dramatically fewer training steps for a given accuracy.
Practical Implications: K-FAC and Approximations
The exact natural gradient is computationally impractical for large neural networks: the FIM is d×d with d∼109 for modern models, making storage and inversion infeasible. Several practical approximations have been developed.
K-FAC (Kronecker-Factored Approximate Curvature), due to Martens and Grosse (2015), exploits the layered structure of neural networks. For a linear layer with weight matrix $W \in \mathbb{R}^{m \times n}$, the FIM restricted to that layer can be approximated as a Kronecker product:

$$I_W \approx A \otimes S,$$

where $A = \mathbb{E}[a a^\top]$ is the covariance of the layer's input activations $a$ and $S = \mathbb{E}[\delta \delta^\top]$ is the covariance of the pre-activation gradients $\delta$. The Kronecker structure makes the inverse tractable: $(A \otimes S)^{-1} = A^{-1} \otimes S^{-1}$, reducing the cost from $O(d^3)$ to $O(m^3 + n^3)$ per layer.
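A minimal sketch of the resulting preconditioned update for one linear layer follows, with random arrays standing in for the activations and backpropagated gradients (in a real implementation these come from forward and backward passes, and the factors are maintained as damped running averages). Up to the vectorization convention, $(A \otimes S)^{-1}\,\mathrm{vec}(\nabla_W)$ acts on the gradient matrix as $S^{-1}\, \nabla_W\, A^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 256, 20, 10

a = rng.normal(size=(batch, n_in))          # layer inputs (stand-in for real activations)
delta = rng.normal(size=(batch, n_out))     # gradients w.r.t. pre-activations (stand-in)
grad_W = delta.T @ a / batch                # ordinary gradient for W, shape (n_out, n_in)

# Kronecker factors of the layer's Fisher block, with damping for invertibility.
A = a.T @ a / batch + 1e-3 * np.eye(n_in)            # input-activation second moment
S = delta.T @ delta / batch + 1e-3 * np.eye(n_out)   # pre-activation-gradient second moment

# (A kron S)^{-1} vec(grad_W)  corresponds to  S^{-1} grad_W A^{-1} on the gradient matrix.
precond_grad = np.linalg.solve(S, grad_W) @ np.linalg.inv(A)
print(precond_grad.shape)    # (n_out, n_in): the K-FAC-style update direction
```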
K-FAC has demonstrated impressive results in supervised learning, reinforcement learning, and variational inference, often achieving 3–10× acceleration over standard Adam in terms of training steps required for a given loss.
Curvature of Statistical Models
Sectional Curvature and Learning
The Riemann curvature tensor of a statistical manifold measures the extent to which the manifold deviates from flat space. For a two-dimensional subspace spanned by tangent vectors u,v at θ, the sectional curvature is:
$$K(u, v) = \frac{\langle R(u, v)v,\, u \rangle}{|u|^2\, |v|^2 - \langle u, v \rangle^2},$$
where R is the Riemann curvature tensor of the Fisher metric. Positive curvature means the manifold locally resembles a sphere; negative curvature means it locally resembles a hyperbolic space.
For statistical models, the curvature has a concrete interpretation: it measures how much first-order approximations to the likelihood function fail. High curvature in a region of M implies that gradient-based methods will oscillate or overshoot, while low curvature implies smooth, nearly linear convergence.
For exponential families, the curvature under the mixture connection ∇(m) is identically zero — the manifold is m-flat. This flatness is what makes the EM algorithm (which performs m-projections and e-projections alternately) geometrically clean: each step lies on a flat submanifold.
Geodesic Distance and Generalization
An emerging line of research connects the geodesic structure of the neural network manifold to generalization. Intuitively, two sets of weights that are far apart in geodesic distance represent models that are statistically very different — even if they have similar training losses — while weights that are close in geodesic distance represent models with similar predictive distributions.
Flat minima of the loss landscape (regions of low Fisher information norm ∥I(θ)∥) correspond to regions of the manifold where the metric is nearly degenerate — where the model is statistically insensitive to parameter changes. Such minima are observed empirically to generalize better, a finding consistent with information-geometric intuition: a model that is uncertain about its own parameters (in the Fisher sense) is less likely to be overfit to training-set noise.
The sharpness of a minimum — typically measured by the largest eigenvalue of the Hessian of the loss — is approximated by the largest eigenvalue of the Fisher matrix for maximum-likelihood models. Sharpness-aware optimization methods such as SAM (Sharpness-Aware Minimization) can thus be reinterpreted as seeking flat regions of the statistical manifold, not merely of the loss landscape.
Applications in Machine Learning
Neural Network Optimization
The natural gradient and its approximations (K-FAC, EKFAC, FOOF) are the primary applications of information geometry to neural network optimization. Beyond K-FAC, more recent works have explored online natural gradient methods that maintain running estimates of the FIM using techniques from Riemannian online learning.
Adam and its variants — the most widely used optimizers in deep learning — can be partially interpreted through an information-geometric lens. Adam's diagonal preconditioner $v_t^{-1/2}$ approximates the inverse square root of a diagonal empirical Fisher, though it discards all off-diagonal correlations. The information-geometric perspective provides a principled justification for preconditioning and suggests systematic ways to improve upon Adam by capturing more of the manifold's curvature.
Variational Inference
In variational inference, one approximates an intractable posterior p(θ∣x) by a tractable variational distribution q(θ∣λ) in a chosen family, minimizing the KL divergence:
$$\lambda^{*} = \arg\min_\lambda\; D_{\mathrm{KL}}\big(q(\theta \mid \lambda) \,\|\, p(\theta \mid x)\big).$$
The variational parameter space Λ (parameterizing the family q) is itself a statistical manifold with a natural Fisher metric. The natural gradient in this space gives the natural gradient variational inference algorithm, which takes Fisher-metric-optimal steps in the space of approximate posteriors. For mean-field Gaussian posteriors, the natural gradient has a particularly clean form and converges dramatically faster than standard gradient descent on the ELBO.
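The mechanics can be illustrated in the simplest possible setting: a one-dimensional Gaussian variational family $q = \mathcal{N}(m, s^2)$, parameterized by $(m, \log s)$, fit to a known Gaussian target by minimizing the KL divergence directly (standing in for an intractable ELBO). The Fisher matrix of q in these coordinates is $\mathrm{diag}(1/s^2,\, 2)$; everything else in the sketch is an illustrative choice:

```python
import numpy as np

mu_star, sigma_star = 2.0, 1.5        # a tractable Gaussian "posterior", for illustration

def kl_q_p(m, rho):
    """KL( N(m, e^{2 rho}) || N(mu_star, sigma_star^2) )."""
    s = np.exp(rho)
    return np.log(sigma_star / s) + (s**2 + (m - mu_star)**2) / (2.0 * sigma_star**2) - 0.5

def grad_kl(m, rho):
    s = np.exp(rho)
    return np.array([(m - mu_star) / sigma_star**2,      # d KL / d m
                     -1.0 + s**2 / sigma_star**2])       # d KL / d rho

m, rho = 0.0, 0.0
for step in range(8):
    s = np.exp(rho)
    fisher_q = np.diag([1.0 / s**2, 2.0])   # Fisher of q = N(m, s^2) in (m, log s) coordinates
    m, rho = np.array([m, rho]) - np.linalg.solve(fisher_q, grad_kl(m, rho))
    print(step, kl_q_p(m, rho))              # drops toward 0 within a few natural-gradient steps
```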
Generative Models: VAEs and Diffusion
Variational autoencoders (VAEs) define a generative model pθ(x∣z)p(z) and an encoder qϕ(z∣x). Both the generative and inference networks define points on statistical manifolds — the manifold of decoders and the manifold of encoders, respectively. Training involves optimization on both manifolds simultaneously.
Recent work has explored geometry-aware training of VAEs using natural gradients for both θ and ϕ, finding improved posterior collapse behavior and better-structured latent spaces. The information geometry of the latent space — in particular, the curvature of the aggregate posterior qϕ(z)=∫qϕ(z∣x)p(x)dx — is connected to properties of the learned representation.
For diffusion models, the score function ∇xlogpt(x) — the central object in score-based generative modeling — is directly related to the Fisher information of the noisy distribution pt(x). The Fisher information $I(t) = \mathbb{E}_{p_t}\big[\|\nabla_x \log p_t(x)\|^2\big]$ decreases monotonically during the forward diffusion process, providing a natural measure of the "amount of structure" remaining at noise level t.
Reinforcement Learning: Policy Gradient Methods
In reinforcement learning, the agent's policy πθ(a∣s) is a parametric family of distributions over actions. The standard policy gradient theorem gives:
$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right],$$
where Gt is the return from time t. The parameter space of policies is a statistical manifold with Fisher metric:
$$I_{ij}(\theta) = \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta}\!\left[ \frac{\partial \log \pi_\theta(a \mid s)}{\partial \theta_i} \cdot \frac{\partial \log \pi_\theta(a \mid s)}{\partial \theta_j} \right].$$
Natural policy gradient (NPG), introduced by Kakade (2001), replaces the standard gradient with the natural gradient I(θ)−1∇θJ(θ). This gives a reparameterization-invariant policy update that takes Fisher-metric-optimal steps in the space of policies. NPG is the theoretical foundation for TRPO (Trust Region Policy Optimization) and PPO (Proximal Policy Optimization) — arguably the most successful policy gradient algorithms in practice — which approximate the natural-gradient step via KL trust-region constraints and clipped surrogate objectives, respectively.
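The simplest concrete instance is a single-state (bandit) problem with a tabular softmax policy, for which the Fisher matrix $\mathrm{diag}(\pi) - \pi \pi^\top$ can be written down exactly. The rewards, step size, and damping below are arbitrary choices for the example:

```python
import numpy as np

r = np.array([1.0, 0.2, 0.5, -0.3])      # per-action rewards in a single-state problem
theta = np.zeros(4)                       # softmax policy parameters
eta, damping = 0.5, 1e-4

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(50):
    pi = softmax(theta)
    # grad log pi(a) = e_a - pi, so grad J = sum_a pi_a (e_a - pi) r_a = pi * (r - pi.r).
    grad_J = pi * (r - pi @ r)
    fisher = np.diag(pi) - np.outer(pi, pi)     # singular along the all-ones direction
    nat_grad = np.linalg.solve(fisher + damping * np.eye(4), grad_J)
    theta += eta * nat_grad                      # natural policy gradient (ascent) step

print(softmax(theta))   # probability mass concentrates on the best action (index 0)
```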
Modern Research Directions
Spectral Properties of the Fisher Matrix
A significant body of recent research studies the eigenspectrum of the Fisher matrix (or equivalently, the Hessian of the loss) for deep neural networks. The empirical finding, robust across architectures and datasets, is that the spectrum is extremely heavy-tailed: a tiny fraction of eigenvalues are very large, while the vast majority are close to zero.
This spectral structure has profound implications:
The neural network manifold is nearly degenerate in most directions — the model is statistically insensitive to most parameter changes.
The effective dimensionality of the manifold (measured by the participation ratio of the eigenspectrum) is much smaller than the nominal parameter count.
Optimization dynamics are dominated by the top eigendirections, while the bottom eigendirections contribute degeneracy.
Recent work by Kaur, Cohen, et al. (2022) showed that the leading eigenvalues of the Hessian grow during training until they reach the classical stability threshold (the "edge of stability" phenomenon), providing a geometric explanation for why neural networks can continue to train with step sizes at or beyond that threshold.
Neural Tangent Kernel and Linearization
The neural tangent kernel (NTK), introduced by Jacot, Gabriel, and Hongler (2018), studies neural networks in the infinite-width limit. The NTK is defined as:
$$K(x, x') = \nabla_\theta f(x, \theta)^\top \nabla_\theta f(x', \theta),$$
where f(x,θ) is the network output. In the infinite-width limit, the NTK stays constant during training, and the training dynamics reduce to kernel regression.
The NTK is closely related to the Fisher information matrix: for regression models with unit noise, the empirical Fisher (Gauss–Newton) matrix $J^\top J$ and the empirical NTK Gram matrix $J J^\top$ are built from the same Jacobian J of outputs with respect to parameters, and therefore share the same nonzero eigenvalues. This connection means that the information-geometric structure of the parameter manifold is captured, in the infinite-width limit, by the NTK's eigenspectrum.
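The shared nonzero spectrum is easy to check numerically, with a random matrix standing in for the Jacobian $J = \nabla_\theta f$ (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 50                          # N data points, d parameters (d >> N)
J = rng.normal(size=(N, d))           # stand-in for the Jacobian of outputs w.r.t. parameters

fisher_like = J.T @ J / N             # d x d Fisher / Gauss-Newton matrix (unit-noise regression)
ntk = J @ J.T / N                     # N x N empirical NTK Gram matrix

print(np.sort(np.linalg.eigvalsh(fisher_like))[::-1][:N])   # top N eigenvalues...
print(np.sort(np.linalg.eigvalsh(ntk))[::-1])               # ...match the NTK spectrum exactly
```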
Neural Networks as Curves on Manifolds
A fruitful perspective views neural network training as a curve θ(t) on the statistical manifold M. Recent work by Garg et al. (2024) studied this perspective empirically, characterizing the geometric properties — length, curvature, torsion — of training trajectories under different optimizers and architectures.
Key findings include:
Adam trajectories are shorter (in Fisher distance) than SGD trajectories for the same change in loss, consistent with its more efficient use of the manifold geometry.
The curvature of the training trajectory is correlated with generalization: trajectories that bend more (in Fisher metric) tend to reach flatter minima.
The information length $\int_0^T \sqrt{\dot{\theta}(t)^\top I(\theta(t))\, \dot{\theta}(t)}\; dt$ of a training run may serve as a complexity measure for the learning process (a discretized computation is sketched below).
These findings suggest that the geometry of the optimization trajectory, not merely its endpoint, is relevant to the properties of the learned model — a perspective with rich potential for both theory and algorithm design.
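As a sketch of how this quantity can be computed, the following discretizes the information length along a synthetic path through the univariate Gaussian family; in practice the path would be a sequence of training checkpoints and I(θ) would be estimated at each one:

```python
import numpy as np

def fisher_gaussian(mu, sigma):
    """Fisher matrix of N(mu, sigma^2) in (mu, sigma) coordinates."""
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])

# A synthetic "training trajectory" through the Gaussian family: mu 0 -> 3, sigma 2 -> 1.
T = 200
ts = np.linspace(0.0, 1.0, T + 1)
trajectory = np.stack([3.0 * ts, 2.0 - ts], axis=1)

# Discretized information length: sum over segments of sqrt(d theta^T I(theta) d theta).
length = 0.0
for k in range(T):
    midpoint = 0.5 * (trajectory[k] + trajectory[k + 1])
    d_theta = trajectory[k + 1] - trajectory[k]
    length += np.sqrt(d_theta @ fisher_gaussian(*midpoint) @ d_theta)

print(length)   # Fisher-Rao length of the path (an upper bound on the geodesic distance)
```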
Geometric Interpretations of Generalization
One of the deepest open questions in deep learning theory is: why do overparameterized networks generalize? Information geometry offers a new angle. The model complexity of a neural network, from an information-geometric perspective, can be measured by:
The Fisher–Rao volume of the region in parameter space consistent with the training data.
The effective rank of the Fisher matrix at the optimum.
The intrinsic dimensionality of the statistical manifold traced by the training dynamics.
These measures are qualitatively different from classical notions of complexity (VC dimension, Rademacher complexity) and may be better suited to the overparameterized regime. Indeed, recent empirical work shows that the effective Fisher rank correlates strongly with generalization gap across architectures, providing a geometry-aware complexity measure.
Challenges and Limitations
Computational Intractability of the Full FIM
The central computational challenge of information geometry in practice is the intractability of the full Fisher information matrix. For a network with d parameters, the FIM is a d×d matrix with $O(d^2)$ entries. For modern large language models with $d \sim 10^{10}$, this is astronomically large — on the order of $10^{20}$ floating-point numbers, orders of magnitude beyond any feasible storage or computation.
Even forming a single column of the FIM — by computing the product of the FIM with a vector, i.e., the Fisher-vector product — requires computing gradients of N separate log-likelihood terms, with N the dataset size. Each Fisher-vector product costs O(Nd) operations.
Exact natural gradient descent is therefore infeasible for large-scale neural networks, and all practical applications require approximations.
Approximation Methods
Several principled approximation strategies have been developed:
Diagonal approximation: Approximate I(θ)≈diag(I(θ)), keeping only the diagonal entries. This is essentially the Adagrad/Adam strategy. It is cheap but ignores all parameter correlations.
Kronecker-factored approximations (K-FAC): As described above, approximate the FIM as a Kronecker product per layer. Captures within-layer correlations at tractable cost. The EKFAC (Eigenvalue-corrected K-FAC) variant improves accuracy by correcting for eigenvalue estimation bias.
Low-rank approximations: Represent the FIM as $I(\theta) \approx U U^\top + D$, where $U \in \mathbb{R}^{d \times k}$ captures the top-k eigendirections and D is diagonal. Effective when the FIM is spectrally dominated by a few directions.
Sketching and randomized methods: Use randomized linear algebra to estimate the FIM's action on random vectors, then approximate the inverse using techniques from numerical linear algebra.
Empirical Fisher: Replace the expectation over p(⋅∣θ) in the FIM definition with an average over the training data, using observed labels rather than model samples. This is cheaper to compute but introduces a bias that can be substantial far from the optimum.
Distributional Shift and Non-Stationarity
Information geometry assumes a fixed data distribution p(x) against which the Fisher matrix is computed. In continual learning, domain adaptation, and non-stationary reinforcement learning, the data distribution changes over time, meaning the manifold geometry is itself a moving target. Natural gradient methods designed for a fixed distribution may be poorly calibrated for the current geometry.
This challenge has motivated work on online information geometry — methods that maintain running estimates of the FIM that adapt to the current data distribution — and connections with Bayesian online learning, where the posterior over parameters tracks the FIM through the Laplace approximation.
The Future of Information Geometry in AI
Geometric Deep Learning
The nascent field of geometric deep learning studies neural networks that respect the symmetries and geometry of their input spaces. Group-equivariant networks, graph neural networks, and neural operators on manifolds all exploit non-Euclidean geometric structure in the data. Information geometry provides a complementary perspective: the parameter spaces of these models are themselves non-Euclidean, and natural gradient methods tailored to their geometry could yield significant improvements.
Information-Theoretic Learning Algorithms
There is growing interest in algorithms that optimize information-theoretic objectives — mutual information, entropy, Fisher information — directly, rather than as surrogates for other goals. InfoNCE and related contrastive objectives in self-supervised learning, MINE (Mutual Information Neural Estimation), and information-bottleneck methods all involve optimizing quantities with direct information-geometric interpretations.
A systematic information-geometric analysis of these objectives — characterizing the manifold structure of their optimal solutions, the curvature of the optimization landscape, and the natural gradient updates — could significantly advance both theory and practice.
Connections with Physics and Thermodynamics
Deep connections link information geometry to statistical mechanics and thermodynamics. The Fisher information plays the role of a metric in the thermodynamic state space, the KL divergence is related to free energy differences, and the natural gradient descent algorithm has been related to overdamped Langevin dynamics on the statistical manifold.
These connections suggest that the thermodynamic formalism — partition functions, entropy, free energy, fluctuation theorems — can be imported into machine learning as tools for analyzing neural network training and generalization. Recent work on the information geometry of neural scaling laws and emergence in large language models has begun to explore these connections.
Quantum Information Geometry
The quantum generalization of information geometry — where the Fisher metric is replaced by the quantum Fisher information (SLD Fisher metric or Bures metric) on the manifold of density matrices — has been studied extensively in quantum information theory. As quantum machine learning matures, quantum analogues of the natural gradient and quantum Fisher information are becoming practically relevant. The classical and quantum theories share deep structural parallels, suggesting that insights from one will continue to illuminate the other.
Conclusion
Information geometry reveals that machine learning models are not merely optimization problems but geometric structures defined on manifolds of probability distributions. The Fisher information metric — canonically characterized by Chentsov's theorem as the unique statistically invariant Riemannian metric — endows these manifolds with a rich geometry whose curvature, geodesics, and distances all carry concrete statistical meaning.
For neural networks, this perspective transforms how we understand training, optimization, and generalization. The parameter space of a deep network is a high-dimensional statistical manifold with an intricate, anisotropic geometry characterized by a heavy-tailed Fisher spectrum, extensive symmetry-induced degeneracies, and curvature that varies dramatically across the manifold. Standard gradient descent, which ignores this geometry, is suboptimal; the natural gradient — the steepest descent direction in the Fisher metric — is the information-geometrically correct update, and its approximations (K-FAC, EKFAC, and their descendants) have demonstrated empirical improvements across a wide range of tasks.
Looking forward, information geometry may be essential for the next generation of AI theory. As models grow in scale and complexity, purely empirical descriptions of their behavior are insufficient — we need principled mathematical frameworks that can predict, explain, and constrain their properties. The differential-geometric viewpoint offers one such framework: it connects the local statistical structure of neural networks to global properties of their manifolds, providing invariant, intrinsic descriptions of learning dynamics that transcend any particular parameterization.
The century-old machinery of Riemannian geometry, refined in the context of probability theory by Fisher, Rao, Chentsov, and Amari, is not merely an elegant reframing of familiar ideas. It is a genuinely different way of seeing — one that may, in the coming decades, reshape our understanding of what learning is and what intelligence, natural or artificial, fundamentally does.
References
Amari, S., & Nagaoka, H. (2000). Methods of Information Geometry. American Mathematical Society & Oxford University Press. (Translated from the 1993 Japanese edition.)
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Chentsov, N. N. (1982). Statistical Decision Rules and Optimal Inference. American Mathematical Society. (Translation of the 1972 Russian original.)
Fisher, R. A. (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22, 700–725.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–91.
Nielsen, F. (2020). An elementary introduction to information geometry. Entropy, 22(10), 1100.
Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. Proceedings of the 32nd International Conference on Machine Learning (ICML), 2408–2417.
Kakade, S. (2001). A natural policy gradient. Advances in Neural Information Processing Systems (NeurIPS), 14.
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS, 31.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. Proceedings of the 32nd International Conference on Machine Learning (ICML), 1889–1897.