Why optimization in deep learning is a geometric problem

At the heart of modern machine learning lies a deceptively simple question: how do we find the parameters of a model that make it perform well? Whether we are training a language model with billions of parameters or a compact convolutional network for image classification, the process always reduces to the same mathematical challenge — solving an optimization problem of the form

$$\min_{\theta \in \mathbb{R}^d} L(\theta),$$

where $\theta \in \mathbb{R}^d$ is the parameter vector and $L : \mathbb{R}^d \to \mathbb{R}$ is a scalar-valued loss function measuring how poorly the model fits the training data.
In principle, this is just calculus: set the gradient to zero and solve. In practice, however, $L$ is a highly nonlinear, non-convex function defined over a parameter space of enormous dimension. Modern large language models have hundreds of billions of parameters; even "small" networks typically have millions. Closed-form solutions are out of the question. Instead, we rely on iterative first-order methods — most fundamentally, gradient descent.
What makes this problem genuinely interesting — and genuinely difficult — is its geometric character. The loss function carves out a landscape over parameter space, and gradient descent is a dynamical system navigating that landscape. Whether training succeeds or fails depends critically on the topological and geometric structure of this landscape: where are the valleys? How deep are they? Are there flat plateaus that trap the optimizer? Are there saddle points where the gradient vanishes but no minimum has been reached?
This article develops a rigorous account of the geometry underlying gradient descent. We cover the fundamental algorithm, its second-order analysis via the Hessian, the classification of critical points, the peculiar prevalence of saddle points in high-dimensional optimization, and the modern theoretical and empirical understanding of loss landscapes in deep learning.
The gradient descent update rule is the simplest iterative scheme for minimizing a differentiable function:

$$\theta_{t+1} = \theta_t - \eta \,\nabla L(\theta_t).$$

Here, $\theta_t$ is the parameter vector at iteration $t$, $\eta > 0$ is the learning rate (step size), and $\nabla L(\theta_t)$ is the gradient of the loss evaluated at $\theta_t$.

The gradient $\nabla L(\theta_t)$ points in the direction of steepest ascent of $L$ at $\theta_t$ — the direction along which the function increases most rapidly for a unit perturbation. By moving in the opposite direction $-\nabla L(\theta_t)$, we take a step of steepest descent.
More precisely, the gradient is defined as the unique vector $\nabla L(\theta)$ such that for any direction $v \in \mathbb{R}^d$:

$$D_v L(\theta) = \lim_{h \to 0} \frac{L(\theta + h v) - L(\theta)}{h} = \langle \nabla L(\theta), v \rangle.$$
By the Cauchy–Schwarz inequality, the directional derivative over unit vectors $v$ is maximized when $v = \nabla L(\theta) / \|\nabla L(\theta)\|$. The negative of this direction, $-\nabla L(\theta) / \|\nabla L(\theta)\|$, thus provides the direction of steepest local decrease.
Under appropriate smoothness conditions — specifically, when $L$ is $\beta$-smooth (its gradient is $\beta$-Lipschitz continuous) — one can show that gradient descent with step size $\eta \le 1/\beta$ guarantees a decrease in the loss at every iteration until a critical point is reached. The decrease per step is at least:

$$L(\theta_t) - L(\theta_{t+1}) \;\ge\; \frac{\eta}{2}\,\|\nabla L(\theta_t)\|^2.$$
In practice, determining an appropriate learning rate is nontrivial and is itself a major area of research, involving schedules, adaptive methods, and theoretical analyses of stability.
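To make the update rule concrete, here is a minimal NumPy sketch of the descent loop on a toy quadratic loss. The matrix A, the starting point, and the learning rate are illustrative choices, not values from any real network.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta with a known gradient.
# 'A', 'theta0', and 'lr' are illustrative stand-ins, not values from the article.
A = np.diag([10.0, 1.0])          # anisotropic curvature

def loss(theta):
    return 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])      # initial point theta_0
lr = 0.08                         # learning rate eta (stable since eta < 2 / lambda_max = 0.2)

for t in range(100):
    theta = theta - lr * grad(theta)   # update: theta_{t+1} = theta_t - eta * grad L(theta_t)

print(loss(theta))                # close to 0, the global minimum of this toy problem
```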
It is useful — and geometrically accurate — to think of gradient descent as navigation on a surface. The loss function defines a hypersurface in $\mathbb{R}^{d+1}$: the graph $\{(\theta, L(\theta)) : \theta \in \mathbb{R}^d\}$. We call this the loss landscape.
For $d = 1$ or $d = 2$, we can visualize this landscape directly, as a curve in the plane or as a surface in three dimensions. The horizontal axes represent parameter values; the vertical axis represents loss. Gradient descent is then literally the process of rolling downhill on this surface, following the direction of steepest descent at each point.
The landscape contains several geometrically distinct features:
Valleys correspond to regions where the loss is lower than the surrounding area. A sharp, narrow valley indicates high curvature, while a wide, flat valley indicates low curvature. The geometry of valleys has profound implications for generalization: a model sitting at the bottom of a wide, flat valley is believed to generalize better than one at the bottom of a sharp, narrow minimum, because the former is less sensitive to perturbations of parameters.
Ridges are high-curvature regions that separate valleys. In high-dimensional spaces, ridges are hypersurfaces that form the boundaries between different basins of attraction.
Plateaus are regions where the gradient is very small in magnitude but no local minimum has been found. These are particularly dangerous for optimization: gradient descent makes negligible progress on plateaus, and training can appear to stall indefinitely.
In deep learning, $d$ may range from millions to hundreds of billions, making the full landscape literally unimaginable. Any "picture" of the loss landscape is necessarily a low-dimensional projection or slice, and our geometric intuitions, built in two and three dimensions, must be applied with caution. Nonetheless, the mathematical theory of critical points, curvature, and gradient flows extends cleanly to arbitrary dimension.
To understand the local behavior of gradient descent rigorously, we apply the Taylor expansion of $L$ around the current parameter vector $\theta_t$:

$$L(\theta_t + \Delta\theta) \;\approx\; L(\theta_t) + \nabla L(\theta_t)^\top \Delta\theta + \tfrac{1}{2}\,\Delta\theta^\top H(\theta_t)\,\Delta\theta.$$
The Hessian matrix $H(\theta) \in \mathbb{R}^{d \times d}$ is the matrix of second partial derivatives:

$$H(\theta)_{ij} = \frac{\partial^2 L}{\partial \theta_i \,\partial \theta_j}(\theta).$$
Since $L$ is real-valued, $H$ is symmetric by Schwarz's theorem (assuming continuity of the second derivatives). This symmetry is crucial: it ensures that $H$ is orthogonally diagonalizable with real eigenvalues, giving it a clean spectral theory.
The Taylor expansion reveals the geometry of the loss surface at $\theta_t$ through the quadratic term $\tfrac{1}{2}\,\Delta\theta^\top H\,\Delta\theta$. This quadratic form determines how the loss changes as we move away from $\theta_t$ in direction $\Delta\theta$: writing the spectral decomposition $H = \sum_i \lambda_i v_i v_i^\top$ with orthonormal eigenvectors $v_i$ and decomposing $\Delta\theta = \sum_i c_i v_i$,

$$\tfrac{1}{2}\,\Delta\theta^\top H\,\Delta\theta = \tfrac{1}{2}\sum_i \lambda_i\, c_i^2.$$
This spectral analysis of the Hessian is the key tool for understanding local curvature, and thus for understanding how gradient descent behaves locally.
The second-order Taylor expansion also provides a principled derivation of the learning rate bound. Applying the update $\Delta\theta = -\eta\,\nabla L(\theta_t)$:

$$L(\theta_{t+1}) \;\approx\; L(\theta_t) - \eta\,\|\nabla L(\theta_t)\|^2 + \frac{\eta^2}{2}\,\nabla L(\theta_t)^\top H\,\nabla L(\theta_t).$$

For the loss to decrease, we need:

$$\eta\,\|\nabla L(\theta_t)\|^2 \;>\; \frac{\eta^2}{2}\,\nabla L(\theta_t)^\top H\,\nabla L(\theta_t).$$
The worst-case direction is the leading eigenvector of $H$ with eigenvalue $\lambda_{\max}$, giving the condition $\eta < 2/\lambda_{\max}$. This is precisely the smoothness-based bound (with $\beta = \lambda_{\max}$). High curvature (large $\lambda_{\max}$) forces a small learning rate.
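The stability threshold $\eta < 2/\lambda_{\max}$ is easy to verify numerically. The sketch below uses an illustrative Hessian and two step sizes of our choosing, one just below and one just above the threshold.

```python
import numpy as np

# Quadratic loss 0.5 * theta^T H theta; lambda_max sets the stability threshold eta < 2 / lambda_max.
# The matrix H and the two step sizes are illustrative, not taken from the article.
H = np.diag([50.0, 1.0])
lam_max = np.linalg.eigvalsh(H).max()

def run(eta, steps=200):
    theta = np.ones(2)
    for _ in range(steps):
        theta = theta - eta * (H @ theta)   # gradient of 0.5 * theta^T H theta is H theta
    return 0.5 * theta @ H @ theta

print("threshold 2/lambda_max =", 2 / lam_max)   # 0.04 here
print("eta = 0.035 ->", run(0.035))              # below threshold: loss decays toward 0
print("eta = 0.045 ->", run(0.045))              # above threshold: loss blows up
```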
A critical point (also called a stationary point) of $L$ is any parameter vector $\theta^*$ where the gradient vanishes:

$$\nabla L(\theta^*) = 0.$$
At a critical point, gradient descent makes no progress: $\theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t) = \theta_t$. The optimizer has "stopped." The central question is: what kind of critical point has it found?
The Hessian at a critical point completely determines the local geometry (up to second order) and classifies the critical point into one of three types.
$\theta^*$ is a strict local minimum if there exists $\varepsilon > 0$ such that $L(\theta^*) < L(\theta)$ for all $\theta \ne \theta^*$ with $\|\theta - \theta^*\| < \varepsilon$. The second-order necessary condition is:

$$H(\theta^*) \succeq 0 \quad \text{(all eigenvalues nonnegative)}.$$
The sufficient condition is strict positive definiteness: $H(\theta^*) \succ 0$, meaning all eigenvalues of $H(\theta^*)$ are strictly positive. Geometrically, the loss surface is locally bowl-shaped around $\theta^*$: every direction away from $\theta^*$ leads uphill.
$\theta^*$ is a strict local maximum if $H(\theta^*) \prec 0$ — all eigenvalues are strictly negative. Every direction away from $\theta^*$ leads downhill. In high-dimensional optimization, local maxima are extremely rare: for a point to be a local maximum requires all $d$ eigenvalues to be negative, which becomes exponentially unlikely as $d$ grows. We essentially never encounter them in neural network training.
$\theta^*$ is a saddle point if $H(\theta^*)$ is indefinite — it has both positive and negative eigenvalues. At a saddle point, some directions lead uphill (positive curvature) and others lead downhill (negative curvature). In the two-dimensional case, the classic picture is a mountain pass: higher than the valleys on two sides, lower than the peaks on two other sides.
More precisely, at a saddle point $\theta_s$: $\nabla L(\theta_s) = 0$, and $H(\theta_s)$ has at least one strictly positive and at least one strictly negative eigenvalue.
The escape directions are the eigenvectors of $H(\theta_s)$ corresponding to negative eigenvalues — directions along which the surface slopes downward. A gradient-based optimizer that could identify these directions would escape the saddle; the challenge is that at $\theta_s$, the gradient is zero and provides no information.
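The taxonomy above translates directly into a few lines of code. This sketch, with made-up example Hessians, labels a critical point from the signs of its Hessian eigenvalues.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point from its Hessian spectrum (a sketch of the taxonomy above)."""
    eig = np.linalg.eigvalsh(hessian)          # real eigenvalues; the Hessian is assumed symmetric
    if np.any(np.abs(eig) < tol):
        return "degenerate (some eigenvalues ~ 0; second-order information is inconclusive)"
    if np.all(eig > 0):
        return "strict local minimum (all eigenvalues positive)"
    if np.all(eig < 0):
        return "strict local maximum (all eigenvalues negative)"
    return "saddle point (mixed signs; escape directions = eigenvectors with negative eigenvalues)"

# Illustrative Hessians, not from the article.
print(classify_critical_point(np.diag([2.0, 0.5])))     # bowl: local minimum
print(classify_critical_point(np.diag([2.0, -0.5])))    # mountain pass: saddle
```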
The analysis of critical points in high-dimensional parameter spaces reveals a striking fact that distinguishes deep learning from classical optimization: saddle points vastly outnumber local minima.
To see why, consider a heuristic probabilistic argument. At a randomly chosen critical point in a network with $d$ parameters, the Hessian has $d$ eigenvalues. For the point to be a local minimum, all $d$ eigenvalues must be positive. If we model the sign of each eigenvalue as an independent Bernoulli random variable with probability $1/2$ of being positive, the probability of a local minimum is $2^{-d}$, which shrinks exponentially in $d$. For any realistically large $d$, the vast majority of critical points are saddle points.
This argument is heuristic, but it is supported by rigorous results from random matrix theory. In the limit of large, randomly initialized networks, the expected fraction of critical points that are saddle points approaches 1, and the local minima that do exist tend to cluster near the global minimum.
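A quick Monte Carlo experiment makes the heuristic tangible. Below, random symmetric matrices serve as a crude surrogate for Hessians at random critical points (the ensemble and trial count are illustrative choices, not a model of any particular network); the fraction with all-positive eigenvalues collapses as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_all_positive(d, trials=2000):
    """Fraction of random symmetric d x d matrices whose eigenvalues are all positive."""
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((d, d))
        H = (A + A.T) / 2                       # symmetric "Hessian" surrogate
        if np.all(np.linalg.eigvalsh(H) > 0):
            count += 1
    return count / trials

for d in [1, 2, 4, 8]:
    print(d, frac_all_positive(d))              # drops rapidly toward 0 as d grows
```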
Near a saddle point $\theta_s$, the gradient is small in magnitude. For the optimizer near $\theta_s$:

$$\theta_{t+1} - \theta_s \;\approx\; \big(I - \eta\,H(\theta_s)\big)\,(\theta_t - \theta_s)$$

(using a linear approximation of the gradient). This is a linear dynamical system. Along eigenvector $v_i$ with eigenvalue $\lambda_i$, the component $c_i(t) = v_i^\top(\theta_t - \theta_s)$ of $\theta_t - \theta_s$ evolves as:

$$c_i(t) = (1 - \eta\,\lambda_i)^t\, c_i(0).$$

For positive $\lambda_i$ (uphill direction), $|1 - \eta\,\lambda_i| < 1$ for small $\eta$, so the optimizer contracts toward $\theta_s$ along $v_i$: it is attracted. For negative $\lambda_i$ (downhill direction), $1 - \eta\,\lambda_i > 1$, so the optimizer expands away from $\theta_s$ along $v_i$: it is repelled, and will naturally escape along the escape direction.
However, this dynamics is asymmetric: the attraction along uphill directions can dominate in early iterations, causing the optimizer to approach the saddle before the repulsion along escape directions takes effect. This gives rise to the phenomenon of saddle-point slowing: gradient descent can linger near a saddle point for many iterations, making negligible progress, before eventually escaping.
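The lingering-then-escaping behavior can be reproduced on a toy quadratic saddle. In the sketch below the curvatures, step size, and starting point are illustrative; the positive-curvature coordinate collapses quickly while the weak negative-curvature coordinate grows only slowly.

```python
import numpy as np

# Quadratic saddle L = 0.5 * (lam_plus * x^2 + lam_minus * y^2) with lam_minus < 0.
# Eigenvalues, step size, and starting point are illustrative choices.
lam_plus, lam_minus = 5.0, -0.01    # strong uphill curvature, weak escape direction
eta = 0.1
theta = np.array([1.0, 1e-3])       # starts mostly along the attracted (positive-curvature) direction

for t in range(401):
    grad = np.array([lam_plus * theta[0], lam_minus * theta[1]])
    theta = theta - eta * grad       # contracts by (1 - eta*lam_plus), expands by (1 - eta*lam_minus)
    if t % 100 == 0:
        # distance to the saddle collapses first (lingering), then creeps outward along the escape direction
        print(t, np.linalg.norm(theta))
```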
A saddle point is strict if all eigenvalues of $H(\theta_s)$ are nonzero (i.e., $H$ is nonsingular at the saddle). Strict saddle points have well-defined escape directions, and gradient descent — especially with noise — can escape them efficiently.
A degenerate saddle has one or more zero eigenvalues. These are more dangerous: in the degenerate subspace, neither the gradient nor the second-order information provides guidance, and higher-order analysis or perturbation methods may be required. Flat regions of the loss landscape, where the Hessian has many near-zero eigenvalues, are particularly challenging.
The Hessian captures not just the type of critical point but the speed of convergence throughout the optimization trajectory. Understanding curvature is central to understanding why some regions of parameter space are easy to optimize through, while others trap the optimizer.
Define the condition number of the Hessian at $\theta$ as:

$$\kappa(H) = \frac{\lambda_{\max}}{\lambda_{\min}},$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest positive eigenvalues (assuming $H$ is positive definite near a minimum, for simplicity).
A large condition number means the loss landscape is highly anisotropic: very steep in some directions and very flat in others. This is precisely the situation of ill-conditioning, and it causes severe difficulties for gradient descent.
To see why, suppose the Hessian is diagonal with eigenvalues $\lambda_{\max} \gg \lambda_{\min} > 0$ (a highly anisotropic "valley"). The optimal step size for the steep direction is about $1/\lambda_{\max}$, while the optimal step size for the flat direction is about $1/\lambda_{\min}$. If $\kappa$ is large, these are very different; any fixed step size either overshoots in the steep direction or makes negligible progress in the flat direction.
Gradient descent with a well-chosen fixed step size on a quadratic with condition number $\kappa$ converges at rate:

$$\|\theta_t - \theta^*\| \;\le\; \left(\frac{\kappa - 1}{\kappa + 1}\right)^{t} \|\theta_0 - \theta^*\|.$$
For large $\kappa$, $\frac{\kappa - 1}{\kappa + 1} \approx 1 - \frac{2}{\kappa}$, and convergence is extremely slow. This is why poorly conditioned problems require many more iterations.
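The $(\kappa-1)/(\kappa+1)$ contraction factor is easy to check numerically. The following sketch runs gradient descent on 2-D quadratics with increasing condition number (the matrices and iteration count are illustrative) and compares the remaining error with the predicted factor.

```python
import numpy as np

def gd_error(kappa, steps=500):
    """Distance to the minimum after gradient descent on a 2-D quadratic with condition number kappa."""
    H = np.diag([kappa, 1.0])                    # lambda_max = kappa, lambda_min = 1
    eta = 2.0 / (kappa + 1.0)                    # optimal fixed step size for a quadratic
    theta = np.ones(2)
    for _ in range(steps):
        theta = theta - eta * (H @ theta)
    return np.linalg.norm(theta)

for kappa in [2, 10, 100, 1000]:
    predicted = ((kappa - 1) / (kappa + 1)) ** 500   # predicted contraction after 500 steps
    print(kappa, gd_error(kappa), predicted)          # error grows dramatically with kappa
```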
A more sophisticated geometric view replaces the Euclidean metric on parameter space with a Riemannian metric induced by the loss curvature. In natural gradient methods, the metric tensor is taken to be the Fisher information matrix $F(\theta)$, which provides a natural geometry on the space of probability distributions parameterized by $\theta$. The natural gradient update:

$$\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1}\, \nabla L(\theta_t)$$

preconditions the gradient by the inverse of the Fisher metric, effectively rescaling each parameter direction by the local curvature. This transforms the optimization problem into a better-conditioned one and is the basis of natural gradient descent and the K-FAC optimizer.
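The effect of this kind of preconditioning is visible even on a toy quadratic. In the sketch below an explicit curvature matrix C stands in for the Fisher metric F, and the step sizes are illustrative choices; on a real probabilistic model F would have to be estimated from per-example score functions, as K-FAC does approximately.

```python
import numpy as np

# Preconditioned gradient step on an ill-conditioned quadratic. The curvature matrix C
# is a stand-in for the Fisher metric F; this illustrates the idea, not a full implementation.
C = np.diag([100.0, 1.0])
grad_fn = lambda theta: C @ theta                 # gradient of 0.5 * theta^T C theta

theta_gd = np.ones(2)
theta_ng = np.ones(2)
eta = 0.9                                         # aggressive step, usable only after preconditioning
for _ in range(20):
    theta_gd = theta_gd - (1.0 / 100.0) * grad_fn(theta_gd)            # plain GD: step capped by 1/lambda_max
    theta_ng = theta_ng - eta * np.linalg.solve(C, grad_fn(theta_ng))  # preconditioned step: C^{-1} grad

print("plain GD distance to optimum:      ", np.linalg.norm(theta_gd))
print("preconditioned distance to optimum:", np.linalg.norm(theta_ng))
```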
The most direct way to account for curvature is to use second-order optimization methods, which explicitly incorporate the Hessian into the update step.
Newton's method derives from a second-order Taylor expansion. At $\theta_t$, we minimize the local quadratic approximation:

$$q(\Delta\theta) = L(\theta_t) + \nabla L(\theta_t)^\top \Delta\theta + \tfrac{1}{2}\,\Delta\theta^\top H(\theta_t)\,\Delta\theta.$$

Setting $\nabla_{\Delta\theta}\, q(\Delta\theta) = 0$ gives the Newton step:

$$\Delta\theta = -\,H(\theta_t)^{-1}\,\nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t + \Delta\theta.$$
When $H(\theta_t)$ is positive definite, this update exactly minimizes the local quadratic approximation and achieves quadratic convergence near a local minimum: the number of correct decimal places doubles at each iteration.
The geometric intuition is clear: Newton's method accounts for the anisotropy of the loss landscape. By multiplying the gradient by $H^{-1}$, it rescales the step to account for how curved the landscape is in each direction. Where the surface is steep (large eigenvalue), the step is small; where the surface is flat (small eigenvalue), the step is large.
Despite its theoretical elegance, Newton's method faces severe practical limitations in deep learning:
Computational cost. Computing and storing the full Hessian requires $O(d^2)$ memory and $O(d^3)$ time to invert. For $d = 10^7$ parameters, storing $H$ requires $10^{14}$ floating-point numbers — about a petabyte of memory. This is completely infeasible.
Instability near saddle points. When $H$ has negative eigenvalues (saddle point), $H^{-1}$ is also indefinite, and the Newton step points in an ascent direction along the negative-curvature modes. Rather than escaping the saddle, Newton's method can move toward it. Various modifications (trust-region methods, Levenberg–Marquardt regularization) address this by replacing $H$ with $H + \lambda I$ for $\lambda$ large enough to ensure positive definiteness.
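A minimal sketch of the damped step: shift the Hessian by $\lambda I$ until it is positive definite, so the resulting step is a descent direction even when the raw Hessian is indefinite. The damping rule and the example Hessian below are illustrative simplifications of trust-region practice.

```python
import numpy as np

def damped_newton_step(grad, hessian, damping=1e-2):
    """Newton step with Levenberg-Marquardt style damping: solve (H + lam*I) d = -g.
    The damping value is an illustrative default; in practice it is adapted each step."""
    eig_min = np.linalg.eigvalsh(hessian).min()
    lam = max(damping, damping - eig_min)        # shift enough to make H + lam*I positive definite
    return np.linalg.solve(hessian + lam * np.eye(len(grad)), -grad)

# Indefinite Hessian (saddle): the raw Newton step would move uphill along the
# negative-curvature mode; the damped step still points downhill.
H = np.diag([2.0, -0.5])
g = np.array([1.0, 1.0])
step = damped_newton_step(g, H, damping=0.1)
print(step, "descent direction?", g @ step < 0)
```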
Quasi-Newton methods such as L-BFGS approximate the Hessian using a history of gradient differences, achieving super-linear convergence without the memory cost. These are widely used in full-batch optimization but become less effective in the stochastic (mini-batch) setting.
Over the past decade, the geometry of neural network loss landscapes has become a major research area. We survey the key empirical and theoretical findings.
A central conjecture in deep learning theory — originally proposed by Hochreiter and Schmidhuber (1997) and revived by Keskar et al. (2017) — is that flat minima generalize better than sharp minima.
The argument is intuitive: if a minimum is "sharp" (high Hessian curvature), a small perturbation of $\theta$ — corresponding to a distribution shift in the input data — produces a large increase in loss. A model sitting at a flat minimum is robust to such perturbations and therefore generalizes to unseen data.
Define the sharpness of a minimum $\theta^*$ as:

$$\text{sharpness}(\theta^*) = \max_{\|\delta\| \le \epsilon} \big[\, L(\theta^* + \delta) - L(\theta^*) \,\big],$$

the worst-case increase in loss within a small ball around the minimum.
Keskar et al. showed empirically that large-batch training tends to converge to sharper minima with worse generalization, while small-batch training finds flatter minima. Subsequent work has complicated this picture (showing that learning rate and other factors matter as much as batch size), but the fundamental connection between Hessian flatness and generalization remains an active area.
One rigorous connection comes from PAC-Bayes theory: the Hessian trace appears as a complexity measure in generalization bounds, with smaller trace corresponding to better expected generalization.
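In practice, curvature-based sharpness proxies such as the leading Hessian eigenvalue are often estimated without ever forming the Hessian, via power iteration on Hessian-vector products. The sketch below uses a toy quadratic loss and finite-difference Hessian-vector products as stand-ins; for a real network one would compute the products with automatic differentiation.

```python
import numpy as np

# Estimate lambda_max of the Hessian (a common sharpness proxy near a minimum) with
# power iteration, using finite-difference Hessian-vector products.
# The toy gradient below is an illustrative stand-in for a trained network's loss gradient.

def loss_grad(theta):
    A = np.diag([30.0, 3.0, 0.3])
    return A @ theta                              # gradient of 0.5 * theta^T A theta

def hvp(theta, v, eps=1e-4):
    """Hessian-vector product via central differences of the gradient."""
    return (loss_grad(theta + eps * v) - loss_grad(theta - eps * v)) / (2 * eps)

def top_eigenvalue(theta, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(theta, v)
        v = w / np.linalg.norm(w)                 # power iteration on the Hessian
    return v @ hvp(theta, v)                      # Rayleigh quotient ~ lambda_max

print(top_eigenvalue(np.zeros(3)))                # ~30 for this toy quadratic
```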
A striking empirical finding — shown by Garipov et al. (2018) and Draxler et al. (2018) — is that different local minima of neural network loss functions are often connected by low-loss paths in parameter space, sometimes called "loss tunnels." Along these paths, the loss remains near its minimum value despite large changes in the parameter vector.
This finding suggests that the loss landscape, rather than having isolated local minima separated by high barriers, forms a connected set of low-loss regions. Gradient descent starting from different random initializations finds different points in this low-loss set, which may explain why neural networks are relatively insensitive to initialization (beyond the critical need to break symmetry).
This geometry is sometimes described as a loss basin or flat valley — a high-dimensional region of near-constant low loss, contrasted with sharp barriers at higher loss values.
Direct visualization of the loss landscape is impossible for $d > 2$, but low-dimensional projections provide geometric intuition. The method of Li et al. (2018) ("loss landscape visualization") plots $L(\theta^* + \alpha\,\delta + \beta\,\eta)$ as a function of $(\alpha, \beta)$ for two suitably normalized random directions $\delta$ and $\eta$. This 2D plot reveals whether the minimum is "wide" or "sharp."
The filter normalization technique (rescaling the random direction so that $\|\delta_\ell\| = \|\theta_\ell\|$ for each filter $\ell$ within each layer) partially corrects for the scale-invariance of batch-normalized networks, making visualizations more meaningful.
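A minimal version of such a 2-D slice, with a toy quadratic standing in for a network loss and plain unit-norm directions in place of filter normalization:

```python
import numpy as np
import matplotlib.pyplot as plt

# 2-D slice of a loss landscape: evaluate L(theta* + a*d1 + b*d2) on a grid of (a, b).
# The quadratic 'loss' and 'theta_star' are illustrative stand-ins; for a real network
# one would evaluate the training loss at each perturbed parameter vector.
rng = np.random.default_rng(0)
dim = 50
H = np.diag(np.linspace(0.1, 20.0, dim))
theta_star = np.zeros(dim)                        # pretend this is the trained minimum
loss = lambda th: 0.5 * th @ H @ th

d1 = rng.standard_normal(dim); d1 /= np.linalg.norm(d1)
d2 = rng.standard_normal(dim); d2 /= np.linalg.norm(d2)
# (For a real network, Li et al.'s filter normalization would rescale the directions
#  filter by filter to match the trained weights.)

alphas = np.linspace(-1.0, 1.0, 51)
grid = np.array([[loss(theta_star + a * d1 + b * d2) for a in alphas] for b in alphas])

plt.contourf(alphas, alphas, grid, levels=30)
plt.xlabel("alpha (direction d1)")
plt.ylabel("beta (direction d2)")
plt.savefig("loss_slice.png")                     # wide vs. sharp basins appear as loose vs. tight contours
```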
Given the prevalence of saddle points, understanding how gradient-based methods escape them is crucial.
In practice, deep learning uses stochastic gradient descent (SGD): at each iteration, the gradient is estimated from a mini-batch of data rather than the full dataset. The stochastic gradient is an unbiased estimate of $\nabla L(\theta_t)$:

$$\hat{g}_t = \frac{1}{|B_t|} \sum_{i \in B_t} \nabla \ell_i(\theta_t), \qquad \mathbb{E}\big[\hat{g}_t\big] = \nabla L(\theta_t),$$

where $B_t$ is a random mini-batch and $\ell_i$ is the loss on example $i$. The gradient noise acts as a random perturbation. Near a saddle point, this noise has components along the escape directions (negative-curvature eigenvectors of $H$), and these components grow exponentially under the gradient dynamics. SGD therefore escapes saddle points more readily than full-batch gradient descent.
The covariance of the gradient noise is approximately $\frac{1}{|B|}\,\Sigma(\theta)$, where $\Sigma(\theta)$ is the covariance of the per-example gradients. This covariance has been related to the curvature of the loss landscape, providing a theoretical link between mini-batch noise and implicit regularization.
For a more controlled escape mechanism, perturbed gradient descent (Ge et al., 2015; Jin et al., 2017) adds an explicit random perturbation whenever the gradient is small:

$$\theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t) + \xi_t,$$

where $\xi_t$ is isotropic Gaussian noise.
Jin et al. (2017) proved that perturbed gradient descent (PGD) escapes strict saddle points in polynomial time:
Theorem (Jin et al., 2017): Let $L$ be $\ell$-smooth with $\rho$-Lipschitz Hessian. Perturbed gradient descent finds an $\epsilon$-approximate second-order stationary point (where $\|\nabla L(\theta)\| \le \epsilon$ and $\lambda_{\min}\big(\nabla^2 L(\theta)\big) \ge -\sqrt{\rho\,\epsilon}$) in $\tilde{O}(1/\epsilon^{2})$ iterations, with only polylogarithmic dependence on the dimension.
An $\epsilon$-approximate second-order stationary point is essentially a local minimum (up to $\epsilon$-tolerance) because the Hessian is nearly positive semi-definite. This result shows that escaping saddle points in nonconvex optimization is tractable — saddle points are not a fundamental barrier.
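A schematic version of the perturbation idea (not the exact algorithm or constants of Jin et al.): take ordinary gradient steps, but inject isotropic noise whenever the gradient norm falls below a threshold. The toy saddle and the thresholds below are illustrative.

```python
import numpy as np

# Perturbed gradient descent in the spirit of Jin et al. (2017): when the gradient is
# tiny (a possible saddle), inject isotropic noise. Constants are illustrative.
rng = np.random.default_rng(0)
H = np.diag([1.0, -0.1])                          # quadratic saddle at the origin
grad = lambda th: H @ th

theta = np.array([0.0, 0.0])                      # start exactly at the saddle: grad = 0
eta, noise_radius, grad_tol = 0.1, 1e-2, 1e-3

for t in range(300):
    g = grad(theta)
    if np.linalg.norm(g) < grad_tol:
        theta = theta + noise_radius * rng.standard_normal(2)   # perturbation step
    else:
        theta = theta - eta * g                                 # ordinary gradient step

print(theta)   # the negative-curvature coordinate grows away from the saddle: escape
```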
Momentum methods maintain a running average of past gradients, which accumulates in directions of consistent gradient and cancels in oscillating directions. The classical momentum update:

$$v_{t+1} = \mu\, v_t - \eta\,\nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$

is known to improve convergence in ill-conditioned problems by effectively increasing the step size in low-curvature directions. Near a saddle point, momentum can help by carrying the optimizer through the saddle before the slow gradient dynamics can pull it back.
More recently, adaptive methods such as Adam (Kingma and Ba, 2014) maintain per-parameter step sizes adapted to the history of gradient magnitudes, providing effective preconditioning that implicitly accounts for local curvature.
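The benefit of momentum in an ill-conditioned valley shows up even on a 2-D quadratic. In the sketch below the Hessian and the hyperparameters are illustrative choices, not tuned values.

```python
import numpy as np

# Heavy-ball momentum vs. plain gradient descent on an ill-conditioned quadratic.
H = np.diag([50.0, 1.0])
grad = lambda th: H @ th

theta_gd = np.ones(2)
theta_mom, v = np.ones(2), np.zeros(2)
eta, mu = 0.02, 0.9                               # eta < 2 / lambda_max keeps both runs stable

for _ in range(200):
    theta_gd = theta_gd - eta * grad(theta_gd)    # plain gradient descent
    v = mu * v - eta * grad(theta_mom)            # accumulate velocity
    theta_mom = theta_mom + v                     # momentum update

print("plain GD distance to optimum:", np.linalg.norm(theta_gd))
print("momentum distance to optimum:", np.linalg.norm(theta_mom))
```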
The spectrum of the Hessian — the distribution of its eigenvalues — has become a key diagnostic tool for understanding training dynamics. Ghorbani et al. (2019) showed that the Hessian spectrum of neural networks has a distinctive structure: a bulk of small eigenvalues near zero, and a few large outliers. The outliers correspond to the most influential parameter directions; the bulk indicates a vast near-flat subspace in parameter space.
This spectral structure partially explains both the success of gradient descent (the near-flat subspace offers little resistance) and the challenges of optimization (the large outliers create ill-conditioning).
In the infinite-width limit, neural networks enjoy a remarkable simplification: training dynamics become linear in the parameters, governed by the Neural Tangent Kernel (NTK):

$$K(x, x') = \nabla_\theta f(x; \theta)^\top\, \nabla_\theta f(x'; \theta),$$

where $f(x; \theta)$ is the network output. In this regime, the training objective is effectively convex, and gradient descent converges to a global minimum. The NTK framework provides exact convergence guarantees for overparameterized networks but requires infinite width — a strong idealization.
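The empirical (finite-width) NTK is just a Gram matrix of parameter gradients of the network output. Here is a small sketch for a toy one-hidden-layer network; the architecture, width, and the finite-difference Jacobian are simplifications for illustration, and the exact NTK statements hold only in the infinite-width limit.

```python
import numpy as np

# Empirical NTK entry K(x, x') = <grad_theta f(x; theta), grad_theta f(x'; theta)>
# for a tiny one-hidden-layer network with scalar output.
rng = np.random.default_rng(0)
width, d_in = 64, 3
params = rng.standard_normal(width * d_in + width) / np.sqrt(width)   # [W (width x d_in), v (width)]

def f(x, p):
    W = p[: width * d_in].reshape(width, d_in)
    v = p[width * d_in:]
    return v @ np.tanh(W @ x) / np.sqrt(width)    # scalar network output

def param_grad(x, p, eps=1e-5):
    """Finite-difference gradient of the output with respect to the parameters."""
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p); e[i] = eps
        g[i] = (f(x, p + e) - f(x, p - e)) / (2 * eps)
    return g

x1, x2 = rng.standard_normal(d_in), rng.standard_normal(d_in)
print(param_grad(x1, params) @ param_grad(x2, params))   # one entry of the empirical NTK Gram matrix
```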
Topological data analysis (TDA) tools, particularly persistent homology, have begun to be applied to loss landscapes. By tracking the birth and death of topological features (connected components, loops, voids) as a filtration of the sublevel sets sweeps through loss values, researchers can quantify the complexity of the landscape in a way that is invariant to smooth deformations.
The analogy between neural network optimization and disordered systems in statistical physics — particularly spin glasses — has been influential. The loss landscape of a spin glass has an exponential number of local minima at high energy, transitioning to a connected set of near-global minima at low energy. Choromanska et al. (2015) formalized this analogy and argued that local minima of neural networks (beyond a certain low loss threshold) are approximately as good as the global minimum, providing a theoretical basis for why gradient descent works despite non-convexity.
The geometric understanding of gradient descent has direct implications for how practitioners train neural networks.
Optimizer design. The geometry of the loss landscape motivates choices of optimizer. For ill-conditioned landscapes (large condition number), adaptive methods like Adam or second-order approximations like K-FAC can dramatically accelerate training. For well-conditioned problems near global minima, standard SGD with momentum often achieves better generalization, possibly by finding flatter minima.
Learning rate schedules. Early in training, when the optimizer is far from any minimum, a large learning rate enables rapid traversal of the loss landscape. Late in training, when the optimizer is near a minimum, a small learning rate enables fine-grained navigation without overshooting. Learning rate warmup — starting small, increasing, then decaying — mirrors the changing geometric character of the landscape during training.
Batch size effects. Larger mini-batches reduce gradient noise, which can help convergence speed but at the cost of finding sharper minima with worse generalization. The "linear scaling rule" (Goyal et al., 2017) advocates increasing the learning rate proportionally with batch size, which approximately preserves the noise-to-curvature ratio and thus the implicit regularization of SGD.
Initialization. The choice of initialization determines the starting point on the loss landscape. Random initialization must break symmetry (to ensure different neurons learn different features) while avoiding vanishing or exploding gradients (which correspond to extreme curvature anisotropy in early layers). He initialization and Xavier initialization are calibrated to maintain the variance of activations across layers, keeping the optimizer in a well-behaved region of the landscape from the first iteration.
Regularization and landscape geometry. Weight decay (L2 regularization) adds a term $\frac{\lambda}{2}\|\theta\|^2$ to the loss, which increases the curvature of the landscape (adding $\lambda I$ to the Hessian). This can improve conditioning by reducing the condition number, and it implicitly penalizes sharp minima by preferring small-norm parameters.
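A two-line check of that Hessian shift, with illustrative numbers: adding $\lambda I$ to an ill-conditioned Hessian sharply reduces its condition number.

```python
import numpy as np

# Weight decay adds (lam/2) * ||theta||^2 to the loss, hence lam * I to the Hessian.
H = np.diag([100.0, 0.01])
lam = 0.1

print("condition number without weight decay:", np.linalg.cond(H))                   # 10000
print("condition number with weight decay:   ", np.linalg.cond(H + lam * np.eye(2))) # ~910
```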
Gradient descent is, at its core, a geometric algorithm. Its behavior — whether it converges, how fast, to what kind of point — is determined by the geometry of the loss landscape: the curvature encoded in the Hessian, the structure of critical points, the topology of level sets, and the global shape of valleys and plateaus.
In the low-dimensional setting of classical optimization, these geometric ideas were well understood by the mid-twentieth century. What makes deep learning different is the extraordinary dimension of the parameter space, which creates qualitatively new phenomena: an exponential proliferation of saddle points, a connected set of near-equivalent global minima, a Hessian spectrum dominated by a near-zero bulk, and new geometric structure emerging at the level of the full training trajectory.
Several key insights are now well established:
Saddle points, not local minima, are the primary obstacle in high-dimensional optimization. Most critical points in large neural networks are saddle points; true local minima are rare and, when they exist, tend to cluster near the global minimum.
Curvature anisotropy (ill-conditioning) is pervasive in deep networks and is the main reason fixed-learning-rate gradient descent can be slow. Adaptive methods and preconditioning strategies can partially compensate, at the cost of generalization or computational overhead.
Flat minima generalize better than sharp minima — a widely supported empirical finding that connects the geometry of the landscape to statistical learning theory.
Noise helps. The stochasticity of mini-batch gradients, far from being a nuisance, is a crucial feature: it provides the perturbations needed to escape saddle points and implicitly regularizes toward flatter, better-generalizing minima.
Looking forward, the geometry of neural network optimization remains a rich and open subject. A full theoretical understanding of why deep learning works as well as it does — why gradient descent reliably finds solutions that generalize, despite the non-convexity and high dimensionality of the problem — remains one of the central open questions in machine learning theory. The tools to answer it will come from differential geometry, algebraic topology, random matrix theory, and statistical physics, applied together to the problem of navigating the extraordinary landscapes carved by deep networks.
Understanding this geometry is not merely an academic exercise. As neural networks grow larger and are deployed in higher-stakes settings, the ability to reason rigorously about optimization dynamics — when will training succeed, why did it fail, how can we make it more reliable — becomes practically essential. The geometry of gradient descent is the mathematics of modern AI.
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer Series in Operations Research. Springer.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems (NeurIPS), 31.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On large-batch training for deep learning: Generalization gap and sharp minima. International Conference on Learning Representations (ICLR).
Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., & Jordan, M. I. (2017). How to escape saddle points efficiently. International Conference on Machine Learning (ICML), PMLR 70.
Ge, R., Huang, F., Jin, C., & Yuan, Y. (2015). Escaping from saddle points — online stochastic gradient for tensor decomposition. Conference on Learning Theory (COLT).
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2015). The loss surfaces of multilayer networks. Artificial Intelligence and Statistics (AISTATS), PMLR 38.
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems (NeurIPS), 31.
Ghorbani, B., Krishnan, S., & Xiao, Y. (2019). An investigation into neural net optimization via Hessian eigenvalue density. International Conference on Machine Learning (ICML), PMLR 97.
Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems (NeurIPS), 27.