Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence

📅 2025-04-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies KL-divergence minimization over the probability simplex, focusing on the convergence behavior of gradient algorithms under the information-geometric framework, specifically in exponential-family (θ) and mixture-family (η) coordinates. Methodologically, it analyzes both continuous- and discrete-time natural gradient descent (NGD). Theoretically, it shows that in continuous time NGD converges at a fixed rate of 2, with the rates of Euclidean gradient descent in the θ and η coordinates serving as lower and upper bounds, respectively; under affine reparameterizations of the dual coordinates the Euclidean rates can be rescaled to 2/c and 2c for any c > 0, while the NGD rate remains invariant at 2. The analysis rests on bounding the spectrum and condition number of the Hessian of the KL divergence at the optimum, which coincides with the Fisher information matrix. In discrete time, NGD achieves faster convergence and greater robustness to noise, and experiments confirm that it outperforms the coordinate-dependent Euclidean methods in both convergence speed and stability.
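To make the setup concrete, below is a minimal sketch (not the paper's code) of the two updates the summary compares, for a categorical distribution in exponential-family (θ) coordinates. The target distribution, step size, and helper names such as `fisher` and `euclidean_grad` are illustrative choices, and taking the loss as KL(p_target || p_θ) is an assumption about the direction of the divergence.

```python
# Sketch (illustrative, not the paper's code): natural gradient descent vs.
# Euclidean gradient descent for minimizing KL(p_target || p_theta) over the
# probability simplex, with p_theta a categorical distribution parameterized
# in exponential-family (theta) coordinates.
import numpy as np

rng = np.random.default_rng(0)
n = 5                                    # number of categories; theta lives in R^{n-1}
p_target = rng.dirichlet(np.ones(n))     # distribution we want to match

def probs(theta):
    """Map natural parameters theta in R^{n-1} to a point on the simplex."""
    logits = np.append(theta, 0.0)       # last coordinate pinned to 0
    z = np.exp(logits - logits.max())
    return z / z.sum()

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

def euclidean_grad(theta):
    """grad_theta KL(p_target || p_theta) = eta(theta) - eta*, the moment gap."""
    return probs(theta)[:-1] - p_target[:-1]

def fisher(theta):
    """Fisher information matrix; here it equals the Hessian of the loss."""
    p = probs(theta)[:-1]
    return np.diag(p) - np.outer(p, p)

theta_gd  = np.zeros(n - 1)              # Euclidean GD in theta coordinates
theta_ngd = np.zeros(n - 1)              # natural gradient descent
lr = 0.5
for _ in range(200):
    theta_gd  -= lr * euclidean_grad(theta_gd)
    g = euclidean_grad(theta_ngd)
    theta_ngd -= lr * np.linalg.solve(fisher(theta_ngd), g)   # NGD step

print("KL after GD :", kl(p_target, probs(theta_gd)))
print("KL after NGD:", kl(p_target, probs(theta_ngd)))
```

Because the Fisher information matrix here is exactly the Hessian of the loss, the NGD step amounts to a damped Newton step, which is one intuition for the faster discrete-time behavior described above.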

📝 Abstract
The Kullback-Leibler (KL) divergence plays a central role in probabilistic machine learning, where it commonly serves as the canonical loss function. Optimization in such settings is often performed over the probability simplex, where the choice of parameterization significantly impacts convergence. In this work, we study the problem of minimizing the KL divergence and analyze the behavior of gradient-based optimization algorithms under two dual coordinate systems within the framework of information geometry: the exponential family ($\theta$ coordinates) and the mixture family ($\eta$ coordinates). We compare Euclidean gradient descent (GD) in these coordinates with the coordinate-invariant natural gradient descent (NGD), where the natural gradient is a Riemannian gradient that incorporates the intrinsic geometry of the parameter space. In continuous time, we prove that the convergence rates of GD in the $\theta$ and $\eta$ coordinates provide lower and upper bounds, respectively, on the convergence rate of NGD. Moreover, under affine reparameterizations of the dual coordinates, the convergence rates of GD in $\eta$ and $\theta$ coordinates can be scaled to $2c$ and $\frac{2}{c}$, respectively, for any $c>0$, while NGD maintains a fixed convergence rate of $2$, remaining invariant to such transformations and sandwiched between them. Although this suggests that NGD may not exhibit uniformly superior convergence in continuous time, we demonstrate that its advantages become pronounced in discrete time, where it achieves faster convergence and greater robustness to noise, outperforming GD. Our analysis hinges on bounding the spectrum and condition number of the Hessian of the KL divergence at the optimum, which coincides with the Fisher information matrix.
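For orientation, the continuous-time dynamics compared in the abstract can be written as follows. This is a sketch of the standard gradient-flow formulation, not an excerpt from the paper, and reading "convergence rate $r$" as exponential decay of the loss along the flow is an assumption.

```latex
% Standard gradient-flow sketch (assumed formulation, not quoted from the paper).
% G(\theta) denotes the Fisher information matrix in \theta coordinates.
\begin{align*}
  \dot{\theta} &= -\,\nabla_{\theta}\,\mathrm{KL}
      && \text{Euclidean GD flow in } \theta \text{ coordinates} \\
  \dot{\eta}   &= -\,\nabla_{\eta}\,\mathrm{KL}
      && \text{Euclidean GD flow in } \eta \text{ coordinates} \\
  \dot{\theta} &= -\,G(\theta)^{-1}\nabla_{\theta}\,\mathrm{KL}
      && \text{natural gradient flow (coordinate-invariant)}
\end{align*}
% If a rate r means KL(t) <= e^{-r t} KL(0) along the flow, the abstract's claim reads
%   r_{theta-GD}  <=  r_{NGD} = 2  <=  r_{eta-GD},
% with an affine change of dual coordinates rescaling the two GD rates to 2/c and 2c
% while leaving the NGD rate fixed at 2.
```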
Problem

Research questions and friction points this paper is trying to address.

Analyzing convergence of gradient descent for KL divergence minimization
Comparing Euclidean and natural gradient descent in dual coordinates
Demonstrating NGD's robustness and faster convergence in discrete time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Natural gradient descent for KL divergence minimization
Comparison of GD in dual coordinate systems
NGD's noise robustness and faster convergence in discrete time