🤖 AI Summary
This work proposes a unified framework that derives several classical results on exponential families from one concise identity for the difference of two Kullback–Leibler (KL) divergences, combined with the non-negativity of KL divergence. Using only these basic properties and the algebraic structure of exponential families, the approach recovers key results (the three-point and multi-point identities, the Pythagorean theorem of information geometry, and the Gibbs variational principle) by direct substitution and rearrangement rather than by separate, heavier arguments. With additional standard analytic arguments, the framework also recovers the gradient formula for the log-partition function, the Bregman divergence representation of within-family KL divergence, and the surjectivity of the moment map. These findings highlight KL divergence as a unifying link between information geometry, convex duality, and variational inference.
📝 Abstract
Exponential families encompass the distributions central to modern machine learning -- softmax, Gaussians, and Boltzmann distributions -- and underlie the theory of variational inference, entropy-regularized reinforcement learning, and RLHF. We isolate a simple identity for exponential families that expresses the KL difference $\mathrm{KL}(q \| p_{\lambda_2}) - \mathrm{KL}(q \| p_{\lambda_1})$ in terms of the log-partition function $A(\lambda)$ and the moment $\mu_q$. Remarkably, this identity together with the single fact that $\mathrm{KL} \geq 0$ (with equality iff $p = q$) suffices, by direct substitution and rearrangement, to derive a cluster of results that are classically obtained by separate, heavier arguments: a generalized three-point identity for arbitrary reference distributions, Pythagorean theorems for I-projections and reverse I-projections, convexity of the log-partition function, identification of its Legendre dual in KL terms, the Gibbs variational principle, and the explicit optimizer in KL-regularized reward maximization, including the exponential tilting formula underlying entropy-regularized control and RLHF. Beyond these purely algebraic consequences, standard analytic arguments recover the gradient formula for the log-partition function, the Bregman representation of within-family KL divergence, and the surjectivity of the moment map. The note is self-contained.
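As a rough numerical illustration of the kind of identity the abstract describes, the sketch below checks one standard form of the KL difference on a small finite state space. It assumes the usual parameterization $p_\lambda(x) = \exp(\langle \lambda, T(x) \rangle - A(\lambda))$ with counting base measure and $\mu_q = \mathbb{E}_q[T(x)]$, under which $\mathrm{KL}(q \| p_{\lambda_2}) - \mathrm{KL}(q \| p_{\lambda_1}) = A(\lambda_2) - A(\lambda_1) - \langle \lambda_2 - \lambda_1, \mu_q \rangle$. The paper's exact statement and sign conventions may differ, and all names here (`T`, `log_partition`, `family_member`, `kl`) are illustrative rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 6 states, 3-dimensional sufficient statistic T(x).
n_states, dim = 6, 3
T = rng.normal(size=(n_states, dim))  # one row of T per state


def log_partition(lam):
    """A(lam) = log sum_x exp(<lam, T(x)>), computed stably."""
    z = T @ lam
    m = z.max()
    return m + np.log(np.exp(z - m).sum())


def family_member(lam):
    """Probability vector of p_lam over the finite state space."""
    return np.exp(T @ lam - log_partition(lam))


def kl(q, p):
    """KL(q || p) for probability vectors, skipping zero-mass states of q."""
    mask = q > 0
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))))


lam1, lam2 = rng.normal(size=dim), rng.normal(size=dim)
q = rng.dirichlet(np.ones(n_states))  # arbitrary reference distribution
mu_q = q @ T                          # moment of q

# KL difference versus its closed form in A and mu_q.
lhs = kl(q, family_member(lam2)) - kl(q, family_member(lam1))
rhs = log_partition(lam2) - log_partition(lam1) - (lam2 - lam1) @ mu_q

# Special case q = p_lam1: the difference is KL(p_lam1 || p_lam2) >= 0,
# which is the first-order convexity inequality for A at lam1.
p1 = family_member(lam1)
bregman = log_partition(lam2) - log_partition(lam1) - (lam2 - lam1) @ (p1 @ T)

print(abs(lhs - rhs), bregman, kl(p1, family_member(lam2)))
```

Choosing `q = p1` makes $\mathrm{KL}(q \| p_{\lambda_1})$ vanish, so the same formula gives the within-family KL divergence as a Bregman-type expression in $A$. Its non-negativity is the convexity statement mentioned in the abstract.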