AI Summary
Optimizing LogSumExp functions with a large or infinite number of exponential terms faces two key challenges: expensive gradient computation and significant bias introduced by minibatch approximations. To address these, this paper proposes a novel convex approximation grounded in a new *f*-divergence, the safe KL divergence, obtained by a sound modification of the KL divergence in the dual. The approach introduces a tunable accuracy parameter, preserves the original problem's convexity, and eliminates the sampling bias of minibatch estimates; moreover, when applied to an $L$-smooth function bounded from below, the smoothness constant of the resulting objective scales linearly with $L$. The approximation can be optimized efficiently with stochastic gradient methods, achieving both theoretical guarantees and computational practicality. Experiments on distributionally robust optimization and continuous optimal transport demonstrate substantial improvements over state-of-the-art baselines, while mitigating the inherent numerical instability of the LogSumExp operator.
Abstract
The LogSumExp function, also known as the free energy, plays a central role in many important optimization problems, including entropy-regularized optimal transport and distributionally robust optimization (DRO). It is also the dual to the Kullback-Leibler (KL) divergence, which is widely used in machine learning. In practice, when the number of exponential terms inside the logarithm is large or infinite, optimization becomes challenging since computing the gradient requires differentiating every term. Previous approaches that replace the full sum with a small batch introduce significant bias. We propose a novel approximation to LogSumExp that can be efficiently optimized using stochastic gradient methods. This approximation is rooted in a sound modification of the KL divergence in the dual, resulting in a new $f$-divergence called the safe KL divergence. The accuracy of the approximation is controlled by a tunable parameter and can be made arbitrarily small. Like the LogSumExp, our approximation preserves convexity. Moreover, when applied to an $L$-smooth function bounded from below, the smoothness constant of the resulting objective scales linearly with $L$. Experiments in DRO and continuous optimal transport demonstrate the advantages of our approach over state-of-the-art baselines and the effective treatment of numerical issues associated with the standard LogSumExp and KL.
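To illustrate the minibatch bias that motivates the paper, here is a small NumPy sketch. The setup (Gaussian scores, the batch size, and the rescaled naive estimator) is an illustrative assumption of mine, not the paper's construction: it shows that a log of an unbiased minibatch sum systematically underestimates the full LogSumExp, by Jensen's inequality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n "exponential terms" f_i; LogSumExp(f) = log(sum_i exp(f_i)).
n = 100_000
f = rng.normal(size=n)

# Exact value, computed stably via the max-shift trick.
m = f.max()
full_lse = m + np.log(np.exp(f - m).sum())

# Naive minibatch estimator: rescale a batch's exp-sum to the full size.
# The rescaled sum is unbiased, but log(.) is concave, so by Jensen's
# inequality E[log(X)] <= log(E[X]): the estimator is biased low on average,
# which is the bias the abstract refers to.
batch_size, trials = 64, 2000
estimates = []
for _ in range(trials):
    batch = rng.choice(f, size=batch_size, replace=False)
    mb = batch.max()
    estimates.append(mb + np.log(np.exp(batch - mb).sum() * (n / batch_size)))

print(f"full LogSumExp      : {full_lse:.4f}")
print(f"mean batch estimate : {np.mean(estimates):.4f}  (biased low)")
```

Larger batches shrink this Jensen gap but raise the per-step gradient cost, which is exactly the trade-off the proposed safe-KL approximation is designed to avoid.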