Leveraging Flatness to Improve Information-Theoretic Generalization Bounds for SGD

📅 2026-01-04
🏛️ International Conference on Learning Representations
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insensitivity of existing information-theoretic generalization bounds to the flatness preference of stochastic gradient descent (SGD), which renders them overly loose and unable to reflect the generalization advantage conferred by flat minima. The authors propose a novel “omniscient trajectory” technique that explicitly incorporates SGD’s flatness bias into information-theoretic bounds by linking high-variance directions in the weight covariance with the local curvature of the loss landscape, yielding substantially tighter bounds. This approach overcomes the traditional blindness of such bounds to flatness and circumvents the trade-off between memorization and generalization. Theoretical analysis shows that, under convex, Lipschitz, and bounded settings, the new bound improves the typical Ω(1) rate to O(1/√n). Empirical results confirm that the bound accurately captures the generalization benefit of flat minima in deep networks and is significantly tighter in practice.

📝 Abstract
Information-theoretic (IT) generalization bounds have been used to study the generalization of learning algorithms. These bounds are intrinsically data- and algorithm-dependent, so one can exploit the properties of the data and algorithm to derive tighter bounds. However, we observe that although the flatness bias is crucial for SGD's generalization, these bounds fail to capture the improved generalization under better flatness and are also numerically loose. This is caused by the inadequate leverage of SGD's flatness bias in existing IT bounds. This paper derives a more flatness-leveraging IT bound for the flatness-favoring SGD. The bound indicates that the learned models generalize better if the large-variance directions of the final weight covariance have small local curvatures in the loss landscape. Experiments on deep neural networks show our bound not only correctly reflects the better generalization when flatness is improved, but is also numerically much tighter. This is achieved by a flexible technique called "omniscient trajectory". When applied to Gradient Descent's minimax excess risk on convex-Lipschitz-bounded problems, it improves representative IT bounds' $\Omega(1)$ rates to $O(1/\sqrt{n})$. It also implies a bypass of memorization-generalization trade-offs.
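The bound's key intuition, that models generalize better when the high-variance directions of the final weight covariance align with flat (low-curvature) directions of the loss landscape, can be illustrated with a toy numerical sketch. This is not the paper's actual bound; it uses a generic alignment proxy, $\mathrm{tr}(\Sigma H)$, where $\Sigma$ is the weight covariance and $H$ the loss Hessian, and all matrices below are made up for illustration:

```python
import numpy as np

def covariance_curvature_alignment(sigma, hessian):
    """Return tr(Sigma @ H): small when high-variance directions of the
    weight covariance Sigma lie along low-curvature (flat) directions
    of the loss Hessian H."""
    return float(np.trace(sigma @ hessian))

# Toy 2-D loss landscape: flat along axis 0 (curvature 0.1),
# sharp along axis 1 (curvature 10.0).
H = np.diag([0.1, 10.0])

# Weight covariance with most variance along the flat direction...
sigma_flat = np.diag([1.0, 0.01])
# ...versus most variance along the sharp direction.
sigma_sharp = np.diag([0.01, 1.0])

print(covariance_curvature_alignment(sigma_flat, H))   # 0.1*1.0 + 10.0*0.01 = 0.2
print(covariance_curvature_alignment(sigma_sharp, H))  # 0.1*0.01 + 10.0*1.0 = 10.001
```

The flat-aligned covariance yields a much smaller alignment value, mirroring the paper's claim that flatness along high-variance weight directions corresponds to better generalization.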
Problem

Research questions and friction points this paper is trying to address.

flatness
information-theoretic generalization bounds
SGD
generalization
loss landscape
Innovation

Methods, ideas, or system contributions that make the work stand out.

flatness
information-theoretic generalization bounds
SGD
omniscient trajectory
loss landscape curvature