Tail-Aware Information-Theoretic Generalization for RLHF and SGLD

📅 2026-04-12

🤖 AI Summary

This work addresses a limitation of classical information-theoretic generalization bounds: they rely on light-tailed (bounded or sub-Gaussian) assumptions and fail under heavy-tailed losses or rewards, which commonly arise in methods such as RLHF and SGLD. The paper introduces a tail-aware information-theoretic framework for sub-Weibull data with any tail exponent θ, centered on an f_θ-divergence built from a shifted logarithm. It establishes explicit comparisons between this divergence and Rényi divergence, and combines maximal inequalities for sub-Weibull processes with Dudley-type chaining to derive a chaining bound based on multiscale Rényi mutual information. This removes the light-tail restriction and yields expected and high-probability PAC-Bayes generalization guarantees for Rényi-regularized RLHF with heavy-tailed rewards and for SGLD under heavy-tailed gradient noise.

📝 Abstract

Classical information-theoretic generalization bounds typically control the generalization gap through KL-based mutual information and therefore rely on boundedness or sub-Gaussian tails via the moment generating function (MGF). In many modern pipelines, such as robust learning, RLHF, and stochastic optimization, losses and rewards can be heavy-tailed, and MGFs may not exist, rendering KL-based tools ineffective. We develop a tail-dependent information-theoretic framework for sub-Weibull data, where the tail parameter $θ$ controls the tail heaviness: $θ=2$ corresponds to sub-Gaussian, $θ=1$ to sub-exponential, and $0<θ<1$ to genuinely heavy tails. Our key technical ingredient is a decorrelation lemma that bounds change-of-measure expectations using a shifted-log $f_θ$-divergence, which admits explicit comparisons to Rényi divergence without MGF arguments. On the empirical-process side, we establish sharp maximal inequalities and a Dudley-type chaining bound for sub-Weibull processes with tail index $θ$, with complexity scaling as $\log^{1/θ}$ and entropy$^{1/θ}$. These tools yield expected and high-probability PAC-Bayes generalization bounds, as well as an information-theoretic chaining inequality based on multiscale Rényi mutual information. We illustrate the consequences in Rényi-regularized RLHF under heavy-tailed rewards and in stochastic gradient Langevin dynamics with heavy-tailed gradient noise.
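To make the role of the tail parameter concrete, the following is a minimal simulation sketch, not taken from the paper. It assumes the common sub-Weibull convention $P(|X| > t) \le 2\exp(-(t/K)^{θ})$, which matches the abstract's labeling ($θ=2$ sub-Gaussian, $θ=1$ sub-exponential), and uses the simplest example of such a variable, $X = E^{1/θ}$ with $E \sim \mathrm{Exp}(1)$, so that $P(X > t) = \exp(-t^{θ})$ exactly. The helper names `sample_weibull_tail` and `expected_max` are hypothetical. The script compares a Monte Carlo estimate of the expected maximum of $n$ draws against the reference rate $\log^{1/θ} n$, the kind of scaling the abstract attributes to its maximal inequalities.

```python
import math
import random


def sample_weibull_tail(theta, rng):
    """Draw X >= 0 with P(X > t) = exp(-t**theta).

    Uses X = E**(1/theta) with E ~ Exp(1). This is an illustrative
    sub-Weibull variable, not the paper's general setting.
    """
    return rng.expovariate(1.0) ** (1.0 / theta)


def expected_max(theta, n, trials, rng):
    """Monte Carlo estimate of E[max of n i.i.d. draws]."""
    total = 0.0
    for _ in range(trials):
        total += max(sample_weibull_tail(theta, rng) for _ in range(n))
    return total / trials


rng = random.Random(0)  # fixed seed so the run is repeatable
for theta in (2.0, 1.0, 0.5):
    for n in (10, 100, 1000):
        est = expected_max(theta, n, trials=300, rng=rng)
        ref = math.log(n) ** (1.0 / theta)  # reference rate log^{1/theta}(n)
        print(f"theta={theta:.1f}  n={n:4d}  E[max]~{est:8.3f}  "
              f"log(n)^(1/theta)={ref:8.3f}")
```

In this toy setting the estimated maxima should track $\log^{1/θ} n$ up to a modest constant factor, and the growth in $n$ becomes markedly faster as $θ$ drops below 1. This illustrates why bounds tuned to sub-Gaussian tails can be too optimistic for heavy-tailed losses.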

Problem

Research questions and friction points this paper is trying to address.

heavy-tailed

generalization bounds

information-theoretic

RLHF

SGLD

Innovation

Methods, ideas, or system contributions that make the work stand out.

sub-Weibull

tail-aware generalization

f-divergence

PAC-Bayes

Rényi mutual information
