Better Estimation of the KL Divergence Between Language Models

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high variance and frequent negative values of Monte Carlo estimates of the KL divergence between language models. We propose the first Rao–Blackwellized estimator of both the KL divergence and its gradient: it is unbiased and provably has variance no greater than that of the standard Monte Carlo estimator. In sentiment-controlled fine-tuning experiments, our estimator substantially reduces estimation variance and eliminates negative KL estimates entirely. When its gradient counterpart is used in RLHF training, optimization becomes markedly more stable, and the resulting policies more consistently lie on the Pareto frontier between reward maximization and KL regularization. These results validate both the theoretical guarantees, unbiasedness and variance reduction, and the practical utility of the estimator in alignment training.

📝 Abstract
Estimating the Kullback--Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to the use of sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance, and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao--Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially in practice. Additionally, we derive an analogous Rao--Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.
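The contrast the abstract draws can be made concrete with a toy sketch. The code below is not from the paper: it uses a hypothetical pair of "language models" (made-up token-transition matrices over a tiny vocabulary) to compare the standard Monte Carlo estimator, which averages a single sampled log-ratio per sequence, against a step-wise Rao–Blackwellized estimator in the spirit the abstract describes, which replaces each token's sampled log-ratio with the exact KL between the two models' next-token conditionals. The RB estimate is a sum of per-step KLs, so it can never be negative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, N = 5, 4, 2000  # vocab size, sequence length, number of samples

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy autoregressive models: the next-token distribution depends only on the
# previous token (row index). A real LM conditions on the whole prefix.
P = softmax(rng.normal(size=(V, V)))
Q = softmax(rng.normal(size=(V, V)))

def mc_estimates(n):
    """Standard MC: sample x ~ p, return log p(x) - log q(x) per sample."""
    vals = []
    for _ in range(n):
        prev, lp, lq = 0, 0.0, 0.0  # token 0 acts as a fixed BOS symbol
        for _ in range(T):
            tok = rng.choice(V, p=P[prev])
            lp += np.log(P[prev, tok])
            lq += np.log(Q[prev, tok])
            prev = tok
        vals.append(lp - lq)
    return np.array(vals)

def rb_estimates(n):
    """Rao-Blackwellized: at each step, add the *exact* KL between the two
    next-token conditionals instead of a single-sample log-ratio."""
    vals = []
    for _ in range(n):
        prev, total = 0, 0.0
        for _ in range(T):
            total += np.sum(P[prev] * (np.log(P[prev]) - np.log(Q[prev])))
            prev = rng.choice(V, p=P[prev])
        vals.append(total)
    return np.array(vals)

mc, rb = mc_estimates(N), rb_estimates(N)
print(f"MC:  mean={mc.mean():.3f}  var={mc.var():.3f}  negative={np.mean(mc < 0):.1%}")
print(f"RB:  mean={rb.mean():.3f}  var={rb.var():.3f}  negative={np.mean(rb < 0):.1%}")
```

By the chain rule of KL divergence, both estimators are unbiased for the same sequence-level KL, but the RB variant only inherits randomness from the sampled prefix, which is the intuition behind the paper's variance-reduction guarantee.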
Problem

Research questions and friction points this paper is trying to address.

Estimating KL divergence between language models accurately
Reducing variance in sampling-based KL divergence estimators
Improving stability in KL divergence gradient estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rao–Blackwellized estimator reduces the variance of KL estimates
Unbiased, with variance provably no greater than the Monte Carlo estimator's
Rao–Blackwellized gradient estimator stabilizes RLHF training