🤖 AI Summary
This work addresses mode collapse in KL-regularized reinforcement learning (RL). We systematically analyze how forward and reverse KL divergences affect multimodal coverage of the target distribution, identifying regularization strength and reward scaling as key determinants of mode coverage. We propose a theoretically grounded, scalable algorithm that adaptively optimizes the target distribution solely by adjusting reward magnitude—without requiring auxiliary diversity signals. We validate our method on post-training tasks for both large language models (LLMs) and chemical language models (CLMs). Empirical results demonstrate significant improvements in generation quality and diversity under both forward- and reverse-KL settings. Crucially, our approach remains robust even under strong KL regularization or low reward scales—regimes where conventional methods fail—thereby overcoming the inherent diversity limitation of existing KL-regularized RL frameworks.
📝 Abstract
It is commonly believed that optimizing the reverse KL divergence results in "mode seeking", while optimizing the forward KL results in "mass covering", with the latter being preferred if the goal is to sample from multiple diverse modes. We show, both mathematically and empirically, that this intuition does not necessarily transfer to reinforcement learning with reverse/forward KL regularization (e.g., as commonly used with language models). Instead, the choice of reverse or forward KL determines the family of optimal target distributions, parameterized by the regularization coefficient. Mode coverage depends primarily on other factors, such as the regularization strength and the relative scales of rewards and reference probabilities. Further, we show that commonly used settings, such as low regularization strength and equal verifiable rewards, tend to specify unimodal target distributions, meaning the optimization objective is, by construction, non-diverse. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm. It makes minimal changes to reward magnitudes, yet optimizes for a target distribution that puts high probability on all high-quality sampling modes. In experiments, this simple modification post-trains both Large Language Models and Chemical Language Models to higher solution quality and diversity, without any external diversity signal, and it works with both forward and reverse KL in settings where naively using either fails.
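To make the role of the regularization coefficient concrete, the following is a minimal sketch for the reverse-KL case on a small finite set of outputs. It assumes the standard closed-form optimum of the objective E_pi[r] - beta * KL(pi || pi_ref), namely pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). That formula, the function name `reverse_kl_target`, and the toy numbers are illustrative assumptions, not taken from the paper, and the sketch is not the proposed algorithm.

```python
import math


def reverse_kl_target(ref_probs, rewards, beta):
    """Target distribution pi*(y) ~ pi_ref(y) * exp(r(y) / beta) on a finite set.

    Assumption: this is the textbook optimum of E_pi[r] - beta * KL(pi || pi_ref);
    it is used here only to illustrate the abstract's claims.
    """
    # Work in log space and subtract the max so small beta does not overflow.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    shift = max(logits)
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy problem: two correct answers with equal verifiable reward,
# one incorrect answer, and a reference policy that strongly prefers answer 0.
ref_probs = [0.90, 0.09, 0.01]
rewards = [1.0, 1.0, 0.0]

for beta in (1.0, 0.1, 0.01):
    target = reverse_kl_target(ref_probs, rewards, beta)
    print(beta, [round(p, 4) for p in target])
```

In this toy setting the two correct answers keep the reference ratio of 10:1 for every value of beta: lowering beta removes mass from the incorrect answer but never balances the correct ones, so the target stays dominated by a single mode. If the rewards differ even slightly, a small beta instead concentrates the target on the higher-reward answer, for example `reverse_kl_target([0.5, 0.5], [1.0, 0.9], 0.01)` puts almost all mass on the first answer.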