AI Summary
Post-training often induces output convergence in large language models (LLMs), trading generative diversity for accuracy and thereby degrading performance on creative and exploratory tasks. To address this, we propose DARLING (Diversity-Aware Reinforcement Learning), a framework that jointly optimizes generation quality and semantic diversity within an online reinforcement learning (RL) loop. DARLING introduces a learned semantic partition function that measures diversity beyond surface-level lexical variation and exposes it as an explicit reward signal; crucially, the method requires no human annotations or predefined categories. Evaluation across multiple model families and sizes demonstrates its effectiveness: on non-verifiable tasks (instruction following and creative writing), it surpasses quality-only RL baselines in both quality and novelty; on verifiable tasks (competition math), it achieves significant improvements in both pass@1 and pass@k, jointly enhancing quality and diversity within online RL.
Abstract
Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and reduces the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are simultaneously of higher quality and novelty. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests itself as higher-quality responses.
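To make the core mechanism concrete, here is a minimal sketch of how a diversity signal derived from a semantic partition could be blended with a quality reward for a batch of sampled responses. The cluster assignments stand in for the output of DARLING's learned partition function, and the multiplicative combination and the `weight` parameter are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def diversity_bonuses(cluster_ids):
    """Per-response diversity bonus: responses falling in a rarer semantic
    cluster within the sampled batch earn a larger bonus (1 / cluster size).
    In DARLING the cluster ids would come from the learned partition
    function; here they are plain integers for illustration."""
    counts = Counter(cluster_ids)
    return [1.0 / counts[c] for c in cluster_ids]

def combined_rewards(qualities, cluster_ids, weight=1.0):
    """Blend each quality score with its diversity bonus before the RL
    update. The multiplicative form and weight are assumed choices."""
    bonuses = diversity_bonuses(cluster_ids)
    return [q * (1.0 + weight * b) for q, b in zip(qualities, bonuses)]

# Four sampled responses: two share a semantic cluster, two are unique.
qualities = [0.9, 0.8, 0.7, 0.6]
clusters = [0, 0, 1, 2]
rewards = combined_rewards(qualities, clusters)
```

Under this toy scheme, the third response (quality 0.7, unique cluster) ends up with a higher combined reward than the first (quality 0.9, duplicated cluster), illustrating how explicit diversity credit can redirect optimization pressure toward distinct high-quality outputs.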