Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning (RL)-based fine-tuning of large language models (LLMs) for reasoning tasks, optimizing the reverse KL divergence severely compromises output diversity. To address this, we propose an explicit target-distribution modeling framework grounded in the α-divergence family: candidate reasoning paths are generated by a pretrained LLM and filtered to construct a target distribution that jointly balances correctness and diversity; the α-divergence then serves as a unified objective trading off precision (mode-seeking) against coverage (mass-covering), enabling flexible control over the precision–coverage spectrum. Evaluated on a Lean theorem-proving benchmark, our method traces a Pareto-optimal frontier in the coverage–precision trade-off, notably improving coverage by +12.7% while preserving precision. Our approach establishes a principled, divergence-based paradigm for enhancing reasoning diversity in LLMs.

📝 Abstract
Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in this way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the $α$-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences. On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage-precision Pareto frontier, outperforming all prior methods on the coverage axis.
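The filtering step described in the abstract can be sketched in a few lines: zero out the mass of incorrect candidates and renormalize, so that the correct candidates keep their *relative* probabilities under the pretrained model. The probabilities and correctness labels below are hypothetical, purely for illustration.

```python
import numpy as np

# Hypothetical setup: a pretrained LLM assigns probabilities to sampled
# candidate reasoning paths, and a verifier labels each path correct or not.
model_probs = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
is_correct = np.array([True, False, True, True, False])

# Filter out incorrect paths and renormalize. The surviving (correct) paths
# keep the same relative weights they had under the model, which is exactly
# what distinguishes this explicit target from a mode-collapsed one.
target = np.where(is_correct, model_probs, 0.0)
target = target / target.sum()

print(target)  # incorrect paths get probability 0; correct ones are rescaled
```

Because the renormalization is uniform across correct paths, a path that was twice as likely as another correct path before filtering remains twice as likely in the target.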
Problem

Research questions and friction points this paper is trying to address.

Addresses diversity loss in LLMs from RL tuning
Proposes filtering out incorrect answers while preserving the relative probabilities of correct ones
Uses α-divergence to balance precision and diversity trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

Filtering out incorrect answers to define an explicit target distribution
Using α-divergence to control precision-diversity trade-off
Achieving the coverage-precision Pareto frontier on a Lean theorem-proving benchmark
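The α-divergence knob mentioned above can be made concrete with one common (Amari-style) parametrization; the paper may use a different convention, so treat this as an illustrative sketch, not the authors' exact objective. The key property it shows: α → 1 recovers the mass-covering forward KL(p‖q), α → 0 recovers the mode-seeking reverse KL(q‖p), and intermediate α interpolates between them.

```python
import numpy as np

def alpha_divergence(p, q, alpha, eps=1e-12):
    """Amari-style alpha-divergence D_alpha(p || q) between discrete
    distributions. alpha -> 1 gives forward KL(p||q) (mass-covering);
    alpha -> 0 gives reverse KL(q||p) (mode-seeking)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)))
    if np.isclose(alpha, 0.0):
        return float(np.sum(q * np.log(q / p)))
    return float((1.0 - np.sum(p**alpha * q**(1.0 - alpha)))
                 / (alpha * (1.0 - alpha)))

# Illustrative numbers: a broad target p (many correct paths) vs. a
# peaked model q that has collapsed onto one mode.
p = [0.5, 0.3, 0.2]
q = [0.9, 0.05, 0.05]
for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"alpha={a}: {alpha_divergence(p, q, a):.4f}")
```

Minimizing this objective at small α penalizes q for placing mass where p has none (precision), while larger α penalizes q for missing regions of p's support (coverage), which is the trade-off the bullets describe.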