Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

📅 2026-03-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of current language models on tasks with multiple valid solutions or irreducible uncertainty, such as medical diagnosis and ambiguous question answering: models typically commit to a single answer and fail to represent the full distribution over answers. The authors propose a multi-answer reinforcement learning framework that modifies the RL objective so that the model explicitly generates multiple candidate answers. A trained model produces several plausible answers, each with a confidence estimate, in a single forward pass, internalizing aspects of inference-time search. Without repeated sampling, the method improves diversity, coverage, and set-level calibration over single-answer baselines on question-answering, medical diagnostic, and coding benchmarks. It also needs fewer tokens to produce multiple answers than competing approaches and is substantially more accurate on coding tasks.

📝 Abstract

Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.
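
To make "coverage" and "set-level calibration" concrete, the sketch below scores a set of candidate answers, each with a stated confidence, against a set of acceptable answers. It is an illustration only and not the paper's objective: the names `Candidate`, `coverage`, `set_brier`, and `multi_answer_reward`, the exact-match comparison after normalization, the Brier-style penalty, and the `calibration_weight` parameter are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One generated answer plus the model's stated probability of being correct."""
    answer: str
    confidence: float


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def coverage(candidates, gold_answers) -> float:
    """Fraction of distinct gold answers matched by at least one candidate."""
    gold = {normalize(g) for g in gold_answers}
    if not gold:
        return 0.0
    predicted = {normalize(c.answer) for c in candidates}
    return len(gold & predicted) / len(gold)


def set_brier(candidates, gold_answers) -> float:
    """Mean squared gap between each stated confidence and 0/1 correctness."""
    if not candidates:
        return 1.0  # assumption: an empty answer set gets the worst score
    gold = {normalize(g) for g in gold_answers}
    errors = [
        (c.confidence - float(normalize(c.answer) in gold)) ** 2
        for c in candidates
    ]
    return sum(errors) / len(errors)


def multi_answer_reward(candidates, gold_answers, calibration_weight=0.5) -> float:
    """Hypothetical set-level reward: coverage minus a calibration penalty."""
    # Keep the first occurrence of each answer so repeats cannot inflate the score.
    unique = {}
    for c in candidates:
        unique.setdefault(normalize(c.answer), c)
    deduped = list(unique.values())
    return coverage(deduped, gold_answers) - calibration_weight * set_brier(
        deduped, gold_answers
    )


if __name__ == "__main__":
    candidates = [
        Candidate("Pneumonia", 0.6),
        Candidate("bronchitis", 0.3),
        Candidate("asthma", 0.1),
    ]
    gold = ["pneumonia", "Bronchitis"]
    print(coverage(candidates, gold))
    print(set_brier(candidates, gold))
    print(multi_answer_reward(candidates, gold))
```

A score of this shape rewards a model for listing more of the acceptable answers while penalizing confidences that do not match correctness, which is the intuition behind evaluating the answer set as a whole rather than a single top answer.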

Problem

Research questions and friction points this paper is trying to address.

distributional reasoning

multiple answers

language models

uncertainty

reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-answer reinforcement learning

distributional reasoning

language models