Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models

📅 2026-02-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work identifies and addresses a critical issue in large reasoning models: the collapse of exploration capability during reinforcement-learning-based post-training, which undermines temperature-based sampling and prevents gains in pass@$n$ accuracy. To remedy this, the authors propose Latent Exploration Decoding (LED), a decoding strategy that requires no additional training or model parameters. LED restores diversity at inference time by measuring the entropy of the posterior distributions at intermediate layers and aggregating high-entropy depth configurations. Evaluated across multiple reasoning benchmarks and model architectures, LED consistently improves performance, yielding average gains of 0.61 and 1.03 percentage points in pass@1 and pass@16 accuracy, respectively, thereby balancing reasoning performance with exploratory capacity.

πŸ“ Abstract
Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posteriors of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: https://GitHub.com/Xiaomi-Research/LED.
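The decoding step the abstract describes — form a posterior at each intermediate depth, aggregate posteriors over depth via cumulative sum, and sample from the maximal-entropy depth configuration — can be sketched as follows. This is a minimal NumPy sketch under assumptions, not the paper's implementation: it presumes per-layer next-token logits are available (e.g. by projecting each intermediate hidden state through the unembedding, logit-lens style), and the function name `led_decode_step` and its exact aggregation details are illustrative.

```python
import numpy as np

def led_decode_step(layer_logits, rng=None):
    """One hypothetical LED-style decoding step.

    layer_logits: array of shape (num_layers, vocab_size), the next-token
    logits read out at each intermediate layer (an assumption; the paper's
    exact readout may differ).
    Returns (sampled_token, chosen_depth, per_depth_entropy).
    """
    if rng is None:
        rng = np.random.default_rng(0)

    # Per-layer posteriors via a numerically stable softmax.
    z = layer_logits - layer_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Aggregate intermediate posteriors via cumulative sum over depth,
    # renormalizing so each depth-prefix is a valid distribution.
    agg = np.cumsum(probs, axis=0)
    agg /= agg.sum(axis=-1, keepdims=True)

    # Entropy of each depth-aggregated posterior.
    ent = -(agg * np.log(agg + 1e-12)).sum(axis=-1)

    # Select the depth configuration with maximal entropy as the
    # exploration candidate, and sample the next token from it.
    best = int(np.argmax(ent))
    token = int(rng.choice(agg.shape[-1], p=agg[best]))
    return token, best, ent
```

In this sketch a flat (high-entropy) early-layer posterior will dominate a sharply peaked final-layer one, which is how the method would recover sampling diversity when the last layer has collapsed to near-deterministic outputs.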
Problem

Research questions and friction points this paper is trying to address.

exploration collapse
Large Reasoning Models
post-training
entropy reduction
temperature-based sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Exploration Decoding
Exploration Collapse
Entropy Asymmetry
Large Reasoning Models
Depth-Conditioned Decoding