🤖 AI Summary
This work identifies and addresses a critical issue in large reasoning models: the collapse of exploration capability during reinforcement-learning post-training, which undermines temperature-based sampling and prevents improvements in pass@$n$ accuracy. To remedy this, the authors propose Latent Exploration Decoding (LED), a decoding strategy that requires no additional training or model parameters. LED restores diversity at inference time by aggregating the posterior distributions of intermediate layers and selecting the depth configurations with the highest entropy. Evaluated across multiple reasoning benchmarks and model architectures, LED consistently improves performance, yielding average gains of 0.61 and 1.03 percentage points in pass@1 and pass@16 accuracy, respectively, thereby balancing reasoning performance with exploratory capacity.
📄 Abstract
Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posteriors of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED improves pass@1 and pass@16 accuracy by an average of 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: https://GitHub.com/Xiaomi-Research/LED.
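The cumulative-sum aggregation and max-entropy depth selection described above can be sketched for a single decoding step as follows. This is a minimal illustration, not the authors' implementation: it assumes intermediate posteriors are obtained by projecting each layer's hidden state through the unembedding matrix (a logit-lens-style readout), and the function name and renormalization details are our own.

```python
import numpy as np

def led_next_token_distribution(layer_logits, temperature=1.0):
    """Sketch of one LED decoding step.

    layer_logits: array of shape (num_layers, vocab_size), assumed to be
    the logit-lens readout of each intermediate layer (an assumption;
    the paper's exact readout may differ).
    Returns the max-entropy depth-aggregated distribution and its depth index.
    """
    # Per-layer posteriors via a numerically stable softmax.
    z = layer_logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Aggregate intermediate posteriors via cumulative sum over depth,
    # renormalizing so each depth prefix forms a valid distribution.
    cum = np.cumsum(probs, axis=0)
    cum /= cum.sum(axis=-1, keepdims=True)

    # Entropy of each depth-aggregated distribution.
    ent = -(cum * np.log(cum + 1e-12)).sum(axis=-1)

    # Select the depth configuration with maximal entropy as the
    # exploration candidate; the next token would be sampled from it.
    best = int(np.argmax(ent))
    return cum[best], best
```

Intuitively, because final-layer posteriors are low-entropy after RL post-training while intermediate layers stay higher-entropy, the aggregated distribution at the selected depth is flatter than the final-layer one, restoring sampling diversity without any extra parameters.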