🤖 AI Summary
This work identifies and addresses a critical issue in large reasoning models: the collapse of exploration capability during reinforcement-learning post-training, which undermines temperature-based sampling and prevents improvements in pass@$n$ accuracy. To remedy this, the authors propose Latent Exploration Decoding (LED), a decoding strategy that requires no additional training or model parameters. LED restores diversity at inference time by aggregating the posterior distributions of intermediate layers and selecting the depth configurations with the highest entropy. Evaluated across multiple reasoning benchmarks and model architectures, LED consistently improves performance, yielding average gains of 0.61 and 1.03 percentage points in pass@1 and pass@16 accuracy, respectively, thereby balancing reasoning performance with exploratory capacity.
📄 Abstract
Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posteriors of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED improves pass@1 and pass@16 accuracy by an average of 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: https://GitHub.com/Xiaomi-Research/LED.
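The cumulative-sum aggregation and max-entropy depth selection described above can be sketched for a single decoding step as follows. This is a minimal illustration, not the authors' implementation: it assumes intermediate posteriors are obtained by projecting each layer's hidden state through the unembedding matrix (a logit-lens-style readout), and the function name and renormalization details are our own.

```python
import numpy as np

def led_next_token_distribution(layer_logits, temperature=1.0):
    """Sketch of one LED decoding step.

    layer_logits: array of shape (num_layers, vocab_size), assumed to be
    the logit-lens readout of each intermediate layer (an assumption;
    the paper's exact readout may differ).
    Returns the max-entropy depth-aggregated distribution and its depth index.
    """
    # Per-layer posteriors via a numerically stable softmax.
    z = layer_logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Aggregate intermediate posteriors via cumulative sum over depth,
    # renormalizing so each depth prefix forms a valid distribution.
    cum = np.cumsum(probs, axis=0)
    cum /= cum.sum(axis=-1, keepdims=True)

    # Entropy of each depth-aggregated distribution.
    ent = -(cum * np.log(cum + 1e-12)).sum(axis=-1)

    # Select the depth configuration with maximal entropy as the
    # exploration candidate; the next token would be sampled from it.
    best = int(np.argmax(ent))
    return cum[best], best
```

Intuitively, because final-layer posteriors are low-entropy after RL post-training while intermediate layers stay higher-entropy, the aggregated distribution at the selected depth is flatter than the final-layer one, restoring sampling diversity without any extra parameters.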