Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of large language models at inference time: autoregressive generation assigns exponentially decaying sampling probabilities to long reasoning trajectories, producing a “shallow exploration trap” that hinders effective in-context exploration. To overcome this, the authors propose a reinforcement learning framework that combines a length-based reward with a redundancy penalty, which, building on state coverage theory, encourages longer and more diverse reasoning paths. This work is presented as the first integration of state coverage principles into in-context exploration. Evaluated on mainstream models including Qwen3 and Llama, the method significantly enhances exploratory capability, yielding average gains of 4.4% on in-domain tasks and 2.7% on cross-domain benchmarks, improving both reasoning depth and generalization.

📝 Abstract
Achieving effective test-time scaling requires models to engage in In-Context Exploration — the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the “Shallow Exploration Trap”. To bridge this gap, we propose Length-Incentivized Exploration. This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that our method effectively incentivizes in-context exploration. As a result, it achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
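The abstract describes the recipe only at a high level: a length-based reward to counteract the exponentially decaying probability of long trajectories, coupled with a redundancy penalty so the extra length adds genuinely new reasoning states. The paper's exact reward formula is not given here; below is a minimal hypothetical sketch of that shaping idea, assuming a linear length bonus capped at a token budget and an n-gram repetition rate as the redundancy proxy (all function names, coefficients `alpha`/`beta`, and the n-gram choice are illustrative assumptions, not the authors' implementation):

```python
def ngram_repetition_rate(tokens, n=4):
    """Fraction of n-grams that are repeats: a simple redundancy proxy."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

def shaped_reward(task_reward, tokens, max_len=8192, alpha=0.1, beta=0.5, n=4):
    """Hypothetical length-incentivized shaping (not the paper's exact reward).

    - length bonus: grows linearly with trajectory length, capped at max_len,
      offsetting the sampling bias against long reasoning trajectories;
    - redundancy penalty: scales with the n-gram repetition rate, so longer
      outputs are only rewarded when they cover new states rather than filler.
    """
    length_bonus = alpha * min(len(tokens), max_len) / max_len
    redundancy_penalty = beta * ngram_repetition_rate(tokens, n)
    return task_reward + length_bonus - redundancy_penalty
```

A trajectory that pads its length by looping over the same phrases raises its repetition rate and loses more from the penalty than it gains from the bonus, which matches the paper's stated two-step goal of maximizing state coverage rather than raw length.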
Problem

Research questions and friction points this paper is trying to address.

In-Context Exploration
State Coverage
Shallow Exploration Trap
Test-Time Scaling
Reasoning Trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Exploration
Length-Incentivized Reinforcement Learning
State Coverage
Shallow Exploration Trap
Autoregressive Generation