🤖 AI Summary
This work addresses the challenge that existing reinforcement learning approaches struggle to effectively guide complex reasoning in long-context scenarios due to their reliance on sparse final rewards. To overcome this limitation, the authors propose LongR, a novel framework featuring an alternating “Think-and-Read” mechanism that interleaves reasoning and context retrieval. LongR further introduces a dense utility reward based on relative information gain, enabling fine-grained supervision of the reasoning process. The framework is compatible with multiple reinforcement learning algorithms, including DAPO and GSPO, and yields consistent performance gains: a 9% improvement on LongBench v2 and notable improvements on both RULER and InfiniteBench. These results advance the efficiency and robustness of large language models on long-context reasoning tasks.
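The alternating “Think-and-Read” mechanism can be pictured as a simple control loop in which the model either emits a reasoning step, requests a document, or commits to an answer. The paper does not specify this interface, so the `policy_step` callback, the action labels, and the turn budget below are illustrative assumptions, not the authors' implementation:

```python
# Hedged sketch of an alternating "Think-and-Read" loop.
# `policy_step` and the ("think" / "read" / "answer") action format are
# hypothetical; the paper's actual interleaving protocol may differ.

def think_and_read(question, documents, policy_step, max_turns=8):
    """Interleave reasoning with selective context retrieval until the
    policy emits a final answer (or the turn budget runs out)."""
    scratchpad = []  # accumulated reasoning steps ("think" actions)
    read = []        # documents consulted so far ("read" actions)
    for _ in range(max_turns):
        action, payload = policy_step(question, scratchpad, read)
        if action == "think":
            scratchpad.append(payload)       # record a reasoning step
        elif action == "read":
            read.append(documents[payload])  # consult one more document
        else:  # "answer"
            return payload
    return None  # no answer within the turn budget
```

Keeping retrieval inside the loop, rather than reading the whole context up front, is what lets a dense per-step reward attach credit to each individual read.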
📝 Abstract
Reinforcement learning has emerged as a key driver of LLM reasoning. This capability is equally pivotal in long-context scenarios, such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain that quantifies the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses of the impact of reasoning-chain length on efficiency and of the model's robustness to distractors.
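One way to make a "relative information gain" reward concrete is to score each consulted document by how much it raises the policy's confidence in the gold answer relative to the state before reading it. The abstract does not give the exact formulation, so the `answer_log_prob` scorer and the difference-of-log-probabilities form below are assumptions for illustration only:

```python
# Hedged sketch: one plausible form of a dense "relative information gain"
# reward. All names and the toy scorer are hypothetical; the paper's actual
# reward may normalize or shape this signal differently.

def answer_log_prob(context: list[str], answer: str) -> float:
    """Stand-in for the policy's log-probability of the gold answer given
    the documents read so far. Toy scorer: each document mentioning the
    answer raises confidence by a fixed amount (illustrative only)."""
    relevant = sum(1 for doc in context if answer in doc)
    return -5.0 + 2.0 * relevant

def info_gain_reward(read_so_far: list[str], new_doc: str, answer: str) -> float:
    """Dense per-step reward: change in answer log-likelihood after
    consulting one more document, relative to the prior state."""
    before = answer_log_prob(read_so_far, answer)
    after = answer_log_prob(read_so_far + [new_doc], answer)
    return after - before
```

Under this toy scorer, a document that supports the answer earns a positive reward while a distractor earns zero, giving the fine-grained per-step supervision that a sparse outcome-only reward cannot.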