Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping

📅 2026-01-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of reinforcement learning in long-context reasoning, which stem from the scarcity of densely annotated reasoning data and the coarse-grained penalization of “nearly correct” trajectories. To overcome these challenges, the authors propose DeepReasonQA, a framework that synthesizes controllable multi-hop long-context question-answering data using knowledge graphs, and introduce LongPAS—a fine-grained process advantage shaping method that allocates credit based on the validity and relevance of individual reasoning steps. By effectively leveraging learning signals from near-optimal trajectories, the approach significantly outperforms the RLVR baseline across three long-context reasoning benchmarks, achieving state-of-the-art performance with fewer parameters while maintaining training stability.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs' short-context reasoning, but its performance degrades in long-context scenarios that require both precise grounding and robust long-range reasoning. We identify the "almost-there" phenomenon in long-context reasoning, where trajectories are largely correct but fail at the final step, and attribute this failure to two factors: (1) the lack of high reasoning density in long-context QA data that pushes LLMs beyond mere grounding toward sophisticated multi-hop reasoning; and (2) the loss of valuable learning signals during long-context RL training due to the indiscriminate penalization of partially correct trajectories with incorrect outcomes. To overcome this bottleneck, we propose DeepReasonQA, a KG-driven synthesis framework that controllably constructs high-difficulty, multi-hop long-context QA pairs with inherent reasoning chains. Building on this, we introduce Long-context Process Advantage Shaping (LongPAS), a simple yet effective method that performs fine-grained credit assignment by evaluating reasoning steps along Validity and Relevance dimensions, capturing critical learning signals from "almost-there" trajectories. Experiments on three long-context reasoning benchmarks show that our approach substantially outperforms RLVR baselines and matches frontier LLMs while using far fewer parameters. Further analysis confirms the effectiveness of our methods in strengthening long-context reasoning while maintaining stable RL training.
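To make the core idea concrete, here is a minimal sketch of step-level advantage shaping in the spirit of LongPAS as described in the abstract: each reasoning step is scored along Validity and Relevance, and those process scores are blended with the trajectory-level outcome reward so that an "almost-there" trajectory is not uniformly penalized. All names, the scoring rule, and the blending weight `beta` are assumptions for illustration; the paper's actual formulation may differ.

```python
from dataclasses import dataclass

@dataclass
class Step:
    validity: float   # in [0, 1]: is this reasoning step logically sound?
    relevance: float  # in [0, 1]: does it ground on evidence relevant to the question?

def shaped_advantages(steps, outcome_reward, beta=0.5):
    """Blend a trajectory-level outcome reward with per-step process scores.

    With beta=0 this reduces to outcome-only credit (coarse RLVR-style
    penalization); with beta>0, steps that are both valid and relevant
    retain positive credit even when the final answer is wrong.
    """
    advantages = []
    for step in steps:
        process_score = step.validity * step.relevance      # in [0, 1]
        # Map the process score to [-1, 1] so weak steps are still penalized.
        shaped = (1 - beta) * outcome_reward + beta * (2 * process_score - 1)
        advantages.append(shaped)
    return advantages

# An "almost-there" trajectory: two strong early steps, a failed final step.
steps = [Step(0.9, 1.0), Step(0.8, 0.9), Step(0.1, 0.2)]
adv = shaped_advantages(steps, outcome_reward=-1.0, beta=0.5)
```

Under this toy rule the valid, relevant early steps receive much milder penalties than the failed final step, instead of all three sharing the same negative outcome reward.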
Problem

Research questions and friction points this paper is trying to address.

long-context reasoning
reinforcement learning
multi-hop reasoning
reasoning density
credit assignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-context Reasoning
Reinforcement Learning with Verifiable Rewards
Process Advantage Shaping
Multi-hop QA Synthesis
Credit Assignment