AI Summary
Although chain-of-thought (CoT) reasoning benefits large language models (LLMs) on long-context inference tasks, conventional outcome-based supervision cannot detect flaws in the intermediate steps of multi-step reasoning. To address this, the paper proposes LongRePS, a framework for process-level supervision of long-context reasoning. LongRePS introduces a context-aware quality assessment protocol for reasoning paths tailored to long texts, combining a self-sampling mechanism that bootstraps candidate paths from the model itself with fine-grained, stepwise checks on those paths. Departing from outcome-oriented training, LongRePS achieves substantial improvements: up to +13.6 points on MuSiQue and average cross-domain question-answering gains of up to +9.3 points. These results demonstrate markedly better long-context information aggregation and generalizable reasoning. The core contributions are (1) the first process-supervision paradigm designed specifically for long-context reasoning, and (2) a scalable, principled framework for path-level quality evaluation.
Abstract
Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness for long-context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT's benefits generalize across most long-context scenarios and amplify with increasing context length. Motivated by this critical observation, we propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data, and trained models are publicly available to facilitate future research.
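The self-sampling and quality-assessment loop described in the abstract can be sketched schematically. This is a minimal illustration, not the paper's actual implementation: the `model` callable, the dict-shaped reasoning paths, and the substring-based grounding check are all simplifying assumptions standing in for an LLM sampler and the paper's full quality protocol.

```python
def sample_paths(model, question, context, n=8):
    """Self-sampling: draw n candidate reasoning paths from the model itself."""
    return [model(question, context) for _ in range(n)]

def assess_path(path, context, gold_answer):
    """Toy quality check standing in for the paper's protocol: the final answer
    must be correct, and every fact the path cites must actually appear in the
    long input context (a crude grounding test)."""
    if path["answer"] != gold_answer:
        return False
    return all(fact in context for fact in path["cited_facts"])

def build_sft_data(model, examples, n=8):
    """Keep, per example, the first sampled path that passes the check; the
    surviving (question, path) pairs form the process-supervised training set."""
    data = []
    for ex in examples:
        for path in sample_paths(model, ex["question"], ex["context"], n):
            if assess_path(path, ex["context"], ex["answer"]):
                data.append({"question": ex["question"], "path": path})
                break
    return data
```

The key design point, mirroring the abstract, is that supervision targets the reasoning path itself rather than only the final answer: a path with a lucky correct answer but ungrounded citations would be rejected here, which outcome-only supervision cannot do.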