Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision

πŸ“… 2025-02-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Conventional outcome-based supervision cannot detect flaws in the multi-step reasoning that long-context inference tasks demand of large language models (LLMs). Building on the observation that chain-of-thought (CoT) benefits generalize across long-context scenarios and grow with context length, this paper proposes LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths. LongRePS combines a self-sampling mechanism that bootstraps candidate reasoning paths from the model itself with a quality assessment protocol designed specifically for long-context scenarios, enabling fine-grained supervision over complex inference paths. Departing from outcome-oriented training, LongRePS yields substantial improvements over outcome supervision baselines: +13.6/+3.8 points (LLaMA/Qwen) on the in-domain MuSiQue benchmark and +9.3/+8.1 points on average across cross-domain QA tasks. The core contributions are (1) a process supervision paradigm designed for long-context reasoning, and (2) a scalable framework for reasoning-path quality evaluation.

πŸ“ Abstract
Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness for long-context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT's benefits generalize across most long-context scenarios and amplify with increasing context length. Motivated by this critical observation, we propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data and trained models are made public to facilitate future research.
Problem

Research questions and friction points this paper is trying to address.

Does Chain-of-Thought prompting actually help models reason over long input contexts?
Outcome-based supervision cannot detect flaws inside multi-step reasoning processes.
How can reasoning-path quality be supervised directly for long-context tasks?
Innovation

Methods, ideas, or system contributions that make the work stand out.

LongRePS framework enhances long-context reasoning.
Self-sampling mechanism improves reasoning path quality.
Quality assessment protocol tailored for long contexts.
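The self-sampling and quality-assessment loop described above can be sketched as a simple data-construction pipeline: sample several reasoning paths from the model itself, keep only those whose final answer is correct and whose cited facts are grounded in the context, and use the survivors as supervised training data. The sketch below is a minimal, hypothetical illustration (the `generate` callable, the `facts` field, and the substring grounding check are assumptions, not the paper's actual protocol):

```python
def self_sample_paths(question, context, generate, n=8):
    """Sample n candidate reasoning paths from the model itself."""
    return [generate(question, context) for _ in range(n)]

def assess_path(path, context, answer):
    """Toy stand-in for a long-context quality check: accept a path only if
    its final answer is correct and every fact it cites appears verbatim
    in the input context (i.e. the reasoning is grounded, not hallucinated)."""
    if path["answer"] != answer:
        return False
    return all(fact in context for fact in path["facts"])

def build_sft_data(examples, generate, n=8):
    """Keep only (question, path) pairs that pass the quality check,
    forming the process-supervised fine-tuning set."""
    data = []
    for ex in examples:
        for path in self_sample_paths(ex["question"], ex["context"], generate, n):
            if assess_path(path, ex["context"], ex["answer"]):
                data.append({"question": ex["question"], "path": path})
    return data
```

In this view, the quality protocol acts as a filter between sampling and fine-tuning, so the model is trained only on reasoning paths that are both correct and grounded in the long context.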
Dawei Zhu (Peking University)
Xiyu Wei (Peking University)
Guangxiang Zhao (Peking University)
Wenhao Wu (Peking University)
Haosheng Zou (Tsinghua University)
Junfeng Ran (Peking University)
Xun Wang (Peking University)
Lin Sun (Qihoo 360)
Xiangzheng Zhang (360)
Sujian Li (Peking University)