LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

📅 2025-10-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) underperform on complex long-context reasoning tasks, such as high-interference multi-hop question answering, because the advanced thinking patterns these tasks require remain largely unexplored and high-difficulty long-context reinforcement learning (RL) data are scarce; existing RL approaches focus predominantly on short-context "Aha"-style reasoning. This paper introduces LoongRL, a data-driven RL method built on KeyChain, a synthesis framework that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. RL training on KeyChain data induces an emergent *plan → retrieve → reason → recheck* reasoning pattern that generalizes far beyond the training length: models trained at 16K tokens solve 128K-token tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and Qwen2.5-14B, long-context multi-hop QA accuracy improves by +23.5% and +21.1% absolute, respectively; LoongRL-14B reaches 74.2, rivaling o3-mini (74.5) and DeepSeek-R1 (74.9), and passes all 128K-token needle-in-a-haystack stress tests.

📝 Abstract
Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
Problem

Research questions and friction points this paper is trying to address.

Enhancing long-context reasoning in large language models
Addressing scarcity of high-difficulty reinforcement learning data
Developing efficient reasoning methods beyond training context length
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes long-context tasks with UUID chains
Induces plan-retrieve-reason-recheck reasoning pattern
Enables length generalization without full-length training
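The UUID-chain synthesis described above can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, chain format, and prompt wording are my own assumptions about how a question can be hidden behind a chain of keys scattered among distractor documents:

```python
import random
import uuid

def build_keychain_task(true_question, distractor_docs, chain_length=4):
    """Hide a question behind a chain of UUID hops mixed into distractor documents.

    Illustrative sketch only: the real KeyChain format in the paper may differ.
    """
    keys = [str(uuid.uuid4()) for _ in range(chain_length)]
    # Each link reveals the next key; only the final key holds the true question.
    links = [f"Key {keys[i]} points to key {keys[i + 1]}."
             for i in range(chain_length - 1)]
    links.append(f"Key {keys[-1]} holds the question: {true_question}")
    # Scatter chain links among distractors so the model must trace the chain.
    docs = distractor_docs + links
    random.shuffle(docs)
    prompt = (f"Start from key {keys[0]}, follow the chain, "
              f"and answer the hidden question.\n\n" + "\n".join(docs))
    return prompt, keys
```

Solving the resulting task requires exactly the behavior the paper trains for: follow the chain step by step, identify the true question, then retrieve and reason over the relevant facts.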