🤖 AI Summary
To address challenges in multi-hop reasoning and cross-document evidence linking under ultra-long contexts (up to 4M tokens), as well as training instability in long-sequence reinforcement learning (RL), this paper proposes: (1) a novel data synthesis pipeline tailored for long-context instruction tuning; (2) AEPO, an adaptive entropy-controlled policy optimization framework that keeps RL training task-balanced and stable; and (3) a memory-augmented, multi-stage fusion RL architecture. On established long-context benchmarks, the approach matches or exceeds GPT-5 and Gemini-2.5-Pro, with an average gain of +9.90 points over its baseline. On tasks with context lengths of 1M–4M tokens, the memory-augmented agent framework yields a +9.48-point improvement over the agent baseline. The method also strengthens scientific reasoning and generalization in extended dialogues, demonstrating robust scalability and coherence over ultra-long horizons.
📝 Abstract
We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M–4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains such as scientific reasoning, memory tool use, and extended dialogue.
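The abstract's composition step, chaining atomic facts into verifiable multi-hop questions, can be sketched as follows. This is an illustrative two-hop version only: the fact-extraction stage and the exact question templates are not specified here, and every name in the snippet (`compose_two_hop_questions`, the fact-triple format) is an assumption, not the paper's pipeline.

```python
def compose_two_hop_questions(facts):
    """Chain atomic facts (subject, relation, object) that share a bridge
    entity into two-hop questions whose answers are programmatically
    verifiable from the evidence chain. Illustrative sketch only."""
    by_subject = {}
    for s, r, o in facts:
        by_subject.setdefault(s, []).append((s, r, o))
    questions = []
    for s, r1, o1 in facts:
        # The object of hop 1 (o1) is the hidden bridge entity for hop 2.
        for _, r2, o2 in by_subject.get(o1, []):
            q = f"What is the {r2} of the entity that is the {r1} of {s}?"
            questions.append({
                "question": q,
                "answer": o2,
                "evidence": [(s, r1, o1), (o1, r2, o2)],
            })
    return questions
```

Because each question carries its evidence chain, a grader can verify answers deterministically, which is what makes the synthesized data usable as an RL reward signal.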
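For the RL stabilization ideas, a minimal sketch is shown below: per-task advantage normalization so no task's reward scale dominates the gradient, plus a hypothetical proportional controller that adapts an entropy-bonus coefficient toward a target entropy. The paper gives no pseudocode for AEPO; the controller form, hyperparameters, and all names here are assumptions.

```python
import numpy as np

def task_balanced_advantages(rewards, task_ids):
    """Normalize rewards within each task group (task-specific advantage
    estimation) so tasks with larger reward scales do not bias training."""
    rewards = np.asarray(rewards, dtype=float)
    adv = np.zeros_like(rewards)
    for t in set(task_ids):
        mask = np.array([tid == t for tid in task_ids])
        group = rewards[mask]
        adv[mask] = (group - group.mean()) / (group.std() + 1e-8)
    return adv

class AdaptiveEntropyController:
    """Hypothetical controller: raise the entropy bonus when measured policy
    entropy drops below target (too exploitative), lower it when entropy
    overshoots (too exploratory)."""
    def __init__(self, target_entropy, coef=0.01, lr=0.05,
                 coef_min=0.0, coef_max=0.1):
        self.target = target_entropy
        self.coef = coef
        self.lr = lr
        self.coef_min, self.coef_max = coef_min, coef_max

    def update(self, measured_entropy):
        # Proportional step toward the entropy target, clipped to bounds.
        self.coef += self.lr * (self.target - measured_entropy)
        self.coef = float(np.clip(self.coef, self.coef_min, self.coef_max))
        return self.coef
```

The normalized advantages would multiply the policy-gradient term, and `coef` would scale an entropy bonus in the loss; the key point is that the exploration pressure becomes a feedback signal rather than a fixed hyperparameter.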