QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

📅 2025-12-14
🤖 AI Summary
To address multi-hop reasoning and cross-document evidence linking under ultra-long contexts (up to 4M tokens), as well as training instability in long-sequence reinforcement learning (RL), this paper proposes: (1) a data synthesis pipeline tailored for long-context instruction tuning; (2) AEPO, an adaptive entropy-controlled policy optimization framework for task-balanced, stable RL training; and (3) a memory-augmented, multi-stage fusion RL architecture. Evaluated on established long-context benchmarks, the approach matches or exceeds GPT-5 and Gemini-2.5-Pro, gaining +9.90 points on average over its baseline. On tasks with context lengths spanning 1M–4M tokens, the memory-augmented agent framework yields a +9.48-point improvement over the agent baseline. It also improves scientific reasoning and generalization in extended dialogues, demonstrating robust scalability and coherence over ultra-long horizons.

📝 Abstract
We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M–4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains such as scientific reasoning, memory tool use, and extended dialogue.
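The abstract's first contribution, deconstructing documents into atomic facts and programmatically composing verifiable multi-hop questions, can be sketched as a toy pipeline. The paper does not publish its implementation here, so the function names (`extract_atomic_facts`, `compose_multi_hop_question`) and the sentence-level "fact extraction" are illustrative assumptions; the real system would use an LLM extractor and richer fact relationships:

```python
import random


def extract_atomic_facts(doc_id, sentences):
    """Toy stand-in for fact extraction: each non-empty sentence becomes
    one (doc_id, fact) pair. The paper's pipeline would extract atomic
    facts and their relationships with a model, not a string split."""
    return [(doc_id, s.strip()) for s in sentences if s.strip()]


def compose_multi_hop_question(facts, hops=2, rng=None):
    """Programmatically compose a question whose answer requires linking
    `hops` facts drawn from *different* documents, mimicking globally
    distributed evidence rather than single-passage retrieval."""
    rng = rng or random.Random(0)
    pool = {}
    for doc_id, fact in facts:
        pool.setdefault(doc_id, []).append(fact)
    # Pick distinct documents so the evidence chain spans the corpus.
    docs = rng.sample(sorted(pool), k=min(hops, len(pool)))
    chain = [rng.choice(pool[d]) for d in docs]
    question = " and then ".join(f"verify: {f}" for f in chain)
    # The composed answer is checkable against the source facts,
    # which is what makes the synthesized data verifiable for RL.
    return {"question": question, "evidence_docs": docs, "answer": chain[-1]}
```

The key property this sketch preserves is verifiability: because questions are composed from known facts, the reward signal during RL can be computed exactly.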
Problem

Research questions and friction points this paper is trying to address.

How to synthesize training data that demands genuine multi-hop reasoning over globally distributed evidence, rather than simple retrieval.
How to stabilize reinforcement learning over long sequences, where cross-task reward bias and unmanaged exploration-exploitation dynamics destabilize training.
How to handle ultra-long sequences (beyond 4M tokens) that exceed even extended context windows.
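On the RL-stability problem, the abstract names two mechanisms: task-specific advantage estimation to mitigate reward bias, and adaptive entropy control (AEPO) to regulate exploration. The paper's exact formulations are not given on this page, so the following is a minimal sketch under assumed forms: per-task reward standardization, and a scalar entropy coefficient nudged toward a target entropy (all names are hypothetical):

```python
def task_balanced_advantages(rewards_by_task):
    """Assumed form of task-specific advantage estimation: z-score
    rewards within each task group so that no task's reward scale
    dominates the aggregated policy gradient."""
    advantages = {}
    for task, rewards in rewards_by_task.items():
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        std = var ** 0.5
        if std == 0.0:
            std = 1.0  # degenerate group: leave centered rewards unscaled
        advantages[task] = [(r - mean) / std for r in rewards]
    return advantages


def adaptive_entropy_coef(coef, entropy, target, lr=0.01, lo=0.0, hi=1.0):
    """One plausible adaptive-entropy rule: raise the entropy bonus when
    policy entropy falls below a target (push exploration), lower it when
    entropy overshoots (push exploitation), clipped to [lo, hi]."""
    coef += lr * (target - entropy)
    return max(lo, min(hi, coef))
```

Per-task standardization means a task with rewards in [0, 10] contributes advantages on the same scale as a task with rewards in [0, 1], which is one concrete way reward bias across mixed task batches can be removed.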
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-context data synthesis pipeline for reasoning tasks
Stabilized reinforcement learning with adaptive policy optimization
Memory-augmented architecture for ultra-long context processing
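The third innovation fuses single-pass reasoning with iterative memory-based processing for inputs beyond the context window. As a control-flow sketch only (the reasoning, memory-update, and answering steps are model calls in the real system, and all names here are hypothetical):

```python
def answer_long_context(tokens, window, single_pass, update_memory, answer_from_memory):
    """Route by length: inputs that fit in the window get single-pass
    reasoning; longer inputs are consumed chunk by chunk, folding each
    chunk into a bounded memory state that the final answer is drawn from."""
    if len(tokens) <= window:
        return single_pass(tokens)
    memory = ""
    for start in range(0, len(tokens), window):
        memory = update_memory(memory, tokens[start : start + window])
    return answer_from_memory(memory)
```

The paper's multi-stage fusion RL presumably trains both branches jointly so the model behaves consistently whether or not the memory path is taken; this sketch only shows the dispatch structure, not that training procedure.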
👥 Authors
Weizhou Shen (Tongyi Lab, Alibaba Group)
Ziyi Yang (Tongyi Lab, Alibaba Group)
Chenliang Li (Tongyi Lab, Alibaba Group)
Zhiyuan Lu (Tongyi Lab, Alibaba Group)
Miao Peng (The Hong Kong University of Science and Technology (Guangzhou))
Huashan Sun (Beijing Institute of Technology)
Yingcheng Shi (Tongyi Lab, Alibaba Group)
Shengyi Liao (Tongyi Lab, Alibaba Group)
Shaopeng Lai (Tongyi Lab, Alibaba Group)
Bo Zhang (Tongyi Lab, Alibaba Group)
Dayiheng Liu (Tongyi Lab, Alibaba Group)
Fei Huang (Tongyi Lab, Alibaba Group)
Jingren Zhou (Alibaba Group, Microsoft)
Ming Yan (Tongyi Lab, Alibaba Group)