🤖 AI Summary
To address challenges in multi-hop reasoning and cross-document evidence linking under ultra-long contexts (up to 4M tokens), as well as training instability in long-sequence reinforcement learning (RL), this paper proposes: (1) a novel data synthesis pipeline tailored for long-context instruction tuning; (2) AEPO, an adaptive entropy-controlled policy optimization framework that keeps RL training task-balanced and stable; and (3) a memory-augmented, multi-stage fusion RL architecture. On established long-context benchmarks, the approach matches or exceeds GPT-5 and Gemini-2.5-Pro, with an average gain of +9.90 points over its baseline. On tasks with context lengths of 1M–4M tokens, the memory-augmented agent framework yields a +9.48-point improvement over the agent baseline. The method also strengthens scientific reasoning and generalization in extended dialogues, demonstrating robust scalability and coherence over ultra-long horizons.
📝 Abstract
We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M–4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains such as scientific reasoning, memory tool use, and extended dialogue.
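The abstract's composition step, chaining atomic facts into verifiable multi-hop questions, can be sketched as follows. This is an illustrative two-hop version only: the fact-extraction stage and the exact question templates are not specified here, and every name in the snippet (`compose_two_hop_questions`, the fact-triple format) is an assumption, not the paper's pipeline.

```python
def compose_two_hop_questions(facts):
    """Chain atomic facts (subject, relation, object) that share a bridge
    entity into two-hop questions whose answers are programmatically
    verifiable from the evidence chain. Illustrative sketch only."""
    by_subject = {}
    for s, r, o in facts:
        by_subject.setdefault(s, []).append((s, r, o))
    questions = []
    for s, r1, o1 in facts:
        # The object of hop 1 (o1) is the hidden bridge entity for hop 2.
        for _, r2, o2 in by_subject.get(o1, []):
            q = f"What is the {r2} of the entity that is the {r1} of {s}?"
            questions.append({
                "question": q,
                "answer": o2,
                "evidence": [(s, r1, o1), (o1, r2, o2)],
            })
    return questions
```

Because each question carries its evidence chain, a grader can verify answers deterministically, which is what makes the synthesized data usable as an RL reward signal.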
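For the RL stabilization ideas, a minimal sketch is shown below: per-task advantage normalization so no task's reward scale dominates the gradient, plus a hypothetical proportional controller that adapts an entropy-bonus coefficient toward a target entropy. The paper gives no pseudocode for AEPO; the controller form, hyperparameters, and all names here are assumptions.

```python
import numpy as np

def task_balanced_advantages(rewards, task_ids):
    """Normalize rewards within each task group (task-specific advantage
    estimation) so tasks with larger reward scales do not bias training."""
    rewards = np.asarray(rewards, dtype=float)
    adv = np.zeros_like(rewards)
    for t in set(task_ids):
        mask = np.array([tid == t for tid in task_ids])
        group = rewards[mask]
        adv[mask] = (group - group.mean()) / (group.std() + 1e-8)
    return adv

class AdaptiveEntropyController:
    """Hypothetical controller: raise the entropy bonus when measured policy
    entropy drops below target (too exploitative), lower it when entropy
    overshoots (too exploratory)."""
    def __init__(self, target_entropy, coef=0.01, lr=0.05,
                 coef_min=0.0, coef_max=0.1):
        self.target = target_entropy
        self.coef = coef
        self.lr = lr
        self.coef_min, self.coef_max = coef_min, coef_max

    def update(self, measured_entropy):
        # Proportional step toward the entropy target, clipped to bounds.
        self.coef += self.lr * (self.target - measured_entropy)
        self.coef = float(np.clip(self.coef, self.coef_min, self.coef_max))
        return self.coef
```

The normalized advantages would multiply the policy-gradient term, and `coef` would scale an entropy bonus in the loss; the key point is that the exploration pressure becomes a feedback signal rather than a fixed hyperparameter.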