GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

📅 2026-05-19

🤖 AI Summary
Existing long-context reinforcement learning methods often reduce data construction to designing ever more complex retrieval paths, which leads to homogeneous task coverage and reward formulations that poorly reflect practical long-context needs. To address this, the work proposes GoLongRL, a capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). It releases an open-source dataset of 23K RLVR samples spanning nine task types, together with TMN-Reweight, a mechanism that combines task-level mean normalization with difficulty-adaptive weighting to balance heterogeneous rewards across tasks. Under the same vanilla GRPO setup, training on this dataset alone outperforms training on the closed-source QwenLong-L1.5 dataset, and the resulting Qwen3-30B-A3B model reaches long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507.
📝 Abstract
We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.
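The abstract names the two ingredients of TMN-Reweight but does not give formulas, so the following is only a minimal sketch of how task-level mean normalization might sit in front of a group-relative advantage. Everything concrete here is an assumption: the function name `tmn_reweight_sketch`, dividing each reward by its task's batch mean, the mean-centered baseline without per-group standard deviation, and the `weight_fn` hook, which stands in for the difficulty-adaptive weight and defaults to a constant because the abstract does not state its form.

```python
from collections import defaultdict
from statistics import mean


def tmn_reweight_sketch(groups, weight_fn=lambda rewards: 1.0, eps=1e-6):
    """Hypothetical illustration, not the paper's implementation.

    groups: list of (task_name, rewards) pairs, one pair per prompt, where
            rewards holds the scores of that prompt's sampled rollouts.
    weight_fn: placeholder for a difficulty-adaptive weight; it maps a
               group's raw rewards to a scalar (assumed interface).
    Returns one list of advantages per group, in input order.
    """
    # Assumed step 1: mean reward of each task over the whole batch.
    per_task = defaultdict(list)
    for task, rewards in groups:
        per_task[task].extend(rewards)
    task_mean = {task: mean(rs) for task, rs in per_task.items()}

    advantages = []
    for task, rewards in groups:
        # Assumed step 2: rescale by the task mean so that metrics with
        # different natural ranges land on a comparable scale.
        scaled = [r / (task_mean[task] + eps) for r in rewards]
        # Group-relative baseline: subtract the group's own mean.
        baseline = mean(scaled)
        # Assumed step 3: apply the (unspecified) difficulty weight.
        weight = weight_fn(rewards)
        advantages.append([weight * (s - baseline) for s in scaled])
    return advantages


# Two tasks whose metrics live on different scales (values are made up).
batch = [
    ("qa_exact_match", [1.0, 0.0, 1.0, 0.0]),
    ("summarization_overlap", [0.10, 0.00, 0.10, 0.00]),
]
advantages = tmn_reweight_sketch(batch)
```

With plain mean-centering, the first group's advantages would be ten times larger than the second's simply because of the metric's range. After the assumed rescaling, the two groups produce advantages of nearly equal magnitude. Note that this only matters when the advantage is not also divided by the per-group standard deviation, since that division would already cancel a constant per-task scale.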
Problem

Research questions and friction points this paper is trying to address.

long-context reinforcement learning
reward formulation
task heterogeneity
capability-oriented learning
multitask alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

capability-oriented RL
long-context reinforcement learning
multitask alignment
TMN-Reweight
verifiable rewards