🤖 AI Summary
This work addresses the loss of accuracy and coherence that large language models often show on long-context tasks. Existing post-training methods each fall short: off-policy methods such as supervised fine-tuning suffer from exposure bias, on-policy reinforcement learning is unstable and sample-inefficient under sparse rewards, and on-policy distillation cannot directly optimize arbitrary reward signals. The authors propose dGRPO (Distilled Group Relative Policy Optimization), which combines Group Relative Policy Optimization (GRPO) with On-Policy Distillation (OPD) from a stronger teacher model in a single objective, pairing outcome-driven policy optimization with dense token-level guidance. The study also introduces LongBlocks, a synthetic long-context dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. Experiments and ablations compare off-policy training, sparse-reward GRPO, and the combined approach; the reported results indicate that dGRPO offers a more stable and effective path to long-context reasoning while preserving short-context capabilities.
📝 Abstract
Adapting large language models (LLMs) to long-context tasks requires post-training methods that remain accurate and coherent over thousands of tokens. Existing approaches are limited in several ways: 1) off-policy methods such as supervised fine-tuning (SFT) and knowledge distillation (KD) suffer from exposure bias and limited recovery from model-generated errors over long horizons; 2) on-policy reinforcement learning methods such as Group Relative Policy Optimization (GRPO) better align training with model-generated states, but are unstable and sample-inefficient due to sparse rewards; 3) on-policy distillation (OPD) provides dense token-level guidance, but does not directly optimize arbitrary reward signals. In this paper, we propose Distilled Group Relative Policy Optimization (dGRPO), a method for long-context reasoning that augments GRPO with dense guidance from a stronger teacher via OPD. We also introduce LongBlocks, a synthetic long-context dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. We conduct extensive experiments and ablations comparing off-policy training, sparse-reward GRPO, and our combined approach, leading to an improved recipe for long-context alignment. Overall, our results show that combining outcome-based policy optimization with knowledge distillation in a single objective provides a more stable and effective path to long-context reasoning, while preserving short-context capabilities.
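To make the idea of a single objective concrete, here is a minimal sketch of how a group-relative policy term and a token-level distillation term might be summed into one loss. It is not the paper's formulation, which the abstract does not give. The function names, the weight `beta`, the use of reverse KL as the distillation term, and the omission of GRPO's clipped importance ratio are all assumptions made for illustration.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each completion's reward within its sampled group (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one token position.

    Assumes teacher_probs[k] > 0 wherever student_probs[k] > 0.
    """
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)

def combined_loss(token_logps, rewards, student_dists, teacher_dists, beta=0.1):
    """Hypothetical combined objective, averaged over all sampled tokens.

    token_logps[i][t]   : student log-prob of the sampled token t in completion i
    rewards[i]          : sparse outcome reward for completion i
    student_dists[i][t] : student distribution over the vocabulary at position t
    teacher_dists[i][t] : teacher distribution at the same position
    beta                : assumed weight on the distillation term
    """
    advantages = group_relative_advantages(rewards)
    total, n_tokens = 0.0, 0
    for i, adv in enumerate(advantages):
        for t, logp in enumerate(token_logps[i]):
            policy_term = -adv * logp  # simplified, unclipped policy-gradient surrogate
            distill_term = reverse_kl(student_dists[i][t], teacher_dists[i][t])
            total += policy_term + beta * distill_term
            n_tokens += 1
    return total / n_tokens
```

In this sketch the reward reaches each completion only through its group-standardized advantage, while the teacher term contributes a signal at every token of the student's own samples. That mirrors the contrast the abstract draws between sparse outcome rewards and dense token-level guidance.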