🤖 AI Summary
This work identifies Lazy Likelihood Displacement (LLD) as the fundamental mechanism underlying training collapse in Group Relative Policy Optimization (GRPO) for tool-integrated reinforcement learning: the simultaneous decay of likelihoods for both correct and incorrect responses, triggering an “LLD death spiral.” To address this, we propose LLDS, a fine-grained likelihood-preserving regularization that activates only when likelihood decreases and applies exclusively to critical tokens, integrating dynamic likelihood monitoring with gradient modulation. This work is the first to explicitly identify LLD as the core cause of GRPO failure in search-augmented reasoning, and LLDS enables lightweight, precise, and adaptive training stabilization. Evaluated on seven open-domain and multi-hop question-answering benchmarks, LLDS improves Qwen2.5-3B and Qwen2.5-7B by 37.8% and 32.0%, respectively, while substantially mitigating gradient explosion and enabling stable, scalable training.
📝 Abstract
Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that make it appealing for this setting, yet it consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflated gradients, and ultimately collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question-answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose LLDS, a lightweight likelihood-preserving regularization for GRPO that activates only when a trajectory's likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% on Qwen2.5-3B and +32.0% on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TI RL and provide a practical path toward stable, scalable training of tool-integrated LLMs.
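To make the gating behavior described above concrete, here is a minimal sketch of a penalty term with the two properties the abstract attributes to LLDS: it activates only when the trajectory's likelihood has decreased, and it penalizes only the tokens responsible for the drop. The function name, the hinge form of the penalty, the `critical_mask` selection, and the `beta` weight are all illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def lld_penalty(old_logps, new_logps, critical_mask, beta=0.1):
    """Illustrative likelihood-preserving penalty (not the paper's exact method).

    old_logps, new_logps: per-token log-probabilities of the sampled
        trajectory under the previous and current policy.
    critical_mask: boolean array marking the "critical" tokens the penalty
        is restricted to (how these are identified is an assumption here).
    beta: hypothetical penalty weight.
    """
    old_logps = np.asarray(old_logps, dtype=float)
    new_logps = np.asarray(new_logps, dtype=float)
    mask = np.asarray(critical_mask, dtype=bool)

    # Gate: activate only when the trajectory's total log-likelihood
    # has decreased, per the "activates only when likelihood decreases"
    # behavior described above.
    if new_logps.sum() >= old_logps.sum():
        return 0.0

    # Penalize only critical tokens whose own log-prob dropped,
    # via a hinge on the per-token decrease.
    drop = np.maximum(old_logps - new_logps, 0.0)
    return beta * float(drop[mask].sum())
```

In a training loop, this scalar would be added to the GRPO loss so that the extra gradient pushes declining token likelihoods back up while leaving improving trajectories untouched.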