Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

📅 2026-05-10

🤖 AI Summary
This work addresses a limitation of existing reinforcement learning post-training methods for large language models: their reliance on gold labels or domain-specific verifiers, which restricts scaling to new tasks and domains. The authors propose VIGOR, an intrinsic reward derived from the ℓ₂ norm of the policy model's teacher-forced negative log-likelihood gradient, which removes the need for an external verifier. Within each group of sampled completions, outputs that induce smaller gradient norms receive higher rewards. A √T length correction and group-wise rank shaping keep the reward scale stable across prompts during RL optimization. On Qwen2.5-7B-Base post-trained on MATH, VIGOR improves average math accuracy by +3.31% and average code accuracy by +1.91% over the state-of-the-art RLIF baseline, despite being trained only on math data, and it shows more stable training dynamics.
📝 Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising post-training paradigm for Large Language Models (LLMs), its dependency on the gold label or domain-specific verifiers limits its scalability to new tasks and domains. In this work, we propose Verifier-free Intrinsic Gradient-Norm Reward (VIGOR), a simple reward that uses only the policy model itself. Given a prompt, VIGOR samples a group of completions and assigns higher within-group rewards to outputs that induce smaller $\ell_2$ norms of the teacher-forced negative log-likelihood gradients under the current parameters. Intuitively, lower gradient norms suggest the completion aligns better with the current policy, serving as an intrinsic preference signal for policy optimization. To make this intrinsic signal practical for RL, we correct the systematic length bias of averaged token-level gradients with a $\sqrt{T}$ scaling, and apply group-wise rank shaping to stabilize reward scales across prompts. Across mathematical reasoning benchmarks, VIGOR outperforms the state-of-the-art Reinforcement Learning from Internal Feedback (RLIF) baseline, and it also exhibits cross-domain transfer to code benchmarks when trained only on math data. For instance, on Qwen2.5-7B-Base post-trained on MATH, VIGOR improves the average math accuracy by +3.31% and the average code accuracy by +1.91% over this baseline, while exhibiting more stable training dynamics. The code is available at https://github.com/ZJUSCL/VIGOR.
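The reward pipeline described in the abstract (teacher-forced gradient norm, $\sqrt{T}$ length correction, group-wise rank shaping) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy bigram policy whose only parameters are a logit table, so the gradient of the mean negative log-likelihood has the closed form softmax minus one-hot. The function names (`nll_grad_norm`, `vigor_style_rewards`), the multiplication by $\sqrt{T}$ as the length correction, and the linear mapping of ranks onto $[-1, 1]$ are all illustrative assumptions; the paper's exact formulas may differ.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nll_grad_norm(logit_table, prompt_last_token, completion):
    """l2 norm of the gradient of the mean teacher-forced NLL.

    Assumption: the "policy" is a toy bigram model, logit_table[prev][next],
    standing in for an LLM. For softmax logits, d(NLL)/d(logits) is
    (probs - one_hot(target)) at each position.
    """
    vocab = len(logit_table[0])
    grad = [[0.0] * vocab for _ in logit_table]
    T = len(completion)
    prev = prompt_last_token
    for tok in completion:
        probs = softmax(logit_table[prev])
        for v in range(vocab):
            target = 1.0 if v == tok else 0.0
            grad[prev][v] += (probs[v] - target) / T  # mean over T tokens
        prev = tok  # teacher forcing: condition on the sampled token
    return math.sqrt(sum(g * g for row in grad for g in row))

def vigor_style_rewards(logit_table, prompt_last_token, completions):
    """Rank-shaped rewards for one prompt's group of completions."""
    # Assumed length correction: scale the mean-gradient norm by sqrt(T).
    scores = [
        math.sqrt(len(c)) * nll_grad_norm(logit_table, prompt_last_token, c)
        for c in completions
    ]
    # Smaller scaled norm means better rank (rank 0 is best).
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(scores)
    rewards = [0.0] * n
    for rank, i in enumerate(order):
        # Assumed shaping: evenly spaced rewards from +1 (best) to -1 (worst).
        rewards[i] = 1.0 - 2.0 * rank / (n - 1) if n > 1 else 0.0
    return rewards
```

Because only ranks within the group are used, the rewards have the same scale for every prompt regardless of how large the raw gradient norms are, which is the stabilizing effect the abstract attributes to rank shaping. With a real LLM, the closed-form gradient would be replaced by backpropagation through the policy's parameters.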
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Large Language Models
Verifier-Free
Scalability
Post-Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Verifier-free RL
Intrinsic Reward
Gradient-Norm Reward
Cross-domain Transfer
LLM Alignment