Probing RLVR training instability through the lens of objective-level hacking

📅 2026-02-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the training instability commonly observed in reinforcement learning with verifiable rewards (RLVR) when applied to Mixture-of-Experts (MoE) architectures, which manifests as an abnormally widened gap between training and inference performance; the underlying mechanism has remained unclear. The authors propose an analytical framework termed "objective-level hacking" and attribute this instability to spurious signals embedded in the optimization objective, which arise from misaligned token-level credit assignment. Through theoretical modeling and empirical validation on a 30-billion-parameter MoE model, they establish the causal role of these spurious signals. The study systematically uncovers the pathological training dynamics of RLVR in MoE systems, offering both theoretical insight and practical guidance for designing stable and efficient reinforcement learning algorithms for large-scale sparse models.

📝 Abstract
Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the training is often prone to instabilities, especially in Mixture-of-Experts (MoE) architectures. Training instability severely undermines capability gains, yet its underlying causes and mechanisms remain poorly understood. In this work, we introduce a principled framework for understanding RLVR instability through the lens of objective-level hacking. Unlike reward hacking, which arises from exploitable verifiers, objective-level hacking emerges from token-level credit misalignment and manifests as system-level spurious signals in the optimization objective. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key pathological training dynamic in MoE models: the abnormal growth of the training-inference discrepancy, a phenomenon widely associated with instability but previously lacking a mechanistic explanation. These findings provide a concrete and causal account of the training dynamics underlying instabilities in MoE models, offering guidance for the design of stable RLVR algorithms.
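The training-inference discrepancy mentioned above is commonly tracked in RLVR pipelines by comparing the log-probabilities the training forward pass assigns to sampled tokens against those reported by the inference engine that generated them. The sketch below is purely illustrative of that kind of monitoring, under assumed inputs; the paper's exact formulation of the discrepancy is not given in this listing, and `train_infer_discrepancy` is a hypothetical helper name.

```python
def train_infer_discrepancy(train_logprobs, infer_logprobs):
    """Mean absolute per-token log-probability gap between the training
    forward pass and the inference engine on the same sampled tokens.

    Illustrative monitoring metric only; not the paper's formulation.
    Inputs are parallel lists of per-token log-probabilities.
    """
    assert len(train_logprobs) == len(infer_logprobs) > 0
    gaps = [t - i for t, i in zip(train_logprobs, infer_logprobs)]
    return sum(abs(g) for g in gaps) / len(gaps)


# When the two passes agree exactly, the discrepancy is zero; a value
# that grows over training steps is the pathological dynamic described.
print(train_infer_discrepancy([-1.0, -2.0], [-1.0, -2.0]))  # 0.0
```

In practice, a metric like this is logged every few steps: stable runs keep it near zero, while the abnormal growth the abstract describes shows up as a steady upward drift.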
Problem

Research questions and friction points this paper is trying to address.

RLVR
training instability
Mixture-of-Experts
objective-level hacking
credit misalignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

objective-level hacking
RLVR instability
Mixture-of-Experts
credit misalignment
training-inference discrepancy
Yiming Dong
Qwen Team, Alibaba Group & Peking University
Machine Learning, Optimization Methods
Kun Fu
Alibaba Cloud Computing
Haoyu Li
Alibaba Group
Xinyuan Zhu
Alibaba Cloud Computing
Yurou Liu
Renmin University of China
AI4Science
Lijing Shao
Kavli Institute for Astronomy and Astrophysics, Peking University; National Astronomical Observatories, Chinese Academy of Sciences
Jieping Ye
Alibaba Cloud Computing
Zheng Wang
Alibaba Cloud Computing