🤖 AI Summary
This work addresses the instability commonly observed when reinforcement learning with verifiable rewards (RLVR) is applied to Mixture-of-Experts (MoE) architectures, which manifests as an abnormally widened gap between training and inference performance whose underlying mechanism has remained unclear. The authors propose an analytical framework termed “objective-level hacking” and attribute this instability to spurious signals embedded in the optimization objective, arising from misaligned token-level credit assignment. Through theoretical modeling and empirical validation on a 30-billion-parameter MoE model, they establish the causal role of these misleading signals. The study systematically uncovers the pathological training dynamics of RLVR in MoE systems, offering both theoretical insight and practical guidance for designing stable and efficient reinforcement learning algorithms for large-scale sparse models.
📝 Abstract
Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the training is often prone to instabilities, especially in Mixture-of-Experts (MoE) architectures. Training instability severely undermines model capability improvement, yet its underlying causes and mechanisms remain poorly understood. In this work, we introduce a principled framework for understanding RLVR instability through the lens of objective-level hacking. Unlike reward hacking, which arises from exploitable verifiers, objective-level hacking emerges from token-level credit misalignment and manifests as system-level spurious signals in the optimization objective. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key pathological training dynamic in MoE models: the abnormal growth of the training-inference discrepancy, a phenomenon widely associated with instability but previously lacking a mechanistic explanation. These findings provide a concrete and causal account of the training dynamics underlying instabilities in MoE models, offering guidance for the design of stable RLVR algorithms.
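The training-inference discrepancy the abstract refers to is, in practice, the per-token mismatch between the probabilities the trainer's forward pass assigns to a rollout and the probabilities the inference engine assigned when sampling it. A minimal sketch of how such a gap might be monitored is below; this is an illustration of the general metric, not the paper's own implementation, and the function name and KL-estimator choice are assumptions.

```python
import math

def train_infer_gap(train_logprobs, infer_logprobs):
    """Proxy for the training-inference discrepancy: an estimate of
    KL(pi_infer || pi_train) over the sampled tokens of one rollout.

    Both arguments are lists of log-probabilities of the *sampled*
    tokens, one entry per token: `train_logprobs` from the trainer's
    forward pass, `infer_logprobs` from the inference engine.
    """
    assert len(train_logprobs) == len(infer_logprobs)
    # Per-token importance ratio pi_train(token) / pi_infer(token).
    ratios = [math.exp(t - i) for t, i in zip(train_logprobs, infer_logprobs)]
    # The "k3" KL estimator r - 1 - log(r), which is non-negative
    # per token and zero only when the two policies agree exactly.
    return sum(r - 1.0 - math.log(r) for r in ratios) / len(ratios)
```

In an RLVR loop, this quantity would typically be logged per training step; the pathology the paper studies corresponds to this gap growing abnormally over prolonged training rather than staying near zero.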