Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

📅 2025-09-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the sparse reward signals and unstable policy gradient updates that Reinforcement Learning with Verifiable Rewards (RLVR) faces on complex reasoning tasks such as mathematics and programming, this paper proposes PACS, a novel framework that treats the verifiable reward as a predictable label, thereby recasting RLVR entirely as a supervised learning problem. PACS optimizes a scoring function parameterized by the policy model with a cross-entropy loss, implicitly coupling the actor and critic roles without an explicit value network or policy-gradient updates, which avoids the high gradient variance and training instability of conventional RL and improves training efficiency and generalization. On the AIME 2025 mathematical reasoning benchmark, PACS achieves 59.78% pass@256, outperforming PPO and GRPO by 13.32 and 14.36 percentage points respectively.

📝 Abstract
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address these challenges, we propose **PACS**, a novel RLVR framework that achieves im**P**licit **A**ctor **C**ritic coupling via a **S**upervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling actor and critic roles, yielding more stable and efficient training. Benchmarking on challenging mathematical reasoning tasks, PACS outperforms strong RLVR baselines, such as PPO and GRPO, achieving superior reasoning performance. For instance, PACS achieves 59.78% at pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO. This simple yet powerful framework offers a promising avenue for LLM post-training with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
Problem

Research questions and friction points this paper is trying to address.

Addresses sparse reward signals in RLVR training
Solves unstable policy gradient updates in RL
Reformulates RLVR as supervised learning task
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised learning reformulates RLVR problem
Cross-entropy loss optimizes score function
Implicit actor-critic coupling stabilizes training
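The recipe above (verifiable reward as a label, cross-entropy over a policy-parameterized score) can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: it assumes the outcome reward is a binary label r ∈ {0, 1} and that some scalar score for a candidate answer is available (in PACS this score is derived from the policy model itself; here it is just a free parameter). Training then reduces to logistic-regression-style cross-entropy on that score.

```python
import math

def bce_loss(score, reward):
    """Cross-entropy between sigmoid(score) and the verifiable reward label (0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-score))  # predicted probability the answer is correct
    eps = 1e-12  # guard against log(0)
    return -(reward * math.log(p + eps) + (1 - reward) * math.log(1 - p + eps))

def bce_grad(score, reward):
    """d(loss)/d(score) = sigmoid(score) - reward: the standard logistic-regression gradient."""
    return 1.0 / (1.0 + math.exp(-score)) - reward

def sgd_step(score, reward, lr=1.0):
    """One supervised update: verified-correct answers (reward=1) push the score up,
    verified-wrong answers (reward=0) push it down."""
    return score - lr * bce_grad(score, reward)
```

The point of the sketch is the shape of the update: because the gradient is simply `sigmoid(score) - reward`, it is bounded and dense, in contrast to the high-variance policy-gradient estimators the paper argues against.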
Jiaming Li
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Longze Chen
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Natural Language Processing
Ze Gong
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Yukun Chen
Pieces Technologies Inc.
Natural Language Processing
Lu Wang
Ritzz-AI
Wanwei He
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Run Luo
University of Chinese Academy of Sciences
Text & Video & Audio Pretraining, VLM, VLA, RL, 3DV
Min Yang
Bytedance
Vision Language Model, Computer Vision, Video Understanding