Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

📅 2025-09-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the sparse reward signals and unstable policy gradient updates that Reinforcement Learning with Verifiable Rewards (RLVR) faces on complex reasoning tasks such as mathematics and programming, this paper proposes PACS, a novel framework that treats the verifiable reward as a predictable label, thereby recasting RLVR entirely as a supervised learning problem. PACS optimizes a scoring function parameterized by the policy model with a cross-entropy loss, implicitly coupling the actor and critic roles without an explicit value network or policy-gradient updates, which avoids the high gradient variance and training instability of conventional RL and improves training efficiency and generalization. On the AIME 2025 mathematical reasoning benchmark, PACS achieves 59.78% pass@256, outperforming PPO and GRPO by 13.32 and 14.36 percentage points respectively.

📝 Abstract
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address these challenges, we propose **PACS**, a novel RLVR framework that achieves im**P**licit **A**ctor **C**ritic coupling via a **S**upervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling actor and critic roles, yielding more stable and efficient training. Benchmarking on challenging mathematical reasoning tasks, PACS outperforms strong RLVR baselines, such as PPO and GRPO, achieving superior reasoning performance. For instance, PACS achieves 59.78% at pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO. This simple yet powerful framework offers a promising avenue for LLM post-training with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
Problem

Research questions and friction points this paper is trying to address.

Addresses sparse reward signals in RLVR training
Solves unstable policy gradient updates in RL
Reformulates RLVR as supervised learning task
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised learning reformulates RLVR problem
Cross-entropy loss optimizes score function
Implicit actor-critic coupling stabilizes training
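The recipe above (verifiable reward as a label, cross-entropy over a policy-parameterized score) can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: it assumes the outcome reward is a binary label r ∈ {0, 1} and that some scalar score for a candidate answer is available (in PACS this score is derived from the policy model itself; here it is just a free parameter). Training then reduces to logistic-regression-style cross-entropy on that score.

```python
import math

def bce_loss(score, reward):
    """Cross-entropy between sigmoid(score) and the verifiable reward label (0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-score))  # predicted probability the answer is correct
    eps = 1e-12  # guard against log(0)
    return -(reward * math.log(p + eps) + (1 - reward) * math.log(1 - p + eps))

def bce_grad(score, reward):
    """d(loss)/d(score) = sigmoid(score) - reward: the standard logistic-regression gradient."""
    return 1.0 / (1.0 + math.exp(-score)) - reward

def sgd_step(score, reward, lr=1.0):
    """One supervised update: verified-correct answers (reward=1) push the score up,
    verified-wrong answers (reward=0) push it down."""
    return score - lr * bce_grad(score, reward)
```

The point of the sketch is the shape of the update: because the gradient is simply `sigmoid(score) - reward`, it is bounded and dense, in contrast to the high-variance policy-gradient estimators the paper argues against.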
Jiaming Li
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Longze Chen
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Natural Language Processing
Ze Gong
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Yukun Chen
Pieces Technologies Inc.
Natural Language Processing
Lu Wang
Ritzz-AI
Wanwei He
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Run Luo
University of Chinese Academy of Sciences
Text & Video & Audio Pretraining, VLM, VLA, RL, 3DV
Min Yang
Bytedance
Vision Language Model, Computer Vision, Video Understanding