Asymmetric Prompt Weighting for Reinforcement Learning with Verifiable Rewards

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the weak gradient signals that arise in reinforcement learning with verifiable rewards when prompts are extremely difficult or trivially easy, which slows convergence in from-scratch RL (as in R1-Zero), where training begins in a low-success regime. The authors propose and theoretically analyze an asymmetric prompt weighting mechanism that assigns higher gradient weights to prompts with low, or even zero, empirical success rates, and they derive the theoretically optimal form of these weights. Integrated into a verifiable-reward reinforcement learning framework, the approach is compatible with from-scratch training setups such as R1-Zero. It significantly accelerates convergence in sparse-reward, response-cost-dominated settings, particularly improving optimization efficiency during the early, low-accuracy stages of training.
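As context for why gradient signal vanishes at the extremes, here is a minimal sketch (not from the paper) of GRPO-style group normalization with binary verifiable rewards; the function name and the `eps` smoothing term are illustrative assumptions:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages for one prompt's sampled responses.

    With binary verifiable rewards, the group mean equals the empirical
    success rate p_hat and the group std equals sqrt(p_hat * (1 - p_hat)),
    so groups that are all-correct or all-wrong produce (near-)zero
    advantages and contribute almost no gradient signal.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # p_hat = 0.5: strong signal
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # p_hat = 0.0: no signal
```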

📝 Abstract
Reinforcement learning with verifiable rewards has driven recent advances in LLM post-training, in particular for reasoning. Policy optimization algorithms generate a number of responses for a given prompt and then effectively weight the corresponding gradients depending on the rewards. The most popular algorithms, including GRPO, DAPO, and RLOO, focus on ambiguous prompts, i.e., prompts with intermediate success probability, while downweighting gradients from very easy and very hard prompts. In this paper, we consider asymmetric prompt weightings that assign higher weights to prompts with low, or even zero, empirical success probability. We find that asymmetric weighting particularly benefits from-scratch RL (as in R1-Zero), where training traverses a wide accuracy range, and helps less in post-SFT RL, where the model already starts at high accuracy. We also provide theory characterizing the prompt weights that minimize the time needed to raise success probability from an initial level to a target accuracy under a fixed update budget. In low-success regimes, where informative responses are rare and response cost dominates, these optimal weights become asymmetric, upweighting low success probabilities and thereby accelerating effective-time convergence.
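The paper derives the optimal form of the asymmetric weights theoretically; the sketch below only illustrates the general idea with an assumed functional form, 1/sqrt(max(p_hat, p_min)), which upweights low-success prompts while remaining finite at zero empirical success. The function and parameter names here are illustrative, not the paper's:

```python
import numpy as np

def asymmetric_prompt_weights(p_hat, p_min=0.05):
    """Illustrative asymmetric prompt weights (assumed form, not the
    paper's derived optimum): the weight grows as the empirical success
    rate falls and stays finite at p_hat = 0, so the hardest prompts
    keep contributing gradient signal instead of being filtered out.
    """
    p_hat = np.asarray(p_hat, dtype=float)
    return 1.0 / np.sqrt(np.maximum(p_hat, p_min))

# Empirical per-prompt success rates from a sampled response group:
p_hat = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
w = asymmetric_prompt_weights(p_hat)
print(w / w.sum())  # normalized weights concentrate on low-success prompts
```

In an RL loop, such weights would plausibly multiply each prompt's per-group gradient contribution (or its sampling probability in a prompt curriculum), rather than entering the per-response advantage itself.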
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
verifiable rewards
prompt weighting
asymmetric weighting
low-success prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

asymmetric prompt weighting
verifiable rewards
reinforcement learning
low-success regime
gradient weighting