Asymmetric Prompt Weighting for Reinforcement Learning with Verifiable Rewards

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the weak gradient signals that arise in reinforcement learning with verifiable rewards when prompts are extremely difficult or trivially easy, which slows convergence in from-scratch RL (as in R1-Zero), where training begins in a low-success regime. The authors propose and theoretically analyze an asymmetric prompt weighting mechanism that assigns higher gradient weights to prompts with low, or even zero, empirical success rates, and they derive the theoretically optimal form of these weights. Integrated into a verifiable-reward reinforcement learning framework, the approach is compatible with from-scratch training setups such as R1-Zero. It significantly accelerates convergence in sparse-reward, response-cost-dominated settings, particularly improving optimization efficiency during the early, low-accuracy stages of training.
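As context for why gradient signal vanishes at the extremes, here is a minimal sketch (not from the paper) of GRPO-style group normalization with binary verifiable rewards; the function name and the `eps` smoothing term are illustrative assumptions:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages for one prompt's sampled responses.

    With binary verifiable rewards, the group mean equals the empirical
    success rate p_hat and the group std equals sqrt(p_hat * (1 - p_hat)),
    so groups that are all-correct or all-wrong produce (near-)zero
    advantages and contribute almost no gradient signal.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # p_hat = 0.5: strong signal
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # p_hat = 0.0: no signal
```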

📝 Abstract
Reinforcement learning with verifiable rewards has driven recent advances in LLM post-training, in particular for reasoning. Policy optimization algorithms generate a number of responses for a given prompt and then effectively weight the corresponding gradients depending on the rewards. The most popular algorithms, including GRPO, DAPO, and RLOO, focus on ambiguous prompts, i.e., prompts with intermediate success probability, while downweighting gradients from very easy and very hard prompts. In this paper, we consider asymmetric prompt weightings that assign higher weights to prompts with low, or even zero, empirical success probability. We find that asymmetric weighting particularly benefits from-scratch RL (as in R1-Zero), where training traverses a wide accuracy range, and helps less in post-SFT RL, where the model already starts at high accuracy. We also provide theory characterizing the prompt weights that minimize the time needed to raise success probability from an initial level to a target accuracy under a fixed update budget. In low-success regimes, where informative responses are rare and response cost dominates, these optimal weights become asymmetric, upweighting low success probabilities and thereby accelerating effective-time convergence.
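The paper derives the optimal form of the asymmetric weights theoretically; the sketch below only illustrates the general idea with an assumed functional form, 1/sqrt(max(p_hat, p_min)), which upweights low-success prompts while remaining finite at zero empirical success. The function and parameter names here are illustrative, not the paper's:

```python
import numpy as np

def asymmetric_prompt_weights(p_hat, p_min=0.05):
    """Illustrative asymmetric prompt weights (assumed form, not the
    paper's derived optimum): the weight grows as the empirical success
    rate falls and stays finite at p_hat = 0, so the hardest prompts
    keep contributing gradient signal instead of being filtered out.
    """
    p_hat = np.asarray(p_hat, dtype=float)
    return 1.0 / np.sqrt(np.maximum(p_hat, p_min))

# Empirical per-prompt success rates from a sampled response group:
p_hat = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
w = asymmetric_prompt_weights(p_hat)
print(w / w.sum())  # normalized weights concentrate on low-success prompts
```

In an RL loop, such weights would plausibly multiply each prompt's per-group gradient contribution (or its sampling probability in a prompt curriculum), rather than entering the per-response advantage itself.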
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
verifiable rewards
prompt weighting
asymmetric weighting
low-success prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

asymmetric prompt weighting
verifiable rewards
reinforcement learning
low-success regime
gradient weighting