No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM reinforcement learning methods (e.g., GRPO) discard “zero-variance prompts”—inputs for which all model responses receive identical rewards—thereby overlooking valuable learning signals. Method: This paper systematically investigates the optimization potential in such prompts and proposes entropy-guided advantage shaping: leveraging token-level entropy to quantify response certainty, enabling fine-grained discrimination between correct and incorrect outputs even when response-level rewards are indistinguishable; entropy is integrated as a modulation factor into the advantage function within a verifiable reward framework, extending GRPO’s algorithmic structure. Contribution/Results: Evaluated on six mathematical reasoning benchmarks, our method achieves up to +8.61 points in accuracy and +7.77 points in pass rate over baseline GRPO, significantly outperforming zero-variance prompt filtering. The core contribution is the formal identification and modeling of informational value in zero-variance prompts, establishing a novel paradigm for LLM reinforcement learning.
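The summary above describes the core mechanism: when every response in a group gets the same reward, token-level entropy is used as a modulation factor so the group still yields a per-token learning signal. The paper's exact shaping function is not reproduced here; the sketch below is a minimal, hypothetical illustration (the names `token_entropy`, `zvp_advantages`, and the linear entropy weighting are assumptions, not the authors' formulation).

```python
import numpy as np

def token_entropy(probs):
    # Shannon entropy (nats) of each token's predictive distribution.
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def zvp_advantages(token_probs, all_correct, max_entropy):
    """Hypothetical entropy-modulated advantages for a zero-variance prompt.

    token_probs: (T, V) array of per-token distributions for one response.
    all_correct: True if every response in the sampled group was correct.
    max_entropy: normalizer, e.g. log(vocab_size).

    Standard GRPO would assign zero advantage here (no within-group reward
    variance); this sketch instead rewards correctness (+) or penalizes
    errors (-), scaled per token by the model's uncertainty.
    """
    h = token_entropy(token_probs)   # (T,) per-token uncertainty
    w = h / max_entropy              # normalize to [0, 1]
    sign = 1.0 if all_correct else -1.0
    return sign * w
```

Under this toy shaping, confident tokens receive a near-zero update while uncertain tokens carry most of the signal, which is one plausible way to realize "fine-grained discrimination" when response-level rewards are indistinguishable.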

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model's responses to the same input differ in correctness, while ignoring those where all responses receive the same reward, so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
Problem

Research questions and friction points this paper is trying to address.

Exploiting zero-variance prompts in LLM reinforcement learning
Extracting learning signals from uniformly rewarded responses
Improving reasoning accuracy via entropy-guided advantage shaping
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL-ZVP algorithm extracts learning signals from zero-variance prompts
Modulates feedback using token-level characteristics for nuanced signals
Directly rewards correctness and penalizes errors without contrasting responses