No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM reinforcement learning methods (e.g., GRPO) discard “zero-variance prompts”—inputs for which all model responses receive identical rewards—thereby overlooking valuable learning signals. Method: This paper systematically investigates the optimization potential in such prompts and proposes entropy-guided advantage shaping: leveraging token-level entropy to quantify response certainty, enabling fine-grained discrimination between correct and incorrect outputs even when response-level rewards are indistinguishable; entropy is integrated as a modulation factor into the advantage function within a verifiable reward framework, extending GRPO’s algorithmic structure. Contribution/Results: Evaluated on six mathematical reasoning benchmarks, our method achieves up to +8.61 points in accuracy and +7.77 points in pass rate over baseline GRPO, significantly outperforming zero-variance prompt filtering. The core contribution is the formal identification and modeling of informational value in zero-variance prompts, establishing a novel paradigm for LLM reinforcement learning.
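The summary above describes the core mechanism: when every response in a group gets the same reward, token-level entropy is used as a modulation factor so the group still yields a per-token learning signal. The paper's exact shaping function is not reproduced here; the sketch below is a minimal, hypothetical illustration (the names `token_entropy`, `zvp_advantages`, and the linear entropy weighting are assumptions, not the authors' formulation).

```python
import numpy as np

def token_entropy(probs):
    # Shannon entropy (nats) of each token's predictive distribution.
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def zvp_advantages(token_probs, all_correct, max_entropy):
    """Hypothetical entropy-modulated advantages for a zero-variance prompt.

    token_probs: (T, V) array of per-token distributions for one response.
    all_correct: True if every response in the sampled group was correct.
    max_entropy: normalizer, e.g. log(vocab_size).

    Standard GRPO would assign zero advantage here (no within-group reward
    variance); this sketch instead rewards correctness (+) or penalizes
    errors (-), scaled per token by the model's uncertainty.
    """
    h = token_entropy(token_probs)   # (T,) per-token uncertainty
    w = h / max_entropy              # normalize to [0, 1]
    sign = 1.0 if all_correct else -1.0
    return sign * w
```

Under this toy shaping, confident tokens receive a near-zero update while uncertain tokens carry most of the signal, which is one plausible way to realize "fine-grained discrimination" when response-level rewards are indistinguishable.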

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model's responses to the same input differ in correctness, while ignoring those where all responses receive the same reward, so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
Problem

Research questions and friction points this paper is trying to address.

Exploiting zero-variance prompts in LLM reinforcement learning
Extracting learning signals from uniformly rewarded responses
Improving reasoning accuracy via entropy-guided advantage shaping
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL-ZVP algorithm extracts learning signals from zero-variance prompts
Modulates feedback using token-level characteristics for nuanced signals
Directly rewards correctness and penalizes errors without contrasting responses