AI Summary
Reinforcement learning from verifiable rewards typically relies on sparse, all-or-nothing rewards, which makes it ill-suited to partially verifiable tasks such as multi-requirement instruction following. This work proposes Soft-RLVR, a framework that decomposes each prompt into a checklist of atomic requirements and uses an LLM verifier to score responses item by item, yielding a fine-grained soft reward. We further introduce Soft-SVeRL, a self-verifying variant in which the policy model also acts as its own verifier. We formalize the trade-off that checklist rewards introduce (averaging item-level judgments can reduce verifier noise, while partial credit can reward incomplete responses) and show that self-verification needs explicit stabilization to prevent reward inflation. In a controlled instruction-following setting, our approach improves IFEval by up to 11.1 points using only learned verifier rewards, and our experiments show that checklist quality, verifier quality, and reward stabilization all affect downstream results.
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, responses may satisfy some but not all of them, or no single reference answer might exist. We introduce Soft-RLVR, a framework for reinforcement learning from decomposed, learned verification signals. Soft-RLVR converts each prompt into a checklist of atomic requirements, scores candidate responses item by item with an LLM verifier, and trains on the resulting soft reward. Checklist-based rewards turn sparse pass/fail supervision into a denser partial-credit signal, but they also introduce a tradeoff: averaging item-level judgments can reduce verifier noise, while partial credit can reward incomplete responses. We formalize this tradeoff and identify conditions under which checklist-based verification gives a more reliable RL training signal than holistic verification. We further introduce Soft-SVeRL, a self-verifying variant of Soft-RLVR in which the policy also acts as the verifier. We show that self-verification is prone to reward inflation from overly permissive self-judgments, and that explicit stabilization is needed to prevent this collapse. In a controlled instruction-following setting with rule-based ground-truth evaluation, checklist-based Soft-RLVR improves IFEval by up to 11.1 points using only learned verifier rewards. Our experiments further show that verifier quality and checklist quality both affect downstream RL outcomes, and that explicit stabilization is essential for effective self-verification.
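To make the contrast between holistic and checklist-based rewards concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the soft reward is the plain mean of per-item scores in [0, 1], and it swaps the LLM verifier for a hypothetical rule-based stand-in (`toy_verifier`) so the example runs without a model. The function names, the `contains:`/`max_words:` item syntax, and the 0.5 pass threshold are all illustrative assumptions.

```python
from typing import Callable, Sequence

# A verifier maps (response, checklist item) to a score in [0, 1].
Verifier = Callable[[str, str], float]


def item_scores(response: str, checklist: Sequence[str], verify: Verifier) -> list[float]:
    """Score a response against each checklist item, clamped to [0, 1]."""
    if not checklist:
        raise ValueError("checklist must contain at least one item")
    return [min(1.0, max(0.0, verify(response, item))) for item in checklist]


def soft_reward(response: str, checklist: Sequence[str], verify: Verifier) -> float:
    """Partial-credit reward: mean of item-level scores (assumed aggregation)."""
    scores = item_scores(response, checklist, verify)
    return sum(scores) / len(scores)


def holistic_reward(response: str, checklist: Sequence[str], verify: Verifier,
                    threshold: float = 0.5) -> float:
    """All-or-nothing reward: 1.0 only if every item passes the threshold."""
    scores = item_scores(response, checklist, verify)
    return 1.0 if all(s >= threshold for s in scores) else 0.0


def toy_verifier(response: str, item: str) -> float:
    """Hypothetical rule-based stand-in for an LLM verifier."""
    kind, _, arg = item.partition(":")
    if kind == "contains":
        return 1.0 if arg.lower() in response.lower() else 0.0
    if kind == "max_words":
        return 1.0 if len(response.split()) <= int(arg) else 0.0
    raise ValueError(f"unknown checklist item: {item!r}")


if __name__ == "__main__":
    checklist = ["contains:hello", "contains:world", "max_words:3"]
    for response in ("hello there friend", "hello world"):
        print(response,
              soft_reward(response, checklist, toy_verifier),
              holistic_reward(response, checklist, toy_verifier))
```

Under these assumptions, a response that satisfies two of three items receives a soft reward of 2/3 but a holistic reward of 0. This is the denser signal the abstract describes, and it also shows the risk it names: an incomplete response still earns most of the credit.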