When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses a trade-off in Reinforcement Learning with Verifiable Rewards (RLVR): computing rewards from ground-truth labels incurs high annotation costs, while unsupervised approaches that train on pseudo-labels are prone to training instability. In addition, samples differ substantially in how much a ground-truth label helps. The paper proposes the RLAVR framework and introduces the Corrective Advantage Gap (CAG) metric to assess the supervisory value of individual samples. Building on this, it devises Correction-Aware Reliability Estimation (CARE), a strategy that actively selects a small subset of high-value samples for ground-truth labeling and combines them with pseudo-labels during training. Under constrained annotation budgets, the approach is reported to improve both training stability and task performance, with consistent results across multiple domains, model families, and model scales.

📝 Abstract

Large Language Models (LLMs) have achieved remarkable advances in reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, and acquiring them is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, samples differ in how much value a ground-truth annotation provides. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at https://github.com/Lumina04/CARE.
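
The general shape of budgeted label acquisition described in the abstract can be illustrated with a minimal sketch. This is not the paper's CAG metric or CARE policy, whose definitions are not given here. As a stand-in, it ranks samples by majority-vote agreement among sampled answers and queries the least-agreed ones, a common uncertainty baseline. All names (`majority_vote`, `build_training_labels`, `verifiable_reward`) and the dictionary-based oracle are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def majority_vote(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of rollouts agreeing with it."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def build_training_labels(
    rollouts: Dict[str, List[str]],
    oracle: Dict[str, str],
    budget: int,
) -> Tuple[Dict[str, str], List[str]]:
    """Mix ground-truth labels (for queried samples) with pseudo-labels (for the rest).

    `oracle` stands in for a human annotator; only `budget` entries are read from it.
    """
    pseudo = {qid: majority_vote(answers) for qid, answers in rollouts.items()}

    # Stand-in acquisition score (an assumption, NOT the paper's CARE policy):
    # query the samples whose rollouts agree least; ties are broken by sample id.
    ranked = sorted(pseudo, key=lambda qid: (pseudo[qid][1], qid))
    queried = ranked[:budget]

    # Start from pseudo-labels, then overwrite the queried samples with ground truth.
    labels = {qid: label for qid, (label, _) in pseudo.items()}
    for qid in queried:
        labels[qid] = oracle[qid]
    return labels, queried


def verifiable_reward(answer: str, label: str) -> float:
    """Binary reward: 1.0 if the answer matches the training label, else 0.0."""
    return 1.0 if answer == label else 0.0
```

With rollouts such as `{"q1": ["4", "4", "4", "4"], "q2": ["7", "9", "7", "8"]}` and a budget of one, this sketch would query `q2`, whose majority answer has the lowest agreement, and keep the pseudo-label for `q1`. Agreement is a form of the model's own confidence, which the paper's title suggests can mislead, so this baseline should be read only as a placeholder for a better acquisition policy.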

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning

Verifiable Rewards

Label Acquisition

Training Collapse

Annotation Cost

Innovation

Methods, ideas, or system contributions that make the work stand out.

Active Label Acquisition

Reinforcement Learning with Verifiable Rewards

Corrective Advantage Gap