Single-Rollout Hidden-State Dynamics for Training-Free RLVR Data Selection

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses RLVR data selection when neither training-time signals nor verifiable rewards or ground-truth answers are available for the candidate pool, a requirement that limits most existing selection methods. The authors propose SHIFT, which runs a single deterministic reasoning rollout per candidate and measures the start-to-end change in the model's hidden states, termed the reasoning-induced representation shift (RIRS), as a training-free proxy for sample utility. SHIFT combines the RIRS magnitude with a quality-weighted farthest-first CoreSet procedure to build compact subsets with good coverage. The method requires neither reward annotations nor ground-truth answers. Under ultra-low selection budgets, it consistently outperforms training-free diversity and difficulty/uncertainty baselines on mathematical reasoning and medical question-answering benchmarks, improving both in-domain accuracy and transfer to harder evaluation settings.

📝 Abstract

Reinforcement learning with verifiable rewards (RLVR) can yield large reasoning gains from very few training instances, yet its strong sensitivity to which instances are used makes data selection a central bottleneck. Most existing selection pipelines rely on training-time optimization signals and/or require access to verifiable rewards or ground-truth answers over large candidate pools, which is costly and often infeasible in specialized domains. We study RLVR data selection in a setting where selection must be performed before any RL training and without labels or reward evaluation on the full pool. We propose SHIFT, a one-shot, training-free selector based solely on inference-time hidden-state dynamics. For each candidate instance, SHIFT runs a single deterministic reasoning rollout and computes a reasoning-induced representation shift (RIRS) as the start-to-end hidden-state delta. SHIFT uses the RIRS magnitude as a lightweight proxy for instance utility and enforces coverage via a quality-weighted farthest-first CoreSet procedure in an RIRS-augmented feature space, producing compact subsets that scale to large unlabeled pools. Across mathematical reasoning and medical QA benchmarks under ultra-low budgets, SHIFT consistently outperforms training-free diversity and difficulty/uncertainty baselines, improving both in-domain accuracy and transfer to harder evaluation settings. Ablations show that RIRS-based coverage and quality-weighting contribute complementary gains, and analyses indicate that RIRS is not explained by simple input/output length statistics. Code is available at github.com/JianghaoWu/SHIFT.
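
To make the selection step described in the abstract more concrete, here is a minimal sketch of an RIRS-style score followed by a quality-weighted farthest-first pick. It is not the authors' implementation. The function names, the multiplicative score (quality times distance to the nearest selected point), and the feature layout (start-of-rollout state concatenated with the shift) are all assumptions made for illustration. The hidden states are taken as precomputed arrays of shape (n, d), one row per candidate.

```python
import numpy as np


def rirs(h_start, h_end):
    """Start-to-end hidden-state delta for each candidate (one row each)."""
    return h_end - h_start


def quality_weighted_farthest_first(features, quality, budget):
    """Greedy farthest-first selection, with distances scaled by quality.

    Assumption: the score is quality * distance-to-nearest-selected.
    The paper may weight these terms differently.
    """
    n = features.shape[0]
    budget = min(budget, n)
    # Seed with the highest-quality candidate (an illustrative choice).
    selected = [int(np.argmax(quality))]
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < budget:
        score = quality * min_dist
        score[selected] = -np.inf  # never re-pick a selected candidate
        nxt = int(np.argmax(score))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected


def select(h_start, h_end, budget):
    """Hypothetical end-to-end selector over precomputed hidden states."""
    delta = rirs(h_start, h_end)
    quality = np.linalg.norm(delta, axis=1)  # RIRS magnitude as utility proxy
    # Assumption: the "RIRS-augmented" space is [start state, shift].
    features = np.concatenate([h_start, delta], axis=1)
    return quality_weighted_farthest_first(features, quality, budget)
```

In practice, `h_start` and `h_end` would come from a single deterministic rollout per candidate. Which layer and which token positions define "start" and "end" is not specified in this summary, so those choices are left to the caller.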

Problem

Research questions and friction points this paper is trying to address.

RLVR

data selection

training-free

unlabeled pool

instance utility

Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free

hidden-state dynamics

data selection