🤖 AI Summary
This work addresses the inefficiency of existing data selection methods in reinforcement learning, which rely solely on difficulty-based heuristics and neglect epistemic uncertainty. To overcome this limitation, the authors propose InSight, a novel approach that decomposes the uncertainty in task success rates—estimated via a Bayesian latent variable model—into difficulty-related and evidence-dependent components. Building upon this decomposition, InSight introduces a stable data scoring mechanism grounded in a weighted mutual information objective. Notably, this is the first method to integrate weighted mutual information with Bayesian belief modeling for data selection in reinforcement learning with verifiable rewards (RLVR), naturally accommodating multi-trajectory scenarios. Experimental results demonstrate that InSight achieves average performance gains of +1.41 on planning and mathematical reasoning benchmarks and +1.01 on general reasoning tasks, while accelerating training by up to 2.2× with negligible computational overhead.
📝 Abstract
Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting epistemic uncertainty arising from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of datapoints' success rather than noisy sampled outcomes, and naturally extends to multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathematics benchmarks, a +1.01 improvement on general reasoning, and up to ~2.2× acceleration, with negligible additional computational overhead.
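The decomposition described in the abstract can be illustrated with a minimal Beta-Bernoulli sketch: under a Beta belief over a datapoint's latent success rate, the posterior variance factors into a difficulty-dependent term (peaked at intermediate success rates) and an evidence-dependent term (large when few rollouts have been observed). All function names, priors, and the specific variance-based score below are illustrative assumptions for exposition, not the paper's actual InSight scoring rule.

```python
# Illustrative Beta-Bernoulli sketch of the difficulty/evidence decomposition.
# Names, the uniform prior, and the variance-based score are assumptions,
# not the paper's implementation.

def beta_posterior(successes, failures, alpha0=1.0, beta0=1.0):
    """Update a Beta(alpha0, beta0) prior with observed rollout outcomes."""
    return alpha0 + successes, beta0 + failures

def acquisition_score(alpha, beta):
    """Posterior variance of the latent success rate.

    Var = mu * (1 - mu) / (n + 1) factors into:
      - a difficulty component mu * (1 - mu), maximized at mu = 0.5
        (intermediate success rates), and
      - an evidence component 1 / (n + 1), large under limited evidence.
    """
    n = alpha + beta
    mu = alpha / n                  # mean belief of success
    difficulty = mu * (1.0 - mu)    # difficulty-dependent component
    evidence = 1.0 / (n + 1.0)      # evidence-dependent component
    return difficulty * evidence

# Two datapoints with the same mean belief (0.5) but different evidence:
# 2/4 successful rollouts vs. 50/100. A difficulty-only heuristic rates
# them identically; the decomposed score prefers the under-explored one.
few = beta_posterior(2, 2)      # few rollouts observed
many = beta_posterior(50, 50)   # many rollouts observed
assert acquisition_score(*few) > acquisition_score(*many)
```

Because the score is computed from the mean belief of the posterior rather than from individual sampled outcomes, it stays stable across noisy rollouts, and multi-rollout batches fold in naturally as additional success/failure counts in the Beta update.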