🤖 AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), data selection typically relies on heuristics, lacking theoretical guarantees and generalizability. Method: This paper proposes the first influence-function-based off-policy data selection method for RLVR, introducing influence function theory to RLVR and combining off-policy influence estimation with sparse random projection to enable efficient, scalable influence scoring. It further designs CROPI, a multi-stage dynamic curriculum learning framework that iteratively selects the samples exerting the largest influence on the current policy gradient. Contribution/Results: CROPI requires no online interaction; on a 1.5B-parameter model it achieves a 2.66× speedup in training steps, attaining full-data performance while using only 10% of the dataset per stage. It significantly improves data efficiency and convergence speed, establishing the first theoretically grounded dynamic curriculum learning paradigm for RLVR.
📝 Abstract
Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic, lacking theoretical guarantees and generalizability. This work proposes a theoretically grounded approach that uses influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of the policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence from pre-collected offline trajectories. Furthermore, to handle the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training: on a 1.5B model, it achieves a 2.66× step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.