🤖 AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), data selection typically relies on heuristics, lacking theoretical guarantees and generalizability. Method: This paper proposes the first influence-function-based off-policy data selection method for RLVR, introducing influence function theory to RLVR and combining off-policy influence estimation with sparse random projection to enable efficient, scalable influence scoring. It further designs CROPI, a multi-stage dynamic curriculum learning framework that iteratively selects the samples exerting the largest influence on the current policy gradient. Contribution/Results: CROPI requires no online interaction; on a 1.5B-parameter model it achieves a 2.66× speedup in training steps, attaining full-data performance while using only 10% of the dataset per stage. It significantly improves data efficiency and convergence speed, establishing the first theoretically grounded dynamic curriculum learning paradigm for RLVR.
📝 Abstract
Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic, lacking theoretical guarantees and generalizability. This work proposes a theoretically grounded approach that uses influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of the policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence from pre-collected offline trajectories. Furthermore, to handle the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training: on a 1.5B model, it achieves a 2.66× step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.