Data-Efficient RLVR via Off-Policy Influence Guidance

📅 2025-10-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), data selection typically relies on heuristics that lack theoretical guarantees and generalizability. Method: This paper proposes the first influence-function-based off-policy data selection method for RLVR, introducing influence-function theory to RLVR and combining off-policy influence estimation with sparse random projection to make influence scoring efficient and scalable. It further designs CROPI, a multi-stage dynamic curriculum learning framework that iteratively selects the samples exerting the largest influence on the current policy gradient. Contribution/Results: CROPI requires no online interaction for influence estimation; on a 1.5B-parameter model it achieves a 2.66× speedup in training steps, matching full-data performance while using only 10% of the dataset per stage. It substantially improves data efficiency and convergence speed, establishing the first theoretically grounded dynamic curriculum learning paradigm for RLVR.
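The core selection idea in the summary can be sketched in a few lines: score each sample by a first-order influence proxy (alignment of its gradient with the current policy gradient) and keep the top fraction per stage. This is an illustrative sketch under assumed shapes, not the paper's exact estimator; `influence_scores` and `select_top_fraction` are hypothetical names.

```python
import numpy as np

def influence_scores(sample_grads, policy_grad):
    """First-order influence proxy: how well each sample's gradient
    aligns with the current policy-objective gradient."""
    return sample_grads @ policy_grad

def select_top_fraction(scores, frac=0.10):
    """Keep the top `frac` most influential samples for this stage."""
    k = max(1, int(frac * len(scores)))
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 64))   # per-sample (projected) gradients
pg = rng.normal(size=64)             # current policy gradient
chosen = select_top_fraction(influence_scores(grads, pg), frac=0.10)
```

Re-running the scoring each stage against the updated policy gradient is what makes the curriculum dynamic: the "most influential 10%" changes as the policy moves.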

πŸ“ Abstract
Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically-grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop extbf{C}urriculum extbf{R}L with extbf{O}ff- extbf{P}olicy ext{I}nfluence guidance ( extbf{CROPI}), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
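The abstract's key trick is estimating influence from pre-collected trajectories instead of fresh rollouts. The generic mechanism behind such off-policy estimation is importance weighting: reweight each offline sample by the ratio of current-policy to behavior-policy probability. Below is a toy importance-weighted REINFORCE gradient for a softmax policy over a handful of actions; it illustrates the general idea only, not the paper's LLM-scale estimator, and all names here are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def off_policy_pg_estimate(logits, actions, behavior_probs, rewards):
    """Importance-weighted REINFORCE estimate of the policy gradient
    (w.r.t. the logits of a toy softmax policy), computed from
    trajectories collected under a different behavior policy."""
    pi = softmax(logits)
    grad = np.zeros_like(logits)
    for a, mu_a, r in zip(actions, behavior_probs, rewards):
        w = pi[a] / mu_a           # importance ratio pi(a) / mu(a)
        glogp = -pi.copy()
        glogp[a] += 1.0            # d log pi(a) / d logits
        grad += w * r * glogp
    return grad / len(actions)

logits = np.zeros(4)               # uniform toy policy over 4 actions
actions = [0, 1, 1, 3]             # pre-collected offline actions
behavior = [0.4, 0.3, 0.3, 0.2]    # mu(a) under the behavior policy
rewards = [1.0, 0.0, 1.0, 0.0]     # verifiable 0/1 rewards
g = off_policy_pg_estimate(logits, actions, behavior, rewards)
```

An influence score for a sample would then compare that sample's gradient against such an off-policy estimate of the objective gradient, avoiding any new rollouts.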
Problem

Research questions and friction points this paper is trying to address.

Enhancing data selection efficiency in Reinforcement Learning with Verifiable Rewards
Overcoming computational costs of influence estimation for large language models
Developing theoretically-grounded data selection methods beyond heuristic approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-policy influence estimation using pre-collected trajectories
Sparse random projection for gradient dimensionality reduction
Multi-stage curriculum RL with iterative data selection
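The dimensionality-reduction step listed above can be sketched with an Achlioptas-style sparse random projection, a standard construction that preserves inner products in expectation while keeping most of the projection matrix zero. The specific sparsity pattern and dimensions here are illustrative assumptions; the paper's construction may differ.

```python
import numpy as np

def sparse_projection(d, k, s=3, seed=0):
    """Achlioptas-style sparse random projection R^d -> R^k: entries are
    +/- sqrt(s/k) with probability 1/(2s) each, and 0 otherwise, so
    roughly (1 - 1/s) of the matrix is zero while norms and inner
    products are preserved in expectation."""
    rng = np.random.default_rng(seed)
    vals = rng.choice([1.0, 0.0, -1.0], size=(d, k),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return np.sqrt(s / k) * vals

d, k = 4096, 256                   # full vs. projected gradient dimension
P = sparse_projection(d, k)
g = np.random.default_rng(1).normal(size=d)
g_small = g @ P                    # compressed gradient for scoring/storage
```

Storing the 256-dimensional projections instead of full gradients is what makes per-sample influence scoring tractable at LLM scale.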
🔎 Similar Papers
Erle Zhu
CoAI Group, Tsinghua University
Dazhi Jiang
CoAI Group, Tsinghua University
Yuan Wang
CoAI Group, Tsinghua University
Xujun Li
CoAI Group, Tsinghua University
Jiale Cheng
CoAI Group, Tsinghua University
Yuxian Gu
Tsinghua University
Natural Language Processing
Yilin Niu
Tsinghua University
Natural Language Processing
Aohan Zeng
Tsinghua University
Large Language Models, Natural Language Processing
Jie Tang
UW Madison
Computed Tomography
Minlie Huang
CoAI Group, Tsinghua University
Hongning Wang
Associate Professor, Department of Computer Science and Technology, Tsinghua University
Machine Learning, Information Retrieval, Large Language Models