🤖 AI Summary
To address performance optimization under limited data budgets in large language model (LLM) fine-tuning, this work formulates data selection as a tractable Markov decision process (MDP) — the first such formulation — and introduces a reinforcement learning (RL) framework for dynamic, adaptive data filtering. Methodologically, a lightweight surrogate model generates scalable reward signals to guide RL algorithms (e.g., PPO) in autonomously learning optimal sampling policies. The core contribution is the design of the first end-to-end trainable MDP-based data selection paradigm, balancing theoretical solvability with practical efficiency. Experiments across four downstream tasks demonstrate that our approach achieves comparable or superior performance to full-data fine-tuning using only 5% of the training data, with up to a 10.8-percentage-point accuracy gain and up to a 2× reduction in training time.
📝 Abstract
Data selection for finetuning Large Language Models (LLMs) can be framed as a budget-constrained optimization problem: maximizing a model's downstream performance under a strict training data budget. Solving this problem is generally intractable, and existing approximate approaches are pretraining-oriented and transfer poorly to the fine-tuning setting. We reformulate this problem as a tractable Markov Decision Process (MDP) and train agents using various Reinforcement Learning (RL) methods to learn optimal data selection policies, guided by an efficient, proxy-model-based reward signal. Across four datasets, training on a $5%$ subset selected by our approach matches or outperforms fine-tuning on the full dataset by up to $10.8$ accuracy points, while cutting wall-clock training time by up to $2 imes$, highlighting the promise of RL-guided data selection.