RL-Guided Data Selection for Language Model Finetuning

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

career value

252K/year

🤖 AI Summary

To address performance optimization under limited data budgets in large language model (LLM) fine-tuning, this work formulates data selection as a tractable Markov decision process (MDP) — the first such formulation — and introduces a reinforcement learning (RL) framework for dynamic, adaptive data filtering. Methodologically, a lightweight surrogate model generates scalable reward signals to guide RL algorithms (e.g., PPO) in autonomously learning optimal sampling policies. The core contribution is the design of the first end-to-end trainable MDP-based data selection paradigm, balancing theoretical solvability with practical efficiency. Experiments across four downstream tasks demonstrate that our approach achieves comparable or superior performance to full-data fine-tuning using only 5% of the training data, with up to a 10.8-percentage-point accuracy gain and up to a 2× reduction in training time.

Technology Category

Application Category

📝 Abstract

Data selection for finetuning Large Language Models (LLMs) can be framed as a budget-constrained optimization problem: maximizing a model's downstream performance under a strict training data budget. Solving this problem is generally intractable, and existing approximate approaches are pretraining-oriented and transfer poorly to the fine-tuning setting. We reformulate this problem as a tractable Markov Decision Process (MDP) and train agents using various Reinforcement Learning (RL) methods to learn optimal data selection policies, guided by an efficient, proxy-model-based reward signal. Across four datasets, training on a $5%$ subset selected by our approach matches or outperforms fine-tuning on the full dataset by up to $10.8$ accuracy points, while cutting wall-clock training time by up to $2 imes$, highlighting the promise of RL-guided data selection.

Problem

Research questions and friction points this paper is trying to address.

Optimizing data selection for LLM fine-tuning under budget constraints

Reformulating data selection as tractable Markov Decision Process

Using RL methods to improve accuracy while reducing training time

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates data selection as MDP problem

Uses RL agents to learn selection policies

Employs proxy-model reward for efficient training

🔎 Similar Papers

Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining