Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing data selection methods typically fix the selection ratio throughout training and overlook how dynamically adjusting data volume affects training efficiency and generalization. The authors propose PODS, a plug-and-play oscillatory data-volume scheduling framework that extends data selection from “what to select” to “how much to select.” While keeping to the target selection ratio, PODS alternates between low-ratio regularization phases and high-ratio recovery phases, balancing implicit regularization against optimization fidelity. The framework is lightweight, task-agnostic, and compatible with both static and dynamic selection strategies and with different model architectures. Experiments show that PODS reduces ImageNet-1k training cost by 50% while improving accuracy, and accelerates instruction tuning of large language models by over 2× without performance degradation.

📝 Abstract

Data selection accelerates training by identifying representative training data while preserving model performance. However, existing methods mainly focus on designing sample-importance criteria, i.e., deciding what to select, while typically fixing the selected data volume as the target ratio throughout training. Thus, they are often dynamic in sample identity but static in data volume. In this work, we revisit data selection from an optimization perspective and show that selected-data training induces an implicit regularization effect modulated by the instantaneous selection ratio. This reveals a key trade-off: lower ratios amplify selection-induced regularization, whereas higher ratios preserve data coverage and optimization fidelity. Motivated by this insight, we propose PODS, a Plug-and-play Oscillatory Data-volume Scheduling framework. Rather than introducing another sample-scoring metric, PODS serves as a lightweight module that dynamically schedules how much data to select over training. Under the target selection ratio, PODS alternates between low-ratio regularization phases and high-ratio recovery phases to exploit selection-induced regularization without sacrificing optimization stability. With its lightweight, ratio-level, and task-agnostic design, PODS is compatible with existing static and dynamic selection methods and broadly applicable across training paradigms. Experiments across various datasets, architectures, and tasks show that PODS consistently improves the efficiency-generalization trade-off, e.g., reducing ImageNet-1k training cost by 50% with improved accuracy and accelerating LLM instruction tuning by over 2x without performance degradation.
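
The alternation between low-ratio and high-ratio phases can be illustrated with a minimal sketch. The abstract does not specify the schedule's shape, so everything below is an assumption made for illustration: a square wave with equal-length phases whose average equals the target ratio, the hypothetical names `oscillating_ratio` and `select_top`, and the parameters `amplitude` and `period`. The top-k selector merely stands in for whatever sample-scoring method the schedule would be plugged into; it is not the scoring used by the authors.

```python
from typing import List, Sequence


def oscillating_ratio(epoch: int, target: float = 0.5,
                      amplitude: float = 0.2, period: int = 10) -> float:
    """Hypothetical square-wave schedule (an assumption, not the paper's rule).

    The first half of each period uses a low ratio (target - amplitude),
    the second half a high ratio (target + amplitude). With equal-length
    phases, the average over a full period equals the target ratio.
    """
    if period < 2 or period % 2 != 0:
        raise ValueError("period must be an even integer >= 2")
    low, high = target - amplitude, target + amplitude
    if not (0.0 < low and high <= 1.0):
        raise ValueError("target +/- amplitude must stay within (0, 1]")
    in_low_phase = (epoch % period) < period // 2
    return low if in_low_phase else high


def select_top(scores: Sequence[float], ratio: float) -> List[int]:
    """Stand-in selector: keep the indices of the highest-scoring samples."""
    keep = max(1, round(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Toy importance scores for ten samples (made-up values).
    scores = [0.9, 0.1, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6, 0.5, 0.05]
    for epoch in range(4):
        ratio = oscillating_ratio(epoch, target=0.5, amplitude=0.2, period=2)
        print(epoch, ratio, select_top(scores, ratio))
```

Because the schedule only returns a ratio, it stays independent of how samples are scored, which is the sense in which a ratio-level module can be combined with an existing selection method.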

Problem

Research questions and friction points this paper is trying to address.

data selection

data-volume scheduling

training efficiency

regularization

optimization fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

data selection

oscillatory scheduling

implicit regularization