Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models

📅 2025-08-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of optimizing data mixture ratios in supervised fine-tuning (SFT) across heterogeneous source datasets. We formulate mixture ratio optimization as a differentiable task aimed at minimizing validation loss—a novel framing in SFT. Our method integrates effective data transfer modeling, fine-grained loss parameterization, and fine-tuning scaling law analysis; it estimates key parameters via lightweight pre-experiments and analytically derives optimal mixing weights. We provide theoretical guarantees that the solution converges to an approximate global optimum. Empirically, our approach achieves validation loss within 0.66% (domain-averaged) of exhaustive grid search across all domains and downstream tasks. Moreover, reweighting standard SFT datasets using our optimized ratios consistently improves downstream performance. The core contribution is the first scalable, interpretable, and trial-free framework for data mixture optimization in SFT—eliminating the need for costly hyperparameter sweeps while preserving theoretical rigor and empirical efficacy.
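The summary sketches the core recipe: fit per-domain loss parameters from lightweight small-scale runs, then derive the mixing weights analytically rather than by grid search. The paper's exact parameterization is not reproduced here; as an illustrative sketch, assume each domain's validation loss follows a simplified power law `L_i = E_i + A_i / w_i**alpha` with fitted constants `E_i`, `A_i` and exponent `alpha` (all hypothetical). Minimizing the summed loss subject to the weights summing to one then has a closed form, `w_i ∝ A_i**(1/(1+alpha))`:

```python
def mixture_loss(weights, A, E, alpha):
    """Total validation loss under a simplified per-domain power law:
    L_i = E_i + A_i / w_i**alpha.  (Illustrative model, not the paper's exact form.)"""
    return sum(e + a / (w ** alpha) for w, a, e in zip(weights, A, E))

def optimal_mixture_weights(A, alpha):
    """Closed-form minimizer of sum_i A_i / w_i**alpha subject to sum(w) == 1.
    The Lagrange conditions give w_i proportional to A_i**(1/(1 + alpha))."""
    p = [a ** (1.0 / (1.0 + alpha)) for a in A]
    s = sum(p)
    return [x / s for x in p]

# Hypothetical fitted constants for three domains (e.g. code, math, chat).
A = [4.0, 1.0, 2.0]
E = [0.5, 0.7, 0.6]
alpha = 1.0
weights = optimal_mixture_weights(A, alpha)
```

With `alpha = 1` this reduces to `w_i ∝ sqrt(A_i)`; since the objective is convex on the simplex, any perturbation of these weights increases `mixture_loss`, which is the property the paper verifies against exhaustive grid search at much larger scale.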

📝 Abstract
Optimizing data mixtures for supervised fine-tuning (SFT) of large language models (LLMs) is critical for developing general-purpose models, yet this area remains underexplored. In this paper, we frame data mixing as an optimization problem and introduce a novel method designed to minimize validation loss. Our approach parametrizes the loss by modeling effective data transferred and leveraging scaling laws for fine-tuning. By experimenting with various small-scale data mixtures, we fit these parameters and derive the optimal weights. We provide both mathematical proofs and empirical results demonstrating that our algorithm achieves excellent overall and individual performance across all domains. Through controlled experiments, we show that models trained with our optimized weights perform on par with those using optimal weights determined via grid search, with per-domain loss only 0.66% higher than the best domain loss from grid search on average. Additionally, we show that reweighting popular SFT datasets using our method improves both validation loss and downstream performance. Finally, we discuss how our method can generalize to guide data selection for domain-specific models and provide insights into SFT.
Problem

Research questions and friction points this paper is trying to address.

Optimizing data mixtures for LLM fine-tuning
Minimizing validation loss without costly hyperparameter sweeps
Improving domain-specific model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes data mixtures using validation loss minimization
Parametrizes loss via effective data transfer modeling
Applies scaling laws for fine-tuning optimization