MixMin: Finding Data Mixtures via Convex Minimization

📅 2025-02-14
🤖 AI Summary
Determining optimal mixing ratios for multi-source data (e.g., pre-training corpora) is challenging because the relationship between mixture weights and downstream performance is unknown. Method: the paper formalizes data mixing as a bi-level objective — the best mixture is the one whose trained model minimizes downstream negative log-likelihood (NLL) — and observes that this objective becomes convex as the model class grows. Building on this, it proposes MixMin, a scalable gradient-based method for minimizing the convex objective. Results: for a Pythia-410M model trained on 8.2B tokens, MixMin improves the data mixture with less than 0.2% additional compute, yielding 1–5% relative NLL improvements on PIQA, ARC Easy, SciQ, and OpenWebMath; when mixing bioassay data to train an XGBoost model, average precision improves by 0.03–0.15. Crucially, mixtures found with smaller models also improve training of larger models, suggesting MixMin mixtures are scale-invariant.

📝 Abstract
Modern machine learning pipelines are increasingly combining and mixing data from diverse and disparate sources, e.g., pre-training large language models. Yet, finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-410M model trained on 8.2B tokens, resulting in a 1-5% relative improvement in negative log likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures for smaller models improved training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements to average precision scores of 0.03-0.15.
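The bi-level formulation described in the abstract can be written out explicitly. The notation below is ours, not taken from the paper: $w$ are the mixture weights on the simplex over $k$ sources $D_1, \dots, D_k$, $\ell$ is the training loss, and $L_{\text{down}}$ is the downstream objective.

```latex
% Bi-level data mixing objective: choose mixture weights w so that the
% model trained on the w-weighted mixture minimizes the downstream loss.
\min_{w \in \Delta^{k-1}} \; L_{\text{down}}\big(\theta^*(w)\big)
\quad \text{s.t.} \quad
\theta^*(w) \in \arg\min_{\theta} \;
\mathbb{E}_{x \sim \sum_{i=1}^{k} w_i D_i} \big[\ell(\theta, x)\big]
```

The paper's central observation is that as the model class becomes larger, the outer objective becomes convex in $w$, which is what makes a simple gradient-based method like MixMin viable.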
Problem

Research questions and friction points this paper is trying to address.

Optimizing data mixtures for machine learning
Convex minimization in large model classes
Improving model performance with minimal compute
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convex minimization for data mixing
Gradient-based optimization approach
Scale-invariant data mixture improvement
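To make the convex-minimization idea concrete, here is a minimal toy sketch. It assumes we already have, for each source, a model's likelihoods on a downstream evaluation set, and it minimizes the NLL of the convex mixture of those likelihoods over the weight simplex. The function name `mixmin_weights`, the exponentiated-gradient update, and the hyperparameters are illustrative choices of ours, not the paper's exact algorithm.

```python
import numpy as np

def mixmin_weights(P, n_steps=500, lr=0.5):
    """Minimize the mixture NLL  L(w) = -mean_i log(P[i] @ w)  over the simplex.

    P: (n_samples, n_sources) array; P[i, j] is the likelihood that a model
       trained on source j assigns to downstream example i.
    Uses exponentiated-gradient updates, which keep w a valid mixture
    (nonnegative, summing to 1) at every step.
    """
    n_sources = P.shape[1]
    w = np.full(n_sources, 1.0 / n_sources)   # start from the uniform mixture
    for _ in range(n_steps):
        mix = P @ w                                  # mixture likelihood per example
        grad = -(P / mix[:, None]).mean(axis=0)      # gradient of the NLL w.r.t. w
        w = w * np.exp(-lr * grad)                   # multiplicative update
        w = w / w.sum()                              # renormalize onto the simplex
    return w
```

Because the objective is convex in `w`, this simple first-order scheme finds the global optimum; sources whose models explain the downstream data well receive higher weight.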