R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training

📅 2025-05-01
🤖 AI Summary
Existing data mixing strategies rely on predefined, coarse-grained domains (e.g., data sources or task types). These domains can miss fine-grained semantic distinctions, and the cost of optimizing the mixture becomes prohibitive as the number of domains grows. To address this, we propose R&B, a framework with two steps. Regroup re-partitions the training data by semantic similarity into finer-grained domains, so the method does not depend on predetermined domain labels. Balance adapts the data proportions using a Gram matrix of domain gradients that are obtained throughout training, so no additional compute is spent on evaluation information such as losses or gradients. We analyze the approach under standard regularity conditions and provide theoretical insights into why it is effective compared to non-adaptive mixing. Evaluated on five diverse datasets spanning natural language, reasoning, and multimodal tasks, R&B matches or exceeds state-of-the-art data mixing strategies with as little as 0.01% additional compute overhead.

📝 Abstract
Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity (Regroup) to create finer-grained domains, and efficiently optimizes the data composition (Balance) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify R&B's effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of R&B on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01% additional compute overhead, R&B matches or exceeds the performance of state-of-the-art data mixing strategies.
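The Balance step can be pictured with a small sketch. The abstract only states that R&B uses a Gram matrix induced by domain gradients; the update rule below (a multiplicative-weights step driven by the row sums of that matrix) is an illustrative assumption, not the paper's actual procedure. The names `gram_matrix`, `update_mixture`, and `step_size` are hypothetical.

```python
import math

def gram_matrix(domain_grads):
    """Return G with G[i][j] = <g_i, g_j> for per-domain gradient vectors."""
    return [[sum(a * b for a, b in zip(gi, gj)) for gj in domain_grads]
            for gi in domain_grads]

def update_mixture(weights, gram, step_size=1.0):
    """Hypothetical update: upweight domains whose gradient aligns with the others.

    ASSUMPTION: scoring a domain by the row sum of the Gram matrix and applying
    an exponentiated update is a stand-in, not R&B's published rule.
    """
    scores = [sum(row) for row in gram]
    top = max(scores)  # shift scores so exp() cannot overflow
    raw = [w * math.exp(step_size * (s - top)) for w, s in zip(weights, scores)]
    total = sum(raw)
    return [r / total for r in raw]  # renormalize to a probability vector

# Toy gradients: domains 0 and 1 agree, domain 2 points the opposite way.
domain_grads = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
gram = gram_matrix(domain_grads)
new_weights = update_mixture([1 / 3, 1 / 3, 1 / 3], gram)
```

In this toy case the conflicting domain ends up with a smaller share of the mixture than the two agreeing domains. Because the gradients are assumed to be already available from training, the only extra work is the small Gram-matrix computation.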
Problem

Research questions and friction points this paper is trying to address.

Improving data mixing for efficient foundation model training
Capturing semantic nuances that predetermined domain partitions miss
Reducing computational costs in optimizing data composition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repartition data by semantic similarity for finer domains
Optimize data composition using gradient-induced Gram matrix
As little as 0.01% additional compute overhead while matching or exceeding state-of-the-art data mixing strategies
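The first point, regrouping by semantic similarity, can be sketched as clustering over example embeddings. The source does not say which embedding model or clustering algorithm R&B uses, so plain k-means with a simplified initialization (the first `num_domains` points) is an assumption here, and `regroup` is a hypothetical helper name.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def regroup(embeddings, num_domains, num_iters=10):
    """Assign each example to one of num_domains clusters (plain k-means).

    ASSUMPTION: k-means stands in for whatever clustering R&B actually uses.
    Centroids start at the first num_domains embeddings for determinism.
    """
    centroids = [list(e) for e in embeddings[:num_domains]]
    labels = [0] * len(embeddings)
    for _ in range(num_iters):
        # Assignment step: nearest centroid for every embedding.
        labels = [
            min(range(num_domains), key=lambda c: squared_distance(e, centroids[c]))
            for e in embeddings
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(num_domains):
            members = [e for e, l in zip(embeddings, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy 2-D "embeddings"; real ones would come from a text or image encoder.
embeddings = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]]
labels = regroup(embeddings, num_domains=2)  # [0, 1, 0, 1]
```

The resulting labels play the role of the finer-grained domains, replacing source-based tags such as data source or task type.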
Authors
Albert Ge
University of Wisconsin-Madison
Tzu-Heng Huang
Ph.D. student, University of Wisconsin-Madison
Data-centric AI, Data Curation, Multimodal Models, LLM Evaluation
John Cooper
University of Wisconsin-Madison
Avi Trost
University of Wisconsin-Madison
Ziyi Chu
University of Wisconsin-Madison
Satya Sai Srinath Namburi Gnvv
University of Wisconsin-Madison
Ziyang Cai
University of Wisconsin-Madison
Kendall Park
University of Wisconsin-Madison
Nicholas Roberts
PhD candidate, UW-Madison
Machine Learning, AutoML, data-centric AI
Frederic Sala
Assistant Professor, University of Wisconsin
Data-centric AI, Machine learning, Information theory