SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current LLM pretraining data mixing predominantly employs coarse-grained domain-level weighting, neglecting inter-domain overlaps and sample-level heterogeneity, and thus failing to ensure balanced global diversity and quality. To address this, we propose the first bottom-up, sample-level mixing paradigm: it jointly models per-sample quality and cross-domain diversity, and derives an optimal sampling distribution via global optimization. Unlike conventional within-domain uniform sampling, our approach enables fine-grained, cross-domain collaborative data selection. Experiments demonstrate that our method significantly outperforms state-of-the-art domain-weighting baselines across multiple downstream tasks and perplexity metrics. Notably, it matches baseline performance with 1.4–2.1× fewer training steps, demonstrating substantial gains in both pretraining efficiency and effectiveness from sample-level mixing.
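The summary above describes the approach at a high level: score every sample for quality and diversity, then convert those scores into a single global sampling distribution that cuts across domain boundaries. The Python sketch below is a minimal illustration of that idea under assumed inputs; the score arrays, the alpha trade-off, the softmax-style weighting, and the sample budget are hypothetical placeholders, not the paper's actual scoring models or optimization procedure.

```python
import numpy as np

# Hypothetical per-sample scores in [0, 1]; the paper's actual quality and
# diversity evaluators are not reproduced here.
quality   = np.array([0.9, 0.4, 0.7, 0.8, 0.2])  # per-sample quality score
diversity = np.array([0.6, 0.9, 0.3, 0.5, 0.8])  # per-sample diversity score
domain    = np.array([0,   0,   1,   1,   2])    # domain id of each sample

alpha = 0.5  # assumed trade-off between quality and diversity
score = alpha * quality + (1 - alpha) * diversity

# Turn scores into one global sampling distribution over all samples,
# irrespective of domain (bottom-up, sample-wise mixing).
temperature = 1.0
weights = np.exp(score / temperature)
p = weights / weights.sum()

budget = 3  # number of samples to draw for the training mix (assumed)
rng = np.random.default_rng(0)
chosen = rng.choice(len(p), size=budget, replace=False, p=p)

# The induced domain proportions fall out of the sample-level choices,
# rather than being fixed up front as in domain-wise mixing.
print("selected samples:", chosen)
print("implied domain mix:", np.bincount(domain[chosen], minlength=3) / budget)
```

In this bottom-up view the domain distribution is an output of the sample-level selection, the reversal of the top-down, domain-first pipelines the abstract contrasts against.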

📝 Abstract
Existing pretraining data mixing methods for large language models (LLMs) typically follow a domain-wise methodology, a top-down process that first determines domain weights and then performs uniform data sampling within each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Further, uniform sampling within domains ignores fine-grained sample-specific features, potentially leading to suboptimal data distribution. To address these shortcomings, we propose a novel sample-wise data mixture approach based on a bottom-up paradigm. This method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Meanwhile, SampleMix requires 1.4x to 2.1x fewer training steps to achieve the baselines' performance, highlighting the substantial potential of SampleMix to optimize pre-training data.
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations in domain-wise pretraining data mixing for LLMs.
Proposes sample-wise mixing to optimize data quality and diversity.
Demonstrates improved performance across tasks with SampleMix strategy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sample-wise data mixing for LLMs
Global cross-domain sampling strategy
Dynamic domain distribution optimization
Xiangyu Xi
Peking University; Meituan Group
natural language processing · event extraction · information extraction · task-oriented dialogue
Deyang Kong
Peking University
Natural Language Processing
Jian Yang
Meituan Group, Beijing, China
Jiawei Yang
Meituan Group, Beijing, China
Zhengyu Chen
Meituan Group, Beijing, China
Wei Wang
Meituan Group, Beijing, China
Jingang Wang
Meituan
Information Retrieval · Natural Language Processing · Machine Translation
Xunliang Cai
Meituan Group, Beijing, China
Shikun Zhang
Peking University
Wei Ye
National Engineering Research Center for Software Engineering, Peking University, Beijing, China