🤖 AI Summary
Pretraining large language models involves an inherent trade-off between data quality and diversity, yet existing methods typically optimize the two dimensions in isolation. Method: This paper proposes a joint optimization framework, QuaDMix, that models both the quality of each data point and its complementary contribution to the overall data distribution under a fixed training budget. Quality and diversity are combined in a single parameterized sampling function whose inputs are multiple quality scores and a domain label for each data point. The parameters of this function are searched efficiently by running simulated training experiments on small models and fitting LightGBM to the results, an approach inspired by RegMix. Contribution/Results: Evaluated across multiple models and datasets, the approach achieves an average improvement of 7.2% across multiple benchmarks and outperforms strategies that handle quality and diversity independently, such as quality filtering followed by adjustment of data proportions.
📝 Abstract
Quality and diversity are two critical properties of the training data for large language models (LLMs), and both positively affect model performance. Existing studies often optimize them separately, typically by first applying quality filtering and then adjusting data proportions. However, these approaches overlook the inherent trade-off between quality and diversity, which calls for considering the two jointly. Given a fixed training quota, it is essential to evaluate both the quality of each data point and its complementary effect on the overall dataset. In this paper, we introduce a unified data selection framework called QuaDMix, which automatically optimizes the data distribution for LLM pretraining while balancing quality and diversity. Specifically, we first propose multiple criteria to measure data quality and employ domain classification to distinguish data points, thereby measuring overall diversity. QuaDMix then employs a unified parameterized data sampling function that determines the sampling probability of each data point based on these quality-related and diversity-related labels. To accelerate the search for the optimal parameters of the QuaDMix framework, we conduct simulated experiments on smaller models and use LightGBM for parameter search, inspired by the RegMix method. Our experiments across diverse models and datasets demonstrate that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks. These results surpass those of strategies that optimize quality and diversity independently, highlighting both the necessity and the feasibility of balancing data quality and diversity.
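To make the idea of a parameterized sampling function concrete, the following is a minimal sketch and not the QuaDMix implementation. It assumes that each data point carries a tuple of quality scores in [0, 1] and a domain label, that quality scores are merged by a per-domain weighted sum, and that the merged score is mapped to a sampling probability by a scaled sigmoid. The names `DomainParams`, `sampling_probability`, and `select`, as well as the sigmoid form and all parameter values, are hypothetical choices made for illustration.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class DomainParams:
    """Hypothetical per-domain parameters of the sampling function."""
    weights: Sequence[float]  # weights used to merge the quality criteria
    threshold: float          # merged quality at which probability is scale / 2
    steepness: float          # how sharply probability rises with quality
    scale: float              # upper bound on the probability for this domain


def sampling_probability(quality: Sequence[float], params: DomainParams) -> float:
    """Map a data point's quality scores to a sampling probability."""
    # Merge the quality criteria with domain-specific weights (assumed form).
    merged = sum(w * q for w, q in zip(params.weights, quality))
    # Scaled sigmoid: higher merged quality gives a higher probability,
    # while `scale` controls how much of the domain is kept overall.
    p = params.scale / (1.0 + math.exp(-params.steepness * (merged - params.threshold)))
    return min(1.0, max(0.0, p))


def select(docs: List[dict], params_by_domain: Dict[str, DomainParams], seed: int = 0) -> List[dict]:
    """Keep each data point independently with its sampling probability."""
    rng = random.Random(seed)
    kept = []
    for doc in docs:
        p = sampling_probability(doc["quality"], params_by_domain[doc["domain"]])
        if rng.random() < p:
            kept.append(doc)
    return kept


if __name__ == "__main__":
    # Toy data: two hypothetical domains, two hypothetical quality criteria.
    params = {
        "web": DomainParams(weights=(0.5, 0.5), threshold=0.6, steepness=10.0, scale=0.8),
        "code": DomainParams(weights=(0.7, 0.3), threshold=0.4, steepness=10.0, scale=1.0),
    }
    gen = random.Random(1)
    docs = [
        {"domain": gen.choice(["web", "code"]), "quality": (gen.random(), gen.random())}
        for _ in range(1000)
    ]
    kept = select(docs, params)
    print(len(kept), "of", len(docs), "data points kept")
```

In this sketch, the per-domain `scale` plays the role of a diversity control and the remaining parameters control how strongly quality is favored, so a single parameter set expresses both concerns. A search procedure such as the small-model simulation with LightGBM described above would be one way to choose these values, but that step is not shown here.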