🤖 AI Summary
Large-scale pretraining often relies on massive yet low-diversity datasets, leading to inefficiency and wasted resources. This work proposes a diversity-driven sampling strategy for systematically constructing smaller but more representative pretraining corpora, and presents the first empirical evaluation of data diversity's impact on ModernBERT-style models. Experiments with the French ModernBERT architecture show that training on only 150M tokens of diversity-sampled data can outperform an equally sized random sample by up to 10 points on certain tasks. Moreover, the model reaches performance comparable to large-scale random pretraining in just 483 training hours, far below the original 1,775 hours, substantially improving training efficiency and resource utilization.
📝 Abstract
Diversity has been gaining interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets, which are driven by size rather than by diversity. This calls for an investigation of the impact of diversity on ModernBERT pre-training. We do so in this study, with the express intent of reducing pre-training dataset size while retaining at least comparable performance. We compare several diversity-driven sampling algorithms in order to select the best one. We find that diversity-driven sampling yields gains of up to 10 points on some tasks relative to randomly sampled pre-training data of commensurate size. We also observe that a model pre-trained for 483h on a diversity-driven dataset of 150M tokens performs comparably to a model pre-trained for 1,775h on a randomly sampled dataset of 2.4B tokens.
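The excerpt does not specify which sampling algorithms were compared. As a rough illustration of what diversity-driven corpus sampling can look like, here is a minimal sketch of greedy farthest-point selection over document embeddings; the function name, the embedding setup, and the choice of this particular heuristic are all assumptions for illustration, not the paper's method:

```python
import numpy as np

def farthest_point_sample(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k row indices that maximize the minimum distance
    to already-selected rows -- a classic diversity heuristic.
    NOTE: a hypothetical sketch, not the algorithm used in the paper."""
    # Seed with the point farthest from the centroid.
    centroid = embeddings.mean(axis=0)
    first = int(np.argmax(np.linalg.norm(embeddings - centroid, axis=1)))
    selected = [first]
    # Track each point's distance to its nearest selected point.
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))  # most "novel" remaining point
        selected.append(nxt)
        d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three tight synthetic clusters: a diverse sample of 3 should
    # cover all of them, whereas a random sample easily misses one.
    clusters = [rng.normal(loc=c, scale=0.05, size=(50, 2))
                for c in (0.0, 5.0, 10.0)]
    X = np.vstack(clusters)
    idx = farthest_point_sample(X, 3)
    print(sorted(i // 50 for i in idx))  # cluster id of each pick
```

On this toy data the three picks land in three different clusters, which is the intuition behind preferring a small diverse sample over a random one of the same size.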