🤖 AI Summary
To address the lack of semantic awareness and dynamic adaptability in data selection for large language model (LLM) pretraining, this paper proposes a two-stage topic-aware reweighting framework. In the first stage, fine-grained topic modeling is performed by integrating LDA and Top2Vec to characterize cross-domain semantic distributions. In the second stage, sample weights are dynamically adjusted based on gradient sensitivity analysis and online learning-state feedback, enabling curriculum-style updates and adaptive sampling. This approach overcomes the limitations of static quality scoring and fixed data mixing ratios, and—uniquely—couples fine-grained topic modeling with the model’s learning trajectory. Experiments across multiple domains demonstrate accelerated perplexity reduction during pretraining, average downstream task improvements of 2.1–4.7 percentage points, and an 18% reduction in computational overhead.
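The paper's exact update rule is not given here, but the second stage's idea — adjusting per-topic sample weights from online learning-state feedback — can be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the function name, the multiplicative-update form, and the use of relative loss reduction as the "learning progress" signal are all choices made for the example.

```python
import math

def update_topic_weights(weights, topic_losses, prev_topic_losses, lr=0.5):
    """Hypothetical curriculum-style reweighting step.

    Topics whose loss is falling slowly (little learning progress) get
    boosted so they are sampled more often; fast-improving topics are
    relatively downweighted. Weights are renormalized to sum to 1.
    """
    updated = {}
    for topic, w in weights.items():
        prev = prev_topic_losses[topic]
        # Relative loss reduction since the last evaluation window.
        progress = (prev - topic_losses[topic]) / max(prev, 1e-8)
        # Multiplicative update: less progress -> larger boost.
        updated[topic] = w * math.exp(lr * (1.0 - progress))
    total = sum(updated.values())
    return {topic: w / total for topic, w in updated.items()}

# Toy example: "science" loss barely moves, "news" loss drops quickly,
# so "science" should receive a larger sampling weight next round.
new_w = update_topic_weights(
    weights={"science": 0.5, "news": 0.5},
    topic_losses={"science": 3.9, "news": 3.0},
    prev_topic_losses={"science": 4.0, "news": 4.0},
)
```

The resulting weights could then drive adaptive sampling, e.g. via a weighted sampler over topic-labeled training shards.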
📝 Abstract
Pre-training large language models (LLMs) requires enormous, diverse textual corpora, making effective data selection a key challenge for balancing computational resources and model performance. Current methodologies primarily emphasize data quality metrics and mixing proportions, yet they fail to adequately capture the underlying semantic connections between training samples and quality disparities within individual domains. We introduce ToReMi (Topic-based Reweighting for Model improvement), a novel two-stage framework that dynamically adjusts training sample weights according to their topical associations and observed learning patterns. Our comprehensive experiments reveal that ToReMi variants consistently achieve superior performance over conventional pre-training approaches, demonstrating accelerated perplexity reduction across multiple domains and enhanced capabilities on downstream evaluation tasks. Code is available at https://github.com/zxx000728/ToReMi.