ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of semantic awareness and dynamic adaptability in data selection for large language model (LLM) pretraining, this paper proposes a two-stage topic-aware reweighting framework. In the first stage, fine-grained topic modeling is performed by integrating LDA and Top2Vec to characterize cross-domain semantic distributions. In the second stage, sample weights are dynamically adjusted based on gradient sensitivity analysis and online learning-state feedback, enabling curriculum-style updates and adaptive sampling. This approach overcomes the limitations of static quality scoring and fixed data mixing ratios, and—uniquely—couples fine-grained topic modeling with the model’s learning trajectory. Experiments across multiple domains demonstrate accelerated perplexity reduction during pretraining, average downstream task improvements of 2.1–4.7 percentage points, and an 18% reduction in computational overhead.
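The summary does not spell out the second-stage weight update, but its shape can be sketched. Below is a minimal sketch assuming a multiplicative update driven by how quickly each topic's loss is falling; `update_topic_weights`, the EMA baseline, and the hyperparameters are all illustrative names and choices, not the authors' implementation.

```python
import numpy as np

def update_topic_weights(weights, topic_losses, ema_losses, lr=0.5, beta=0.9):
    """Curriculum-style update: upweight topics whose loss is falling slowly.

    weights:      (n_topics,) current sampling weights, summing to 1
    topic_losses: (n_topics,) mean training loss per topic in this window
    ema_losses:   (n_topics,) EMA baseline of earlier per-topic losses
    """
    progress = ema_losses - topic_losses       # > 0 where the model is improving
    stalled = np.clip(-progress, 0.0, None)    # positive where loss is not falling
    weights = weights * np.exp(lr * stalled)   # multiplicative upweighting
    weights = weights / weights.sum()          # keep a valid sampling distribution
    ema_losses = beta * ema_losses + (1.0 - beta) * topic_losses
    return weights, ema_losses
```

Renormalizing after each update keeps the weights a valid distribution, which matches the curriculum-style, adaptive-sampling framing in the summary.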

📝 Abstract
Pre-training large language models (LLMs) necessitates enormous diverse textual corpora, making effective data selection a key challenge for balancing computational resources and model performance. Current methodologies primarily emphasize data quality metrics and mixing proportions, yet they fail to adequately capture the underlying semantic connections between training samples and quality disparities within individual domains. We introduce ToReMi (Topic-based Reweighting for Model improvement), a novel two-stage framework that dynamically adjusts training sample weights according to their topical associations and observed learning patterns. Our comprehensive experiments reveal that ToReMi variants consistently achieve superior performance over conventional pre-training approaches, demonstrating accelerated perplexity reduction across multiple domains and enhanced capabilities on downstream evaluation tasks. Code is available at https://github.com/zxx000728/ToReMi.
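To make the "topical associations" concrete, here is a minimal sketch of the first stage: attaching coarse topic labels to training documents with off-the-shelf LDA. The AI summary also mentions Top2Vec, which is omitted here; `assign_topics` and its parameters are illustrative, not the authors' API.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def assign_topics(docs, n_topics=50):
    """Return the most likely topic id for each training document."""
    vec = CountVectorizer(max_features=20000, stop_words="english")
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(X)   # (n_docs, n_topics) topic proportions
    return doc_topic.argmax(axis=1)    # hard topic label per document
```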
Problem

Research questions and friction points this paper is trying to address.

Dynamic pre-training data selection for LLMs
Balancing computational resources and model performance
Capturing semantic connections between training samples and quality disparities within individual domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic pre-training data selection framework
Topic-aware reweighting for model improvement
Two-stage weight adjustment by topical associations (a data-pipeline sketch follows this list)
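As a sketch of how per-topic weights could drive adaptive sampling in a PyTorch data pipeline, assuming each sample already carries a topic label from the first stage; `make_loader`, `sample_topics`, and `topic_weights` are hypothetical names, not the paper's code.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_loader(dataset, sample_topics, topic_weights, batch_size=8):
    """Each sample inherits the sampling weight of its assigned topic."""
    sample_weights = torch.tensor(
        [topic_weights[t] for t in sample_topics], dtype=torch.double
    )
    sampler = WeightedRandomSampler(
        sample_weights, num_samples=len(dataset), replacement=True
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

Sampling with replacement lets frequently upweighted topics appear more often per epoch without physically duplicating data.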
🔎 Similar Papers
No similar papers found.
👥 Authors
Xiaoxuan Zhu
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
Zhouhong Gu
Fudan University
Language Modeling, Automated Society, Model Editing
Suhang Zheng
Alibaba Group
Tao Wang
Alibaba Group
Tianyu Li
Alibaba Group
Hongwei Feng
Fudan University
knowledge management, AI, big data
Yanghua Xiao
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University