🤖 AI Summary
Existing LLM pretraining approaches lack fine-grained, topic-aware data mixing across heterogeneous sources. To address this, the authors propose a topic-based data mixing strategy built on DataWeave, an unsupervised topic modeling framework. DataWeave automatically partitions corpora at the topical level via multi-stage semantic clustering and LLM-driven topic generation, then applies a heuristic up- and down-sampling strategy for topic-aware mixing. Notably, it uses unsupervised topic models directly as data mixers, eliminating reliance on manually curated domain labels. Experiments show that fine-grained topics, particularly *Science* and *Relationships*, yield the most substantial performance gains, outperforming previous, more complex domain-level mixing methods on downstream benchmarks. The code and datasets will be publicly released.
📝 Abstract
The performance of large language models (LLMs) is significantly affected by the quality and composition of their pre-training data, which is inherently diverse, spanning various domains, sources, and topics. Effectively integrating these heterogeneous sources is crucial for optimizing LLM performance. Previous research has concentrated predominantly on domain-based data mixing, often neglecting the nuanced topic-level characteristics of the data. To address this gap, we propose a simple yet effective topic-based data mixing strategy that uses fine-grained topics generated by our topic modeling method, DataWeave. DataWeave employs a multi-stage clustering process to group semantically similar documents and then uses LLMs to generate detailed topics, enabling a more nuanced understanding of dataset composition. Our strategy applies heuristic methods to upsample or downsample specific topics, which significantly enhances LLM performance on downstream tasks and achieves superior results compared to previous, more complex data mixing approaches. Furthermore, we confirm that the topics *Science* and *Relationships* are particularly effective, yielding the most substantial performance improvements. We will make our code and datasets publicly available.
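The abstract does not specify the exact up/down-sampling heuristic, but the core idea of topic-aware mixing can be illustrated with a minimal sketch: given per-document topic labels (as DataWeave would produce) and multiplicative boost factors per topic, convert them into a sampling distribution over documents. The topic names and factor values below are hypothetical, not taken from the paper.

```python
import random

def topic_mixing_weights(doc_topics, boost, base=1.0):
    """Compute per-document sampling weights for topic-aware data mixing.

    doc_topics: list of topic labels, one per document.
    boost: dict mapping topic -> multiplicative factor
           (>1 upsamples the topic, <1 downsamples it);
           topics not listed keep the `base` weight.
    Returns a list of weights normalized to sum to 1.
    """
    raw = [boost.get(topic, base) for topic in doc_topics]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical example: upsample "science" docs, downsample "ads".
docs = ["science", "relationships", "ads", "science"]
weights = topic_mixing_weights(docs, {"science": 2.0, "ads": 0.5})

# Draw a topic-reweighted training sample (with replacement).
sample = random.choices(docs, weights=weights, k=10)
```

This is a static reweighting; a dynamic variant could recompute the boost factors during training based on validation signals per topic.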