Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity

📅 2025-09-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing perplexity (PPL)-based data filtering methods for large language model pretraining suffer from high computational overhead and poor robustness to noise and out-of-distribution samples. To address these limitations, this paper proposes a model-free token prior probability filtering method. It models lexical density and token role characteristics via corpus-level token frequency statistics, and employs mean–standard deviation thresholds combined with linguistically motivated heuristics to enable efficient, stable, and inference-free document selection. Compared to PPL-based approaches, the method achieves over 1000× speedup while attaining state-of-the-art average performance across 20 downstream tasks. Moreover, it demonstrates strong generalization to code, mathematical notation, and multilingual text. The proposed approach significantly enhances the efficiency, robustness, and applicability of data curation for LLM pretraining.

๐Ÿ“ Abstract
As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers from drawbacks: substantial time costs and inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000x compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision.
Problem

Research questions and friction points this paper is trying to address.

Filtering noisy text data efficiently for large language model pretraining
Overcoming slow speed and unreliability of perplexity-based filtering methods
Handling noisy or out-of-distribution samples in web corpora effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses corpus-level term frequency statistics
Filters documents based on token priors
Requires no model inference for filtering
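The filtering idea summarized above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the whitespace tokenization, log-prior statistics, unknown-token floor, and threshold values below are all illustrative assumptions.

```python
import math
from collections import Counter

def build_token_priors(corpus):
    """Estimate token prior probabilities from corpus-level term frequencies."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def prior_stats(doc, priors, floor=1e-9):
    """Mean and standard deviation of per-token log-priors for one document."""
    logs = [math.log(priors.get(tok, floor)) for tok in doc.split()]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, math.sqrt(var)

def filter_docs(corpus, mean_range, std_max):
    """Keep documents whose log-prior mean and std fall inside the thresholds."""
    priors = build_token_priors(corpus)
    kept = []
    for doc in corpus:
        if not doc.split():  # skip empty documents
            continue
        m, s = prior_stats(doc, priors)
        if mean_range[0] <= m <= mean_range[1] and s <= std_max:
            kept.append(doc)
    return kept
```

Because the priors are a single frequency table, scoring a document is a dictionary lookup per token, with no model forward pass — which is where the claimed 1000× speedup over PPL-based filtering comes from.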
Yeongbin Seo
Department of Artificial Intelligence, Yonsei University
Gayoung Kim
Department of Artificial Intelligence, Yonsei University
Jaehyung Kim
Department of Artificial Intelligence, Yonsei University
Jinyoung Yeo
Yonsei University
Natural Language Processing · Large Language Models · AI Agents