Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts

📅 2024-02-12
📈 Citations: 19
Influential: 1
🤖 AI Summary
To address the challenges of low-quality mathematical pretraining data, high human-annotation costs, and supervision-dependent filtering, this paper proposes AutoDS, a zero-shot, self-directed method for automatically filtering mathematical text. AutoDS uses base language models as generative classifiers, quantifying a passage's mathematical informativeness and pedagogical value directly from raw logits, eliminating the need for manual annotations or auxiliary training data. Its core innovation is a logits-driven relevance-scoring mechanism, integrated into a continual pretraining pipeline. Evaluated on MATH, GSM8K, and BBH, AutoDS significantly improves downstream reasoning performance, achieving roughly a 2× gain in pretraining token efficiency over baselines. The authors publicly release the AutoMathText dataset and implementation code, advancing efficient, reproducible training paradigms for mathematical language models.

📝 Abstract
We present Autonomous Data Selection (AutoDS), a method that leverages base language models themselves as zero-shot "generative classifiers" to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model's logits to determine whether a given passage is mathematically informative and educational. By integrating AutoDS into a continual pretraining pipeline, we substantially boost downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods. Empirically, our approach achieves roughly a twofold improvement in pretraining token efficiency over strong baselines, underscoring the potential of self-directed data selection in enhancing mathematical reasoning. We release our curated AutoMathText dataset to facilitate future research in automated domain-specific data curation. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
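The logit-based scoring idea can be sketched as follows. The model is asked (via a meta-prompt) whether a passage is mathematically informative and educational, and instead of sampling a free-form answer, the raw logits of the affirmative and negative answer tokens are turned into a score via a softmax over just those two options. The function and variable names below, the 0.5 threshold, and the exact meta-prompt wording are illustrative assumptions, not the paper's released implementation; consult the linked repository for the authors' exact prompts and scoring details.

```python
import math

def lm_score(logit_yes: float, logit_no: float) -> float:
    """Softmax restricted to the two answer tokens:
    an estimate of P("YES" | meta-prompt, passage).
    """
    m = max(logit_yes, logit_no)           # subtract max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def select_passages(scored_passages, threshold=0.5):
    """Keep passages whose zero-shot score exceeds a threshold.

    `scored_passages` is a list of (passage, logit_yes, logit_no) tuples,
    where the logits come from a base LM conditioned on the meta-prompt
    and the passage (hypothetical interface for this sketch).
    """
    return [p for p, ly, ln in scored_passages
            if lm_score(ly, ln) > threshold]
```

Because only two logits are read off a single forward pass, no classifier training or human labels are needed, which is what makes the approach zero-shot and cheap to run at corpus scale.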
Problem

Research questions and friction points this paper is trying to address.

Automates high-quality math text curation
Enhances pretraining token efficiency
Improves downstream math benchmark performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot generative classifiers
Autonomous data selection
Continual pretraining pipeline