Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts

📅 2024-02-12
📈 Citations: 19
Influential: 1
🤖 AI Summary
To address the challenges of low-quality mathematical pretraining data, high human-annotation costs, and supervision-dependent filtering, this paper proposes AutoDS, a zero-shot, self-directed method for automatically filtering mathematical text. AutoDS uses base language models as generative classifiers, quantifying a passage's mathematical informativeness and pedagogical value directly from raw logits, eliminating the need for manual annotations or auxiliary training data. Its core innovation is a logits-driven relevance-scoring mechanism, integrated into a continual pretraining pipeline. Evaluated on MATH, GSM8K, and BBH, AutoDS significantly improves downstream reasoning performance, achieving roughly a 2× gain in pretraining token efficiency over baselines. The authors publicly release the AutoMathText dataset and implementation code, advancing efficient, reproducible training paradigms for mathematical language models.

📝 Abstract
We present Autonomous Data Selection (AutoDS), a method that leverages base language models themselves as zero-shot "generative classifiers" to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model's logits to determine whether a given passage is mathematically informative and educational. By integrating AutoDS into a continual pretraining pipeline, we substantially boost downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods. Empirically, our approach achieves roughly a twofold improvement in pretraining token efficiency over strong baselines, underscoring the potential of self-directed data selection in enhancing mathematical reasoning. We release our curated AutoMathText dataset to facilitate future research in automated domain-specific data curation. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
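The logit-based scoring idea can be sketched as follows. The model is asked (via a meta-prompt) whether a passage is mathematically informative and educational, and instead of sampling a free-form answer, the raw logits of the affirmative and negative answer tokens are turned into a score via a softmax over just those two options. The function and variable names below, the 0.5 threshold, and the exact meta-prompt wording are illustrative assumptions, not the paper's released implementation; consult the linked repository for the authors' exact prompts and scoring details.

```python
import math

def lm_score(logit_yes: float, logit_no: float) -> float:
    """Softmax restricted to the two answer tokens:
    an estimate of P("YES" | meta-prompt, passage).
    """
    m = max(logit_yes, logit_no)           # subtract max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def select_passages(scored_passages, threshold=0.5):
    """Keep passages whose zero-shot score exceeds a threshold.

    `scored_passages` is a list of (passage, logit_yes, logit_no) tuples,
    where the logits come from a base LM conditioned on the meta-prompt
    and the passage (hypothetical interface for this sketch).
    """
    return [p for p, ly, ln in scored_passages
            if lm_score(ly, ln) > threshold]
```

Because only two logits are read off a single forward pass, no classifier training or human labels are needed, which is what makes the approach zero-shot and cheap to run at corpus scale.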
Problem

Research questions and friction points this paper is trying to address.

Automates high-quality math text curation
Enhances pretraining token efficiency
Improves downstream math benchmark performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot generative classifiers
Autonomous data selection
Continual pretraining pipeline