The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators

📅 2024-06-25
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Conventional data annotation that relies on black-box LLM API calls faces critical bottlenecks, including prohibitively high invocation costs, static non-editable outputs, and poor auditability. Method: The paper proposes a new paradigm in which LLMs generate executable annotation programs: rather than being queried for each label, the model synthesizes Python annotation code that runs locally, enabling lightweight, iterative validation and refinement. Contribution/Results: Across multiple tasks, the approach matches or exceeds direct LLM-based annotation, improving quality by 12.9% on average while reducing total labeling cost by a factor of roughly 500. Because the programs can be stored, audited, re-used, and extended, the approach avoids the opacity and recurring expense of static, API-produced datasets. To the authors' knowledge, this is the first work to integrate program synthesis into the data annotation pipeline, establishing a paradigm for efficient, controllable, and sustainable data engineering.
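The paradigm above can be illustrated with a minimal sketch. Instead of paying for one API call per example, the LLM is asked once to emit a small Python labeling program, which is then run locally over the entire dataset. The task (binary sentiment), the keyword lists, and the `label` function below are hypothetical illustrations, not code from the paper:

```python
# Hypothetical example of an LLM-generated annotation program: a cheap,
# auditable heuristic that labels text locally instead of via API calls.
import re

# Illustrative keyword lists (in practice, the LLM would propose these).
POSITIVE = {"great", "excellent", "love", "wonderful", "amazing"}
NEGATIVE = {"terrible", "awful", "hate", "boring", "worst"}

def label(text: str) -> int:
    """Return 1 (positive), 0 (negative), or -1 (abstain)."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    pos = len(tokens & POSITIVE)
    neg = len(tokens & NEGATIVE)
    if pos > neg:
        return 1
    if neg > pos:
        return 0
    return -1  # abstain; downstream aggregation can resolve these

if __name__ == "__main__":
    reviews = ["An excellent, wonderful film", "The worst, most boring plot"]
    # Applied locally and repeatably, at zero marginal API cost.
    print([label(r) for r in reviews])  # → [1, 0]
```

Because the program is a plain Python artifact, it can be inspected, versioned, edited, and re-run on new data, which is what gives the paradigm its auditability and reusability advantages over one-off API-generated labels.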

📝 Abstract
Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling distilling generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains comparable to or better performance than large language model-based annotation in a range of tasks for a fraction of the cost: on average, improvements amount to a 12.9% enhancement while the total labeling costs across all datasets are reduced by a factor of approximately 500x.
Problem

Research questions and friction points this paper is trying to address.

Pre-trained models
Data annotation
Cost-efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Alchemist System
Cost Reduction
Efficiency Improvement