Automatic Labelling with Open-source LLMs using Dynamic Label Schema Integration

📅 2025-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the privacy risks and API costs of closed-source labeling services (e.g., GPT-4) and the low accuracy of open-source large language models (e.g., Llama, Phi) on high-cardinality classification tasks, this paper proposes Retrieval-Augmented Classification (RAC). RAC integrates retrieval principles into the classification pipeline: it ranks labels via embedding-based matching against descriptive label representations and performs per-label LLM reasoning with early stopping, avoiding exhaustive scoring across the full label space. This dynamic integration of the label schema enables a controllable trade-off between annotation quality and coverage. Evaluated on multi-domain internal datasets, RAC improves F1 score by 12.7%, achieves fully automated high-quality labeling, reduces human annotation effort by over 80%, and avoids the privacy leakage and API invocation costs inherent in closed-source LLM services.

📝 Abstract
Acquiring labelled training data that meets quantity and quality requirements remains a costly task in real-world machine learning projects. Recently, Large Language Models (LLMs), notably GPT-4, have shown great promise in labelling data with high accuracy. However, privacy and cost concerns prevent the ubiquitous use of GPT-4. In this work, we explore effectively leveraging open-source models for automatic labelling. We identify integrating the label schema as a promising technique, but found that naively using label descriptions for classification leads to poor performance on high-cardinality tasks. To address this, we propose Retrieval Augmented Classification (RAC), in which the LLM performs inference on one label at a time using the corresponding label schema: we start with the most related label and iterate until a label is chosen by the LLM. We show that our method, which dynamically integrates label descriptions, leads to performance improvements in labelling tasks. We further show that by focusing only on the most promising labels, RAC can trade off between label quality and coverage, a property we leverage to automatically label our internal datasets.
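The loop the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the word-overlap `similarity` function stands in for a real embedding model, and `llm_accepts` stands in for the per-label LLM inference call; both names, and the `max_candidates` cutoff used for the quality/coverage trade-off, are hypothetical.

```python
def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def rac_classify(text, label_schema, llm_accepts, max_candidates=None):
    """Retrieval Augmented Classification sketch.

    label_schema: dict mapping each label to its textual description.
    llm_accepts: callable(text, label, description) -> bool, standing in
        for a per-label LLM inference.
    max_candidates: only consider the top-k most related labels; smaller k
        trades coverage (more abstentions) for label quality.
    Returns the first accepted label, or None (abstain) if none is accepted;
    abstained items can be routed to human annotators.
    """
    # Rank labels by similarity between the input and each label description.
    ranked = sorted(label_schema,
                    key=lambda lbl: similarity(text, label_schema[lbl]),
                    reverse=True)
    if max_candidates is not None:
        ranked = ranked[:max_candidates]
    # Query the LLM one label at a time, stopping as soon as one is chosen,
    # so the full label space is never scored exhaustively.
    for label in ranked:
        if llm_accepts(text, label, label_schema[label]):
            return label
    return None
```

Because labels are visited in order of relatedness, the early stop means most inputs need only one or two LLM calls even when the schema has hundreds of labels.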
Problem

Research questions and friction points this paper is trying to address.

Machine Learning
Data Annotation
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval Augmented Classification (RAC)
Open-source Large Language Models
Dynamic Label Schema Integration