🤖 AI Summary
To address the problem that large language models (LLMs) often mislabel ambiguous or difficult instances due to inherent uncertainty, thereby degrading downstream task performance, this paper proposes a novel "candidate labeling" paradigm. Instead of generating a single deterministic label, the LLM outputs all plausible candidate labels; a lightweight small language model (SLM) then performs uncertainty-aware supervised distillation to produce a high-quality single-label prediction. Inspired by human cognitive strategies for avoiding ambiguity, this approach is theoretically shown to achieve superior statistical consistency and tighter generalization bounds compared to conventional single-label annotation. Implemented within a teacher-student framework (CanDist) and enhanced with uncertainty-driven prompting, the method improves annotation quality across six text classification benchmarks, boosting downstream model accuracy by 2.1–4.7 percentage points on average. The code is publicly available.
📝 Abstract
Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy, prompting the LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising data quality for downstream applications. Motivated by ambiguity aversion in human behavior, we propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when they are uncertain. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework, CanDist, that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.
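To make the candidate-annotation idea concrete, here is a minimal sketch of one common way to distill from a candidate label set: the student is penalized by the negative log of the total probability it assigns to the teacher's candidate set, rather than to one forced gold label. This is a generic partial-label-style objective for illustration only; the actual CanDist distillation procedure in the paper may differ.

```python
import math

def softmax(logits):
    """Convert raw student logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def candidate_loss(logits, candidates):
    """Negative log-probability mass the student places on the
    teacher's candidate label set (a simplified stand-in for the
    paper's uncertainty-aware distillation objective)."""
    probs = softmax(logits)
    return -math.log(sum(probs[c] for c in candidates))

# A student that concentrates mass inside the candidate set {0, 1}
# incurs a lower loss than one that favors a non-candidate class.
low = candidate_loss([3.0, 0.0, 0.0], [0, 1])
high = candidate_loss([0.0, 0.0, 3.0], [0, 1])
```

The key property is that the loss never forces the student to pick among the teacher's plausible labels prematurely; any distribution supported on the candidate set is acceptable, which is what gives candidate annotation its robustness to LLM uncertainty.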