REALM: Reliable Expertise-Aware Language Model Fine-Tuning from Noisy Annotations

📅 2026-04-19
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the performance degradation of large language models during supervised fine-tuning caused by noisy crowdsourced annotations, a problem exacerbated by existing methods that ignore annotator expertise heterogeneity. The authors propose REALM, a method that jointly learns model parameters and annotator competence scores during fine-tuning. Observed labels are modeled as a weighted mixture of the model's predictions and random guesses, with weights dynamically determined by each annotator's estimated competence. REALM enables unbiased estimation of annotator competence without additional supervision and naturally extends to multitask settings via a competence matrix that captures task-specific reliability. Experiments with Flan-T5 demonstrate that REALM significantly outperforms baseline approaches across five question-answering benchmarks, achieving up to a 50% accuracy gain under extreme noise conditions, with improvements amplifying as model scale increases.
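
The mixture described in this summary can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes the observed-label probability takes the form e_a * p_model(y | x) + (1 - e_a) / K, where e_a is the competence of annotator a and K is the number of candidate answers, and it assumes competence is parametrized through a sigmoid. The function names (`sigmoid`, `mixture_nll`) and the sigmoid parametrization are hypothetical choices made for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Map an unconstrained logit to a competence value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mixture_nll(model_probs, labels, annotator_ids, competence_logits):
    """Mean negative log-likelihood of observed labels under the mixture.

    model_probs:       list of per-example class-probability lists (each sums to 1)
    labels:            observed (possibly noisy) label index per example
    annotator_ids:     index of the annotator who produced each label
    competence_logits: one unconstrained logit per annotator (assumed parametrization)
    """
    num_classes = len(model_probs[0])
    total = 0.0
    for probs, y, a in zip(model_probs, labels, annotator_ids):
        e = sigmoid(competence_logits[a])
        # Mixture of the model's prediction and a uniform random guess.
        p_obs = e * probs[y] + (1.0 - e) / num_classes
        total -= math.log(p_obs)
    return total / len(labels)
```

Under these assumptions, a label that contradicts a confident model prediction is penalized less when its annotator's competence is low, which is what lets the loss discount unreliable annotators instead of fitting their errors.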

๐Ÿ“ Abstract
Supervised fine-tuning of large language models relies on human-annotated data, yet annotation pipelines routinely involve multiple crowdworkers of heterogeneous expertise. Standard practice aggregates labels via majority vote or simple averaging, discarding annotator identity and causing the model to absorb the errors of unreliable annotators directly into its parameters. We propose REALM, a method that jointly learns the model parameters and a scalar expertise value for each annotator, requiring no supervision beyond annotator identity. The key idea is to model each observed label as a mixture between the model's prediction and a uniform random guess, weighted by the annotator's learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. We evaluate on five question answering benchmarks, fine-tuning three sizes of Flan-T5 under simulated noisy annotations. The proposed algorithm outperforms naive SFT on the noisy labels in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to $50\%$ in the most adversarial regime and gains that grow with model capacity.
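
As a rough illustration of the multi-task extension, the sketch below indexes an expertise matrix by (annotator, task) to obtain the mixture weight for each observed label. It is written under stated assumptions rather than taken from the paper: the matrix is stored as unconstrained logits passed through a sigmoid, and the names `mixture_weight` and `observed_label_prob` are hypothetical.

```python
import math


def mixture_weight(expertise_logits, annotator: int, task: int) -> float:
    """Look up one annotator's expertise on one task, squashed into (0, 1).

    expertise_logits is an A x T nested list (annotators x tasks) of
    unconstrained values; the sigmoid parametrization is an assumption.
    """
    return 1.0 / (1.0 + math.exp(-expertise_logits[annotator][task]))


def observed_label_prob(model_prob: float, num_choices: int, weight: float) -> float:
    """Probability of the observed label: model prediction mixed with a uniform guess."""
    return weight * model_prob + (1.0 - weight) / num_choices


# One annotator who is reliable on task 0 but close to guessing on task 1.
expertise_logits = [[4.0, -4.0]]
for task in (0, 1):
    w = mixture_weight(expertise_logits, annotator=0, task=task)
    p = observed_label_prob(model_prob=0.9, num_choices=4, weight=w)
    print(f"task {task}: weight={w:.3f}, p(observed label)={p:.3f}")
```

Keeping one entry per (annotator, task) pair lets the same annotator be trusted on one benchmark and discounted on another, which a single scalar per annotator cannot express.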
Problem

Research questions and friction points this paper is trying to address.

noisy annotations
annotator expertise
supervised fine-tuning
label aggregation
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

expertise-aware learning
noisy annotation
supervised fine-tuning
annotator reliability
multi-task learning