AI Summary
This work addresses the performance degradation of large language models during supervised fine-tuning caused by noisy crowdsourced annotations, a problem exacerbated by existing methods that ignore annotator expertise heterogeneity. The authors propose REALM, a method that jointly learns model parameters and annotator competence scores during fine-tuning. Observed labels are modeled as a weighted mixture of the model's predictions and random guesses, with weights dynamically determined by each annotator's estimated competence. REALM enables unbiased estimation of annotator competence without additional supervision and naturally extends to multitask settings via a competence matrix that captures task-specific reliability. Experiments with Flan-T5 demonstrate that REALM significantly outperforms baseline approaches across five question-answering benchmarks, achieving up to a 50% accuracy gain under extreme noise conditions, with improvements amplifying as model scale increases.
Abstract
Supervised fine-tuning of large language models relies on human-annotated data, yet annotation pipelines routinely involve multiple crowdworkers of heterogeneous expertise. Standard practice aggregates labels via majority vote or simple averaging, discarding annotator identity and causing the model to absorb the errors of unreliable annotators directly into its parameters. We propose REALM, a method that jointly learns the model parameters and a scalar expertise value for each annotator, requiring no supervision beyond annotator identity. The key idea is to model each observed label as a mixture of the model's prediction and a uniform random guess, weighted by the annotator's learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. We evaluate on five question-answering benchmarks, fine-tuning three sizes of Flan-T5 under simulated noisy annotations. REALM outperforms naive noisy SFT in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to $50\%$ in the most adversarial regime and gains that grow with model capacity.
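The mixture idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a classification-style setting with `K` candidate answers, and it assumes each expertise value is obtained by passing a learnable logit through a sigmoid so that it stays in [0, 1]. The function name `mixture_nll` and all argument names are hypothetical. Under these assumptions, the probability of an observed label `y` from annotator `a` is `e_a * p_model(y | x) + (1 - e_a) / K`, and the multi-task variant looks up `e` in an annotator-by-task matrix instead of a vector.

```python
import numpy as np


def sigmoid(z):
    """Map unconstrained logits to expertise values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def mixture_nll(model_probs, labels, annotator_ids, expertise_logits, task_ids=None):
    """Mean negative log-likelihood of observed labels under the mixture.

    model_probs:      (N, K) model probabilities over K candidate answers
    labels:           (N,)   observed (possibly noisy) label indices
    annotator_ids:    (N,)   index of the annotator who produced each label
    expertise_logits: (A,)   per-annotator logits, or (A, T) for the
                             multi-task case (assumed parameterization)
    task_ids:         (N,)   task index per example, only for the (A, T) case
    """
    num_classes = model_probs.shape[1]
    if task_ids is None:
        logits = expertise_logits[annotator_ids]
    else:
        logits = expertise_logits[annotator_ids, task_ids]
    expertise = sigmoid(logits)

    # Probability the model assigns to each observed label.
    p_model = model_probs[np.arange(len(labels)), labels]

    # Mixture of the model's prediction and a uniform random guess.
    p_label = expertise * p_model + (1.0 - expertise) / num_classes
    return -np.mean(np.log(p_label))
```

In this sketch, a label from an annotator with expertise near 0 contributes an almost constant term `log(1/K)`, so it exerts little gradient pressure on the model, while a label from an annotator with expertise near 1 is treated like a standard cross-entropy target. In an actual fine-tuning run the model probabilities and the expertise logits would be optimized jointly with an autodiff framework; the NumPy version only evaluates the objective.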