🤖 AI Summary
This study addresses the challenge of modeling human annotation variability in natural language processing, which arises from subjectivity, ambiguity, and legitimate annotator disagreement. We propose a two-stage contextual meta-learning framework that combines general-purpose post-training with meta-learning specialized to the task distribution of interest. Crucially, annotator-specific examples are incorporated into the in-context prompt to model individual preferences, and larger models are used to improve generalization. Our key contribution is the empirical finding, validated via ablation studies, that annotator examples play a decisive role in in-context learning for subjective tasks. Evaluated on two highly subjective tasks in the LeWiDi-2025 competition, our method achieves first place in both, significantly outperforming all baselines. The results demonstrate its effectiveness in capturing the diversity of human judgments and in improving model robustness and adaptability to subjective linguistic phenomena.
📝 Abstract
Many natural language processing (NLP) tasks involve subjectivity, ambiguity, or legitimate disagreement between annotators. In this paper, we describe our system for modeling human variation. Our system leverages large language models' (LLMs') in-context learning abilities, together with a two-step meta-learning training procedure: (1) post-training on many datasets that require in-context learning, and (2) specializing the model via in-context meta-learning to the particular data distribution of interest. We evaluate our system submission to the Learning With Disagreements (LeWiDi) competition, where it was the overall winner on both tasks. We also perform an ablation study to measure the importance of each system component. We find that including rater examples in-context is crucial for our system's performance, that dataset-specific fine-tuning helps on the larger datasets, that post-training on other in-context datasets helps on one of the competition datasets, and that performance improves with model scale.
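To make the in-context setup concrete, here is a minimal sketch of how a prompt with rater-specific examples might be assembled. This is an illustrative assumption, not the authors' actual pipeline: the function name `build_prompt`, the plain-text prompt template, and the example labels are all hypothetical.

```python
def build_prompt(rater_examples, target_item):
    """Assemble an in-context prompt for one annotator (hypothetical format).

    The rater's past (item, label) pairs precede the target item, so the
    model can condition on that individual's labeling preferences before
    predicting a label for the new item.
    """
    lines = ["Predict this annotator's label for the final item."]
    for item, label in rater_examples:
        lines.append(f"Item: {item}\nLabel: {label}")
    # The target item ends with an empty label slot for the model to fill.
    lines.append(f"Item: {target_item}\nLabel:")
    return "\n\n".join(lines)


# Example usage with made-up irony annotations:
prompt = build_prompt(
    rater_examples=[("Great, more rain.", "ironic"),
                    ("The show starts at 8pm.", "not ironic")],
    target_item="Oh wonderful, another meeting.",
)
print(prompt)
```

The key design point the abstract emphasizes is that the examples come from the *same annotator* as the target item, which is what lets the model capture individual variation rather than an aggregate label.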