🤖 AI Summary
Traditional multi-annotator learning treats annotation disagreements as noise and aggregates them into a single ground truth. Yet subjective tasks lack an absolute ground truth, and sparse annotations make statistical aggregation unreliable. This work proposes a paradigm shift: abandoning sample-level aggregation in favor of modeling annotator-specific behavior, treating disagreement as an informative signal. We introduce QuMATL, a lightweight query-driven behavioral learning framework that models individual annotators while capturing inter-annotator correlations as implicit regularization, enabling reconstruction of unlabeled data and interpretable behavioral analysis. We also present two large-scale multi-annotator datasets with dense per-annotator labels, STREET and AMER, the latter being the first multimodal multi-annotator dataset. Experiments demonstrate that QuMATL significantly improves generalization under sparse annotation, enhances aggregation reliability, reduces annotation cost, and supports decision traceability via visualizable attention.
📝 Abstract
Multi-annotator learning traditionally aggregates diverse annotations to approximate a single ground truth, treating disagreements as noise. However, this paradigm faces fundamental challenges: subjective tasks often lack an absolute ground truth, and sparse annotation coverage makes aggregation statistically unreliable. We introduce a paradigm shift from sample-wise aggregation to annotator-wise behavior modeling. By treating annotator disagreements as valuable information rather than noise, modeling annotator-specific behavior patterns can reconstruct unlabeled data to reduce annotation cost, enhance aggregation reliability, and explain annotator decision behavior. To this end, we propose QuMATL (Query-based Multi-Annotator Behavior Pattern Learning), which uses lightweight queries to model individual annotators while capturing inter-annotator correlations as implicit regularization; this prevents overfitting to sparse individual data while preserving individualization and improving generalization. A visualization of each annotator's focus regions further offers an explainable analysis of annotator behavior. We contribute two large-scale datasets with dense per-annotator labels: STREET (4,300 labels per annotator) and AMER (3,118 labels per annotator on average), the first multimodal multi-annotator dataset.
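To make the query-based idea concrete, here is a minimal NumPy sketch, not the authors' implementation: each annotator is represented by one lightweight query vector that cross-attends over shared sample features, yielding a per-annotator prediction. All dimensions, variable names, and the shared classification head are illustrative assumptions; the coupling through shared features is only a rough analogue of the inter-annotator regularization described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions (illustrative, not from the paper)
num_annotators, d = 3, 8        # annotators, feature dim
num_tokens, num_classes = 5, 4  # feature tokens per sample, label classes

# Shared backbone features for one sample: (tokens, d)
features = rng.standard_normal((num_tokens, d))

# One lightweight learnable query per annotator: (annotators, d)
queries = rng.standard_normal((num_annotators, d))

# Cross-attention: each annotator's query attends over the shared features.
# Because all queries attend over the SAME feature tokens, their predictions
# are coupled -- a crude stand-in for the implicit regularization from
# inter-annotator correlation described in the abstract.
scores = queries @ features.T / np.sqrt(d)   # (annotators, tokens)
attn = softmax(scores, axis=-1)              # annotator-specific focus weights
context = attn @ features                    # (annotators, d)

# Per-annotator classification head (shared weights here for brevity)
W = rng.standard_normal((d, num_classes))
logits = context @ W                         # (annotators, classes)
per_annotator_pred = logits.argmax(axis=-1)  # one predicted label per annotator

print(attn.shape, per_annotator_pred.shape)
```

The attention matrix `attn` is also what enables the explainability claim: each row is one annotator's focus distribution over the sample's feature tokens, which can be visualized to compare where different annotators look.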