A Unified Evaluation Framework for Multi-Annotator Tendency Learning

πŸ“… 2025-08-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing Individual Tendency Learning (ITL) methods lack a unified, quantifiable framework for rigorously assessing whether they genuinely capture annotator-level behavioral differences and yield behaviorally plausible explanations. This paper introduces the first evaluation framework for ITL in multi-annotator settings. Its core contributions are: (1) the Difference of Inter-annotator Consistency (DIC) metric, which quantifies a model's ability to capture heterogeneity in annotator behavior; and (2) the Behavior Alignment Explainability (BAE) metric, the first to jointly evaluate the alignment between ITL-generated explanations and empirically observed annotator behavior. The framework integrates multidimensional scaling, predictive similarity-structure comparison, and explanation verification, grounded in real-world annotation data. Extensive experiments demonstrate that DIC and BAE effectively distinguish state-of-the-art ITL methods in both tendency-modeling fidelity and explanation plausibility, establishing a reliable, behaviorally grounded benchmark for future ITL research.

πŸ“ Abstract
Recent works have emerged in multi-annotator learning that shift focus from Consensus-oriented Learning (CoL), which aggregates multiple annotations into a single ground-truth prediction, to Individual Tendency Learning (ITL), which models annotator-specific labeling behavior patterns (i.e., tendencies) to provide explanation analysis for understanding annotator decisions. However, no evaluation framework currently exists to assess whether ITL methods truly capture individual tendencies and provide meaningful behavioral explanations. To address this gap, we propose the first unified evaluation framework with two novel metrics: (1) Difference of Inter-annotator Consistency (DIC) quantifies how well models capture annotator tendencies by comparing predicted inter-annotator similarity structures with ground truth; (2) Behavior Alignment Explainability (BAE) evaluates how well model explanations reflect annotator behavior and decision relevance by aligning explainability-derived similarity structures with ground-truth labeling similarity structures via Multidimensional Scaling (MDS). Extensive experiments validate the effectiveness of our proposed evaluation framework.
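The DIC metric, as described above, compares predicted inter-annotator similarity structures against those observed in the ground-truth annotations. As a rough illustration of that idea (not the paper's actual formulation; the pairwise agreement rate and the correlation-based comparison below are assumptions chosen for simplicity), one could build an agreement matrix for each label source and correlate their upper triangles:

```python
import numpy as np

def agreement_matrix(labels):
    """Pairwise label-agreement rate between annotators.

    labels: (n_annotators, n_items) integer label matrix.
    """
    a = labels.shape[0]
    sim = np.ones((a, a))
    for i in range(a):
        for j in range(i + 1, a):
            sim[i, j] = sim[j, i] = np.mean(labels[i] == labels[j])
    return sim

def dic_score(true_labels, pred_labels):
    """Illustrative DIC-style score (an assumption, not the paper's
    definition): Pearson correlation between the predicted and
    ground-truth inter-annotator similarity structures."""
    s_true = agreement_matrix(true_labels)
    s_pred = agreement_matrix(pred_labels)
    iu = np.triu_indices_from(s_true, k=1)  # compare upper triangles only
    return np.corrcoef(s_true[iu], s_pred[iu])[0, 1]
```

A model whose per-annotator predictions reproduce the observed agreement structure would score near 1 under this sketch, while a model that homogenizes annotators would not.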
Problem

Research questions and friction points this paper is trying to address.

Evaluates whether models truly capture individual annotator tendencies
Assesses whether behavioral explanations in multi-annotator learning are meaningful
Quantifies alignment between predicted and ground-truth annotator behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified evaluation framework with two novel metrics
DIC quantifies how well predicted inter-annotator similarity structures match ground truth
BAE aligns model explanations with observed annotator behavior via MDS
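The BAE bullet above describes aligning explanation-derived similarity structures with ground-truth labeling similarity structures via MDS. A minimal sketch of such a pipeline, assuming classical (Torgerson) MDS and a pairwise-distance correlation as the alignment measure (both are illustrative stand-ins, not the paper's actual BAE definition):

```python
import numpy as np

def classical_mds(dissim, k=2):
    """Classical (Torgerson) MDS: embed a dissimilarity matrix in k dims."""
    n = dissim.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (dissim ** 2) @ J          # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]          # top-k eigenpairs
    scale = np.sqrt(np.clip(vals[idx], 0.0, None))
    return vecs[:, idx] * scale

def alignment_score(sim_explain, sim_true, k=2):
    """Hypothetical BAE-style score: embed both similarity structures
    with MDS, then correlate the pairwise distances of the two
    embeddings (invariant to rotation and reflection)."""
    xe = classical_mds(1.0 - sim_explain, k)  # similarity -> dissimilarity
    xt = classical_mds(1.0 - sim_true, k)
    dist = lambda X: np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    iu = np.triu_indices(sim_true.shape[0], k=1)
    return np.corrcoef(dist(xe)[iu], dist(xt)[iu])[0, 1]
```

Comparing distances rather than raw coordinates sidesteps the fact that MDS embeddings are only defined up to rotation and reflection; a Procrustes analysis would be an alternative way to handle this.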
πŸ”Ž Similar Papers
No similar papers found.
Liyun Zhang
D3 Center, The University of Osaka
Jingcheng Ke
D3 Center, The University of Osaka
Shenli Fan
Business Administration, Osaka University of Economics and Law
Xuanmeng Sha
IST, The University of Osaka
Zheng Lian
Associate Professor, IEEE/CCF Senior Member, Institute of Automation, Chinese Academy of Sciences
Affective Computing · Sentiment Analysis · Machine Learning