AI Summary
This study addresses the ambiguity inherent in subjective emotion classification, where annotator disagreement creates a need for effective uncertainty quantification. The work integrates soft-label learning with Bayesian deep learning by training a linear head on top of a frozen RoBERTa feature extractor. To approximate the empirical annotator distribution, the method uses cyclical stochastic gradient Markov chain Monte Carlo (cSG-MCMC) and examines post-hoc temperature scaling as a calibration step. A five-axis evaluation framework shows that calibration quality under hard labels and fidelity to the annotator distribution are distinct evaluation axes. On the GoEmotions dataset, the proposed method outperforms Monte Carlo Dropout and Deep Ensemble baselines on Jensen-Shannon divergence, Spearman correlation, and the AURC/AUROC selective-prediction metrics.
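As a rough illustration of what "fidelity to the annotator distribution" can mean in practice, the sketch below builds a soft label from annotator votes and scores a prediction against it with Jensen-Shannon divergence. It is a minimal sketch under simplifying assumptions, not the paper's implementation: the helper names (`soft_label`, `soft_cross_entropy`, `jsd`) are hypothetical, each annotator is assumed to cast a single vote per example, and natural logarithms are used.

```python
import math

def soft_label(votes, num_classes):
    """Turn a list of annotator votes (class indices) into an empirical distribution.

    Assumption: one vote per annotator; the source does not specify the exact construction.
    """
    counts = [0.0] * num_classes
    for v in votes:
        counts[v] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft (non one-hot) target."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2 when using natural logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Three annotators: two pick class 0, one picks class 1.
target = soft_label([0, 0, 1], num_classes=3)
sharp_prediction = [0.98, 0.01, 0.01]   # confident, ignores the disagreement
soft_prediction = [0.65, 0.30, 0.05]    # closer to the annotator split
print(jsd(target, sharp_prediction), jsd(target, soft_prediction))
```

A prediction can be correct under the majority (hard) label while still sitting far from the annotator distribution, which is why the two quantities are reported separately.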
Abstract
Annotator disagreement in emotion classification reflects ambiguity intrinsic to emotion concepts and is essential for predictor-quality assessment in subjective NLP. Yet no prior work integrates soft-label learning with Bayesian deep learning to evaluate uncertainty along axes including annotator-distribution fidelity. We train a linear head on a frozen RoBERTa via cyclical stochastic gradient Markov chain Monte Carlo (cSG-MCMC), targeting the empirical annotator distribution with a soft-label objective under a five-axis evaluation. On the 28-emotion GoEmotions benchmark, the proposed method outperforms Monte Carlo Dropout and Deep Ensemble simultaneously on three axes -- Jensen-Shannon divergence (JSD) to the annotator distribution, Spearman correlation between per-emotion aleatoric uncertainty and disagreement, and selective-prediction Area Under the Risk-Coverage Curve (AURC) and Area Under the ROC Curve (AUROC) -- showing independent axes are jointly attainable from one posterior. Post-hoc temperature scaling exhibits a bidirectional effect, establishing hard-label calibration and annotator-JSD as independent dimensions and motivating joint reporting as an honest protocol.
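Two of the quantities named in the abstract can be illustrated with a short sketch: post-hoc temperature scaling of logits, and the area under the risk-coverage curve for selective prediction. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the temperature is treated as a single scalar applied to the logits, and AURC is approximated as the mean selective risk over all coverage levels when examples are ranked by confidence.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / T; T > 1 flattens the distribution, T < 1 sharpens it."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def aurc(confidences, errors):
    """Mean selective risk when examples are accepted in order of decreasing confidence.

    errors[i] is 1 if prediction i is wrong under the hard label, else 0.
    Lower is better: confident predictions should be the correct ones.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    cumulative_errors = 0.0
    risks = []
    for kept, i in enumerate(order, start=1):
        cumulative_errors += errors[i]
        risks.append(cumulative_errors / kept)  # risk at coverage kept / n
    return sum(risks) / len(risks)

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 2.0))  # flatter than T = 1

# Well-ranked uncertainty: the two wrong predictions have the lowest confidence.
print(aurc([0.9, 0.8, 0.3, 0.2], [0, 0, 1, 1]))
```

Because a single temperature changes how peaked every predictive distribution is, it can move hard-label calibration and distance to the annotator distribution in different directions, consistent with the bidirectional effect the abstract describes.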