🤖 AI Summary
This work investigates the causes and implications of prediction disagreement arising from modality conflicts in multimodal empathy detection. To address the performance degradation that fusion models suffer when textual, acoustic, and visual signals are inconsistent, we propose treating model disagreement as a diagnostic indicator of semantic ambiguity. Our approach combines fine-tuned unimodal baselines with a gated multimodal fusion model, and pairs disagreement analysis with an evaluation of human annotator consistency. Key contributions include: (1) identification that a dominant modality lacking cross-modal support can mislead fusion decisions; (2) empirical evidence that model disagreement correlates strongly with annotator uncertainty, enabling effective detection of samples on which the system is fragile; and (3) demonstration that humans do not consistently benefit from multimodal inputs, underscoring the inherent ambiguity of emotional expression. These findings establish a paradigm for robust empathy modeling and uncertainty-aware inference.
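To make the fusion component concrete, the following is a minimal sketch of a gated multimodal fusion model in PyTorch. It is not the paper's released implementation: the embedding dimensions, layer sizes, and module names are assumptions chosen for illustration, and the gate simply learns per-sample weights over the three modality embeddings.

```python
# Minimal sketch of gated multimodal fusion (illustrative, not the authors' code).
# Each modality embedding is projected to a shared space, a gate network assigns
# per-sample weights to the three modalities, and the weighted sum is classified.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, video_dim=512,
                 hidden_dim=256, num_classes=2):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.proj_text = nn.Linear(text_dim, hidden_dim)
        self.proj_audio = nn.Linear(audio_dim, hidden_dim)
        self.proj_video = nn.Linear(video_dim, hidden_dim)
        # Gate network: scores how much each modality contributes per sample.
        self.gate = nn.Linear(hidden_dim * 3, 3)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_emb, audio_emb, video_emb):
        h_t = torch.tanh(self.proj_text(text_emb))
        h_a = torch.tanh(self.proj_audio(audio_emb))
        h_v = torch.tanh(self.proj_video(video_emb))
        # Per-sample gate weights over the three modalities (softmax sums to 1).
        weights = torch.softmax(
            self.gate(torch.cat([h_t, h_a, h_v], dim=-1)), dim=-1)
        fused = (weights[:, 0:1] * h_t
                 + weights[:, 1:2] * h_a
                 + weights[:, 2:3] * h_v)
        # Returning the weights exposes which modality dominated each decision,
        # which is useful for the kind of conflict analysis described above.
        return self.classifier(fused), weights
```

Inspecting the returned gate weights alongside the unimodal predictions is one way to check whether a dominant modality is driving the fusion output without cross-modal support.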
📝 Abstract
Multimodal models play a key role in empathy detection, but their performance can suffer when modalities provide conflicting cues. To understand these failures, we examine cases where unimodal and multimodal predictions diverge. Using fine-tuned models for text, audio, and video, along with a gated fusion model, we find that such disagreements often reflect underlying ambiguity, as evidenced by annotator uncertainty. Our analysis shows that dominant signals in one modality can mislead fusion when unsupported by others. We also observe that humans, like models, do not consistently benefit from multimodal input. These insights position disagreement as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.
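As a rough illustration of the disagreement diagnostic, the sketch below flags samples where any unimodal prediction diverges from the fusion prediction and computes a simple per-sample annotator uncertainty score for comparison. The function names, the majority-vote uncertainty measure, and the toy data are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch: use unimodal/fusion disagreement as a diagnostic signal
# and compare flagged samples against annotator uncertainty (assumed measures).
import numpy as np

def disagreement_flags(text_pred, audio_pred, video_pred, fusion_pred):
    """True where any unimodal prediction differs from the fusion prediction."""
    unimodal = np.stack([text_pred, audio_pred, video_pred], axis=1)
    return np.any(unimodal != fusion_pred[:, None], axis=1)

def annotator_uncertainty(labels_per_annotator):
    """Fraction of annotators deviating from the per-sample majority label."""
    labels = np.asarray(labels_per_annotator)      # shape: (n_samples, n_annotators)
    majority = np.round(labels.mean(axis=1))
    return (labels != majority[:, None]).mean(axis=1)

# Toy example: 4 samples, binary empathy labels, 3 annotators.
flags = disagreement_flags(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1]),
                           np.array([1, 0, 1, 0]), np.array([1, 0, 1, 1]))
unc = annotator_uncertainty([[1, 1, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]])
print(flags)  # which samples show model disagreement
print(unc)    # per-sample annotator uncertainty
```

Under the paper's finding that disagreement correlates with annotator uncertainty, samples flagged by such a mask would be natural candidates for manual review or uncertainty-aware handling.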