State-Anchored Complete-View Distillation for Robust Conversational Multimodal Emotion Recognition

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses unstable conversational multimodal emotion recognition when language, acoustic, or visual inputs are missing or unreliable. Noting that reconstructing absent modalities can be non-unique in dialogue context and that nonverbal cues may conflict with the target utterance, the authors propose CoRe-KD, a framework in which a complete-view teacher supplies structured references (prediction-level references, fused states, and modality-specific states) to a student that sees incomplete views. Complete-view State Anchoring (CSA) aligns the student's predictions and internal states with these references, and Nonverbal Conflict Exposure (NCE) trains on target-preserving nonverbal conflict views to reduce donor-label bias. Experiments on IEMOCAP and MELD, with CMU-MOSEI as a supplementary utterance-level check, report consistent gains under both fixed- and random-missing protocols, and ablation studies support the role of CSA and the complementary effect of NCE.
📝 Abstract
Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be non-unique in dialogue context, and nonverbal cues may conflict with the target utterance. To this end, we propose CoRe-KD (Complete-view Reference-guided Knowledge Distillation), a state-anchored, conflict-regularized complete-view distillation framework for robust conversational MER. A complete-view teacher provides structured references, including prediction-level references, fused states, and modality-specific states. Complete-view State Anchoring (CSA) aligns incomplete-view student predictions and states with these references, while Nonverbal Conflict Exposure (NCE) trains on target-preserving nonverbal conflict views to reduce donor-label bias. Experiments on IEMOCAP and MELD, with CMU-MOSEI as a supplementary utterance-level check, show consistent gains under fixed- and random-missing protocols. Comprehensive ablation studies and further analyses support the role of CSA and the complementary effect of NCE.
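The anchoring idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the loss form (temperature-softened KL on predictions plus mean squared error on fused and modality-specific states), the equal default weights, the zero-filling of missing modalities, and every name below (`anchoring_loss`, `drop_modalities`, the dictionary keys) are assumptions chosen only to show how a complete-view teacher could serve as a reference for an incomplete-view student.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a distillation temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a, b):
    """Mean squared error between two equal-length state vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def drop_modalities(features, keep_prob=0.5, rng=random):
    """Hypothetical random-missing view: zero out each modality with
    probability 1 - keep_prob, always keeping at least one modality."""
    names = list(features)
    kept = [n for n in names if rng.random() < keep_prob]
    if not kept:
        kept = [rng.choice(names)]
    view = {
        n: (list(v) if n in kept else [0.0] * len(v))
        for n, v in features.items()
    }
    return view, kept


def anchoring_loss(teacher, student, temperature=2.0,
                   w_pred=1.0, w_fused=1.0, w_modal=1.0):
    """Assumed multi-level reference loss: prediction-level KL plus
    state-level MSE on fused and modality-specific states."""
    p_teacher = softmax(teacher["logits"], temperature)
    p_student = softmax(student["logits"], temperature)
    pred_term = kl_divergence(p_teacher, p_student)
    fused_term = mse(teacher["fused"], student["fused"])
    modal_terms = [
        mse(teacher["modal"][m], student["modal"][m]) for m in teacher["modal"]
    ]
    modal_term = sum(modal_terms) / len(modal_terms)
    return w_pred * pred_term + w_fused * fused_term + w_modal * modal_term
```

In this sketch the teacher dictionary would come from a forward pass on all modalities and the student dictionary from a forward pass on the output of `drop_modalities`; the loss is zero only when the student matches the teacher at every level. The conflict-view training of NCE is not modeled here.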
Problem

Research questions and friction points this paper is trying to address.

Conversational Multimodal Emotion Recognition
Missing Modality
Robustness
Nonverbal Conflict
Emotion Prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Distillation
Multimodal Emotion Recognition
Missing Modality
State Anchoring
Nonverbal Conflict
Zhaoyan Pan
School of Software Technology, Zhejiang University
Xiangdong Li
School of Software Technology, Zhejiang University
Wenke Wu
School of Software Technology, Zhejiang University
Mengting Ma
School of Computer Science and Technology, Zhejiang University
Ye Lou
School of Software Technology, Zhejiang University
Ji Zhou
School of Software Technology, Zhejiang University
Jiatong Pan
School of Software Technology, Zhejiang University
Wei Zhang
Zhejiang University
digital humanities, data visualization