State-Anchored Complete-View Distillation for Robust Conversational Multimodal Emotion Recognition

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses unstable multimodal emotion recognition in conversation when the language, acoustic, or visual modality is missing or unreliable. The authors propose CoRe-KD (Complete-view Reference-guided Knowledge Distillation), in which a complete-view teacher supplies multi-level references (prediction-level references, fused states, and modality-specific states) that guide an incomplete-view student. Motivated by the observation that reconstructing missing inputs can be non-unique in dialogue context and that nonverbal cues may conflict with the target utterance, the framework combines Complete-view State Anchoring (CSA), which aligns the student's predictions and states with the teacher's references, with Nonverbal Conflict Exposure (NCE), which trains on target-preserving nonverbal conflict views to reduce donor-label bias. Experiments on IEMOCAP and MELD, with CMU-MOSEI as a supplementary utterance-level check, show consistent gains under both fixed- and random-missing protocols, and ablation studies support the role of CSA and the complementary effect of NCE.
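To make the two evaluation settings concrete, the following is a minimal sketch of how fixed and random missing-modality masks might be generated. The function names, the per-modality independent drop rule, the keep-at-least-one rule, and zero-filling of missing features are all illustrative assumptions; the summary does not specify how the protocols are implemented.

```python
import random

MODALITIES = ("language", "acoustic", "visual")


def fixed_missing_mask(available):
    """Fixed protocol (assumed form): the same modalities are present for every utterance."""
    return {m: (m in available) for m in MODALITIES}


def random_missing_mask(missing_rate, rng):
    """Random protocol (assumed form): drop each modality independently,
    but always keep at least one so the input is never empty."""
    mask = {m: rng.random() >= missing_rate for m in MODALITIES}
    if not any(mask.values()):
        mask[rng.choice(MODALITIES)] = True
    return mask


def apply_mask(features, mask):
    """Replace the feature vector of each missing modality with zeros (an assumption)."""
    return {m: (vec if mask[m] else [0.0] * len(vec)) for m, vec in features.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    feats = {"language": [0.2, 0.5], "acoustic": [1.0, -1.0], "visual": [0.3, 0.3]}
    print(apply_mask(feats, fixed_missing_mask({"language"})))
    print(apply_mask(feats, random_missing_mask(0.5, rng)))
```

In a teacher-student setup of this kind, the teacher would see the unmasked features while the student sees the masked ones.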

📝 Abstract

Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be non-unique in dialogue context, and nonverbal cues may conflict with the target utterance. To this end, we propose CoRe-KD (Complete-view Reference-guided Knowledge Distillation), a state-anchored, conflict-regularized complete-view distillation framework for robust conversational MER. A complete-view teacher provides structured references, including prediction-level references, fused states, and modality-specific states. Complete-view State Anchoring (CSA) aligns incomplete-view student predictions and states with these references, while Nonverbal Conflict Exposure (NCE) trains on target-preserving nonverbal conflict views to reduce donor-label bias. Experiments on IEMOCAP and MELD, with CMU-MOSEI as a supplementary utterance-level check, show consistent gains under fixed- and random-missing protocols. Comprehensive ablation studies and further analyses support the role of CSA and the complementary effect of NCE.
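As an illustration of what aligning student predictions and states with complete-view references could look like, here is a minimal sketch of a three-term anchoring loss. It uses a generic knowledge-distillation form (temperature-softened KL divergence on predictions plus mean-squared error on fused and modality-specific states). The loss form, the temperature, the weights `w_pred`, `w_fused`, `w_modal`, and the dictionary layout are assumptions for illustration and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a, b):
    """Mean-squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def anchoring_loss(teacher, student, temperature=2.0,
                   w_pred=1.0, w_fused=1.0, w_modal=1.0):
    """Hypothetical anchoring loss: prediction, fused-state, and modality-state terms."""
    pred = kl_divergence(softmax(teacher["logits"], temperature),
                         softmax(student["logits"], temperature))
    fused = mse(teacher["fused"], student["fused"])
    # Average the state mismatch over every modality the student exposes (an assumption).
    names = list(student["modal"])
    modal = sum(mse(teacher["modal"][m], student["modal"][m]) for m in names) / len(names)
    return w_pred * pred + w_fused * fused + w_modal * modal


if __name__ == "__main__":
    teacher = {"logits": [2.0, 0.5, -1.0], "fused": [0.1, 0.9],
               "modal": {"language": [0.2, 0.4], "acoustic": [0.7, 0.1]}}
    student = {"logits": [1.0, 1.0, -0.5], "fused": [0.0, 0.5],
               "modal": {"language": [0.2, 0.3], "acoustic": [0.0, 0.0]}}
    print(anchoring_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's logits and states exactly, and it grows as any of the three levels drifts from its reference.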

Problem

Research questions and friction points this paper is trying to address.

Conversational Multimodal Emotion Recognition

Missing Modality

Robustness

Nonverbal Conflict

Emotion Prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Distillation

Multimodal Emotion Recognition

Missing Modality