🤖 AI Summary
Multimodal Emotion Recognition in Conversations (MERC) degrades significantly when modalities go missing unpredictably, and existing recovery methods often introduce semantic distortion, especially under extreme missing patterns such as fixed-modality absence.
Method: This paper pioneers the integration of federated learning into missing-modality recovery, proposing the Federated Dialogue-guided and Semantic-Consistent Diffusion (FedDISC) framework. It models contextual and speaker dependencies with a Dialogue Graph Network and uses a semantic-conditioned diffusion model for decentralized, cross-client modality generation. An Alternating Frozen Aggregation strategy, which cyclically freezes the recovery and classifier modules, is introduced to stabilize collaborative training.
Contribution/Results: The framework achieves state-of-the-art performance on IEMOCAP, CMU-MOSI, and CMU-MOSEI across diverse missing patterns. It enables high-fidelity modality reconstruction and semantically consistent multimodal fusion while preserving data privacy through decentralized learning.
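The difference between stochastic missing and fixed-modality absence can be illustrated with a minimal Python sketch. The modality set, the per-utterance mask format, and the function names below are assumptions made for illustration; they are not taken from the paper's experimental protocol.

```python
import random

# Assumed modality set; the three names are illustrative.
MODALITIES = ("text", "audio", "visual")

def random_missing_mask(missing_rate, rng):
    """Stochastic missing: drop each modality independently with
    probability `missing_rate`, but always keep at least one."""
    mask = {m: rng.random() >= missing_rate for m in MODALITIES}
    if not any(mask.values()):
        mask[rng.choice(MODALITIES)] = True
    return mask

def fixed_missing_mask(absent):
    """Fixed-modality absence: the modalities in `absent` are never
    observed, so this client has no complete samples to train on."""
    return {m: m not in absent for m in MODALITIES}

rng = random.Random(0)
stochastic_masks = [random_missing_mask(0.5, rng) for _ in range(4)]
fixed_masks = [fixed_missing_mask({"audio"}) for _ in range(4)]
```

Under the fixed pattern every mask on the client is identical, so a recovery model trained only on that client's data would never see the absent modality.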
📝 Abstract
Multimodal Emotion Recognition in Conversations (MERC) enhances emotional understanding through the fusion of multimodal signals. However, unpredictable modality absence in real-world scenarios significantly degrades the performance of existing methods. Conventional missing-modality recovery approaches, which depend on training with complete multimodal data, often suffer from semantic distortion under extreme data distributions, such as fixed-modality absence. To address this, we propose the Federated Dialogue-guided and Semantic-Consistent Diffusion (FedDISC) framework, pioneering the integration of federated learning into missing-modality recovery. By aggregating modality-specific diffusion models trained on clients and broadcasting them to clients that lack the corresponding modalities, FedDISC removes the reliance of any single client on modality completeness. Additionally, the DISC-Diffusion module ensures consistency in context, speaker identity, and semantics between recovered and available modalities, using a Dialogue Graph Network to capture conversational dependencies and a Semantic Conditioning Network to enforce semantic alignment. We further introduce a novel Alternating Frozen Aggregation strategy, which cyclically freezes the recovery and classifier modules to facilitate collaborative optimization. Extensive experiments on the IEMOCAP, CMU-MOSI, and CMU-MOSEI datasets demonstrate that FedDISC achieves superior emotion classification performance across diverse missing-modality patterns, outperforming existing approaches.
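The idea of cyclically freezing the recovery and classifier modules during federated aggregation can be sketched as follows. This is a toy illustration, not the authors' implementation. The even/odd round schedule, the sample-count-weighted averaging, the `recovery.` and `classifier.` parameter prefixes, and the stand-in `local_update` are all assumptions. The modality-specific broadcasting described above is not modeled.

```python
def fedavg(states, weights):
    """Weighted average of parameter dicts (name -> list of floats)."""
    total = sum(weights)
    return {
        name: [
            sum(w * s[name][i] for s, w in zip(states, weights)) / total
            for i in range(len(states[0][name]))
        ]
        for name in states[0]
    }

def local_update(params, trainable_prefix, step):
    """Hypothetical stand-in for local training: only parameters whose
    name starts with `trainable_prefix` move; the rest stay frozen."""
    return {
        name: [v + step for v in vals] if name.startswith(trainable_prefix) else list(vals)
        for name, vals in params.items()
    }

def run_rounds(global_params, clients, num_rounds):
    for r in range(num_rounds):
        # Assumed schedule: even rounds train the recovery module,
        # odd rounds train the classifier.
        trainable = "recovery." if r % 2 == 0 else "classifier."
        updates = [local_update(global_params, trainable, c["step"]) for c in clients]
        sizes = [c["num_samples"] for c in clients]
        averaged = fedavg(updates, sizes)
        # Only the currently trainable group is replaced by the aggregate.
        global_params = {
            name: averaged[name] if name.startswith(trainable) else vals
            for name, vals in global_params.items()
        }
    return global_params

clients = [{"num_samples": 10, "step": 0.1}, {"num_samples": 30, "step": 0.3}]
initial = {"recovery.w": [0.0, 0.0], "classifier.w": [1.0]}
after_one = run_rounds(initial, clients, 1)
after_two = run_rounds(initial, clients, 2)
```

After the first round only `recovery.w` has changed. The classifier weights move only in the second round, while the aggregated recovery weights are held fixed.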