🤖 AI Summary
Emotion Recognition in Conversation (ERC) faces dual challenges: heterogeneous modality contributions and complex frame-level cross-modal alignment. To address these, we propose an utterance-level multimodal fusion framework. First, prompt learning enhances the semantic representation capability of the textual modality. Second, a knowledge distillation mechanism is introduced to improve the discriminative power of weaker modalities (e.g., audio and visual). Third, an anchor-gated Transformer is designed to dynamically weight and fuse cross-modal features at the utterance level, eliminating redundant frame-level alignment. Our approach explicitly balances modality specificity and complementarity. Experiments on IEMOCAP and MELD demonstrate state-of-the-art performance, validating the effectiveness of synergistically combining prompt learning and knowledge distillation for multimodal representation learning. Moreover, the utterance-level design significantly improves computational efficiency and generalization capability.
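The summary does not spell out the distillation objective. As a minimal sketch, assuming standard temperature-softened distillation (where a strong modality, e.g. text, acts as teacher for a weaker one such as audio), the loss could look like the following; the function names and the choice of temperature are illustrative, not the paper's:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from softened teacher to softened student distributions.

    Hypothetical sketch: the paper may use a different divergence or
    match intermediate features rather than logits.
    """
    p = softmax(teacher_logits / temperature)  # soft teacher targets
    q = softmax(student_logits / temperature)  # soft student predictions
    # The T^2 factor keeps gradient magnitudes comparable to a hard-label loss.
    return temperature ** 2 * np.sum(p * (np.log(p) - np.log(q)))
```

The loss is zero when the weaker modality's classifier reproduces the teacher's distribution, and positive otherwise, so minimizing it pulls the weaker modality's decision boundary toward the stronger one's.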
📝 Abstract
Emotion Recognition in Conversation (ERC) aims to detect the emotion of each utterance within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted with different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and show that MAGTKD achieves state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.
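The abstract describes gated fusion of utterance-level vectors with one modality acting as an anchor, but gives no equations. A minimal sketch of one plausible gating step follows; the anchor choice (text), the gate parameterization (`W_a`, `W_v`), and the additive combination are all assumptions for illustration, not the architecture from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def anchor_gated_fusion(text, audio, visual, W_a, W_v):
    """Fuse three utterance-level vectors, using text as the anchor.

    Hypothetical parameterization: each gate sees the anchor together with
    one weaker modality and decides, per dimension, how much of that
    modality to mix into the anchor representation.
    """
    g_a = sigmoid(np.concatenate([text, audio]) @ W_a)   # audio gate in (0, 1)
    g_v = sigmoid(np.concatenate([text, visual]) @ W_v)  # visual gate in (0, 1)
    return text + g_a * audio + g_v * visual

# Toy usage with utterance embeddings of dimension d = 4
rng = np.random.default_rng(0)
d = 4
text, audio, visual = rng.normal(size=(3, d))
W_a = rng.normal(size=(2 * d, d))
W_v = rng.normal(size=(2 * d, d))
fused = anchor_gated_fusion(text, audio, visual, W_a, W_v)
```

Because the gates operate on whole utterance vectors rather than frame sequences, no frame-level alignment between modalities is needed, which is the efficiency argument the abstract makes.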