Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Emotion Recognition in Conversation (ERC) faces dual challenges: heterogeneous modality contributions and complex frame-level cross-modal alignment. To address these, we propose an utterance-level multimodal fusion framework. First, prompt learning enhances the semantic representation capability of the textual modality. Second, a knowledge distillation mechanism is introduced to improve the discriminative power of weaker modalities (e.g., audio and visual). Third, an anchor-gated Transformer is designed to dynamically weight and fuse cross-modal features at the utterance level, eliminating redundant frame-level alignment. Our approach explicitly balances modality specificity and complementarity. Experiments on IEMOCAP and MELD demonstrate state-of-the-art performance, validating the effectiveness of synergistically combining prompt learning and knowledge distillation for multimodal representation learning. Moreover, the utterance-level design significantly improves computational efficiency and generalization capability.
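The two core mechanisms in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the temperature-scaled KL distillation loss is the standard formulation, and the sigmoid gate over a concatenated anchor/non-anchor pair is one plausible reading of "anchor-gated" fusion; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    # Standard knowledge-distillation loss: KL divergence between the
    # temperature-softened teacher (e.g., text) and student (e.g., audio/visual)
    # distributions, scaled by T^2. T=2.0 is an assumed hyperparameter.
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    return float((p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean() * T * T)

def anchor_gated_fusion(anchor, other, W_g, b_g):
    # One possible utterance-level gating scheme: a sigmoid gate computed from
    # the concatenated anchor and non-anchor utterance vectors decides how much
    # of the non-anchor modality to add to the anchor representation.
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([anchor, other]) @ W_g + b_g)))
    return anchor + g * other
```

Because fusion operates on one vector per utterance rather than per frame, the gate is a single small matrix multiply per modality pair, which is where the claimed efficiency gain over frame-level alignment would come from.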

📝 Abstract
Emotion Recognition in Conversation (ERC) aims to detect the emotions of individual utterances within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted using different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and achieve state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.
Problem

Research questions and friction points this paper is trying to address.

Detect emotions in conversation utterances effectively
Integrate multi-modal features with varying contributions
Reduce complexity in modality alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal Anchor Gated Transformer for integration
Knowledge Distillation to enhance weaker modalities
Prompt learning improves textual modality representations
Authors
Jie Li — School of Computer Science and Technology, China University of Mining and Technology; Mine Digitization Engineering Research Center of Ministry of Education, China University of Mining and Technology
Shifei Ding — China University of Mining and Technology
Lili Guo — China University of Mining and Technology
Xuan Li — School of Computer Science and Technology, China University of Mining and Technology; Mine Digitization Engineering Research Center of Ministry of Education, China University of Mining and Technology