🤖 AI Summary
Emotion Recognition in Conversation (ERC) faces dual challenges: heterogeneous modality contributions and complex frame-level cross-modal alignment. To address these, we propose an utterance-level multimodal fusion framework. First, prompt learning enhances the semantic representation capability of the textual modality. Second, a knowledge distillation mechanism is introduced to improve the discriminative power of weaker modalities (e.g., audio and visual). Third, an anchor-gated Transformer is designed to dynamically weight and fuse cross-modal features at the utterance level, eliminating redundant frame-level alignment. Our approach explicitly balances modality specificity and complementarity. Experiments on IEMOCAP and MELD demonstrate state-of-the-art performance, validating the effectiveness of synergistically combining prompt learning and knowledge distillation for multimodal representation learning. Moreover, the utterance-level design significantly improves computational efficiency and generalization capability.
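The summary does not spell out the distillation objective. As a minimal sketch, assuming standard temperature-softened distillation (where a strong modality, e.g. text, acts as teacher for a weaker one such as audio), the loss could look like the following; the function names and the choice of temperature are illustrative, not the paper's:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from softened teacher to softened student distributions.

    Hypothetical sketch: the paper may use a different divergence or
    match intermediate features rather than logits.
    """
    p = softmax(teacher_logits / temperature)  # soft teacher targets
    q = softmax(student_logits / temperature)  # soft student predictions
    # The T^2 factor keeps gradient magnitudes comparable to a hard-label loss.
    return temperature ** 2 * np.sum(p * (np.log(p) - np.log(q)))
```

The loss is zero when the weaker modality's classifier reproduces the teacher's distribution, and positive otherwise, so minimizing it pulls the weaker modality's decision boundary toward the stronger one's.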
📝 Abstract
Emotion Recognition in Conversation (ERC) aims to detect the emotion of each utterance within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted with different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and show that MAGTKD achieves state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.
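The abstract describes gated fusion of utterance-level vectors with one modality acting as an anchor, but gives no equations. A minimal sketch of one plausible gating step follows; the anchor choice (text), the gate parameterization (`W_a`, `W_v`), and the additive combination are all assumptions for illustration, not the architecture from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def anchor_gated_fusion(text, audio, visual, W_a, W_v):
    """Fuse three utterance-level vectors, using text as the anchor.

    Hypothetical parameterization: each gate sees the anchor together with
    one weaker modality and decides, per dimension, how much of that
    modality to mix into the anchor representation.
    """
    g_a = sigmoid(np.concatenate([text, audio]) @ W_a)   # audio gate in (0, 1)
    g_v = sigmoid(np.concatenate([text, visual]) @ W_v)  # visual gate in (0, 1)
    return text + g_a * audio + g_v * visual

# Toy usage with utterance embeddings of dimension d = 4
rng = np.random.default_rng(0)
d = 4
text, audio, visual = rng.normal(size=(3, d))
W_a = rng.normal(size=(2 * d, d))
W_v = rng.normal(size=(2 * d, d))
fused = anchor_gated_fusion(text, audio, visual, W_a, W_v)
```

Because the gates operate on whole utterance vectors rather than frame sequences, no frame-level alignment between modalities is needed, which is the efficiency argument the abstract makes.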