GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations

📅 2025-03-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing dialogue emotion recognition methods struggle to model the dynamic evolution of emotions and lack interpretability in multimodal feature alignment and emotion-change attribution. To address these limitations, we propose a multimodal modelling framework tailored to dynamic emotional evolution: (1) a Dialogical Emotion Decoder (DED) explicitly captures temporal emotion dynamics; (2) CLAP pretraining combined with a cross-modal gated xLSTM enables fine-grained audio-text feature alignment and focuses the model on key utterances; and (3) a psychology-inspired emotion attribution mechanism enhances interpretability. Evaluated on IEMOCAP, our approach achieves state-of-the-art four-class accuracy among open-source methods. This work is the first to unify interpretable decoding, cross-modal gated modelling, and emotion attribution analysis within the dialogue emotion evolution task, bridging representation learning, temporal modelling, and cognitive grounding in a single framework.
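To make the gated cross-modal modelling concrete, here is a minimal sketch of one plausible reading of this design: per-utterance CLAP audio and text embeddings are fused through a learned gate and passed to a recurrent backbone over the dialogue. Everything in it is an assumption for illustration, not a detail from the paper: the `GatedCrossModalFusion` name, the layer sizes, and the use of a plain `nn.LSTM` standing in for the paper's xLSTM block.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Gate CLAP audio/text embeddings per utterance, then run a recurrent pass."""

    def __init__(self, dim: int = 512, hidden: int = 256, num_classes: int = 4):
        super().__init__()
        # The gate scores how informative each utterance's audio-text pair is,
        # so emotionally impactful turns dominate the fused representation.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)
        # Standard LSTM as a stand-in for the paper's xLSTM backbone.
        self.rnn = nn.LSTM(dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, audio_emb, text_emb):
        # audio_emb, text_emb: (batch, num_utterances, dim) per-utterance embeddings
        fused = torch.cat([audio_emb, text_emb], dim=-1)
        gated = self.gate(fused) * self.proj(fused)   # element-wise gating
        out, _ = self.rnn(gated)                      # context over the dialogue
        return self.classifier(out)                   # per-utterance emotion logits

model = GatedCrossModalFusion()
logits = model(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
print(logits.shape)  # torch.Size([2, 10, 4])
```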

πŸ“ Abstract
Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual's expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverages multiple signals but traditionally relies on utterance-level analysis, overlooking the dynamic nature of emotions in conversations. Emotion Recognition in Conversation (ERC) addresses this limitation, yet existing methods struggle to align multimodal features and explain why emotions evolve within dialogues. To bridge this gap, we propose GatedxLSTM, a novel speech-text multimodal ERC model that explicitly considers voice and transcripts of both the speaker and their conversational partner(s) to identify the most influential sentences driving emotional shifts. By integrating Contrastive Language-Audio Pretraining (CLAP) for improved cross-modal alignment and employing a gating mechanism to emphasise emotionally impactful utterances, GatedxLSTM enhances both interpretability and performance. Additionally, the Dialogical Emotion Decoder (DED) refines emotion predictions by modelling contextual dependencies. Experiments on the IEMOCAP dataset demonstrate that GatedxLSTM achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification. These results validate its effectiveness for ERC applications and provide an interpretability analysis from a psychological perspective.
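The decoding step at the end of the abstract can be pictured as rescoring per-utterance predictions with a dialogue-level consistency prior. The real DED performs approximate inference over conversational context; the sketch below is a deliberately simplified stand-in (a Viterbi pass with a hand-set self-transition bonus), and the function name and `stay_bonus` parameter are illustrative assumptions.

```python
import numpy as np

def decode_dialogue(logits, stay_bonus=1.0):
    """Viterbi-style pass over (num_utterances, num_emotions) logits."""
    n, k = logits.shape
    trans = np.eye(k) * stay_bonus   # reward keeping the same emotion across turns
    score = logits[0].copy()
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + trans        # path score for (previous, current)
        back[i] = cand.argmax(axis=0)        # best previous emotion per current one
        score = cand.max(axis=0) + logits[i]
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):            # backtrack the best label sequence
        path.append(int(back[i, path[-1]]))
    return path[::-1]

labels = decode_dialogue(np.random.randn(6, 4))  # one label index per utterance
```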
Problem

Research questions and friction points this paper is trying to address.

Dynamic emotion recognition in multimodal conversations
Aligning speech-text features for emotion analysis
Interpreting emotional shifts in dialogue context
Innovation

Methods, ideas, or system contributions that make the work stand out.

GatedxLSTM fuses speech and text in a single gated multimodal model.
CLAP pretraining improves cross-modal audio-text alignment (a sketch of the objective follows this list).
The DED models contextual dependencies to refine per-utterance emotion predictions.
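As referenced in the list above, the CLAP objective aligns matched audio-text pairs with a symmetric contrastive (InfoNCE-style) loss: matched pairs are pulled together and mismatched pairs pushed apart. The sketch below shows that standard formulation; the batch construction and temperature value are placeholder choices, not the paper's training configuration.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # audio_emb, text_emb: (batch, dim); row i of each comes from the same utterance
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))        # diagonal entries are the positives
    # Symmetric cross-entropy: audio-to-text and text-to-audio directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = clap_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```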
Yupei Li
GLAM, Department of Computing, Imperial College London, UK
Qiyang Sun
Imperial College London
Sunil Munthumoduku Krishna Murthy
Technical University of Munich
Emran Alturki
PhD student, Imperial College London
Björn W. Schuller
GLAM, Department of Computing, Imperial College London, UK; CHI – Chair of Health Informatics, Technical University of Munich, Germany; relAI – the Konrad Zuse School of Excellence in Reliable AI, Munich, Germany; MDSI – Munich Data Science Institute, Munich, Germany; and MCML – Munich Center for Machine Learning, Munich, Germany
GLAM, Department of Computing, Imperial College London, UK; CHI – Chair of Health Informatics, Technical University of Munich, Germany; relAI – the Konrad Zuse School of Excellence in Reliable AI, Munich, Germany; MDSI – Munich Data Science Institute, Munich, Germany; and MCML – Munich Center for Machine Learning, Munich, Germany