ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations

📅 2026-04-28

🤖 AI Summary
This work addresses the challenge of modeling divergent emotional expressions across speakers in multi-turn dialogues, a limitation of existing emotion recognition methods. To this end, the authors propose a multi-level speaker-adaptive network that dynamically disentangles speaker identity from emotional expression without relying on explicit speaker identifiers. The approach integrates Feature-wise Linear Modulation (FiLM) at the input layer, a multimodal gating mechanism in the interaction layer, and latent space regularization at the output layer. This architecture enhances model robustness to expressive variability and achieves state-of-the-art performance on the MELD and IEMOCAP datasets, with particularly notable gains in recognizing tail-class emotions and in realistic multi-speaker scenarios.
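The input-layer FiLM step mentioned above can be illustrated with a minimal sketch. FiLM scales and shifts each feature channel with a gain (gamma) and an offset (beta) predicted from a conditioning vector. The summary does not say how ML-SAN obtains its speaker conditioning signal or how the gamma and beta predictors are built, so the `speaker` vector, the single affine predictors, and all names and values below are hypothetical.

```python
def linear(vec, weights, bias):
    """Affine map; `weights` holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: out[i] = gamma[i] * features[i] + beta[i].

    gamma and beta are predicted from the conditioning vector `cond`.
    Assumption: one affine layer per predictor; the paper's actual
    predictor architecture is not given in the source.
    """
    gamma = linear(cond, w_gamma, b_gamma)
    beta = linear(cond, w_beta, b_beta)
    return [g * x + b for g, x, b in zip(gamma, features, beta)]


feat_dim, cond_dim = 4, 3
audio = [0.5, -1.0, 2.0, 0.0]   # toy utterance-level audio feature
speaker = [0.1, 0.2, 0.3]       # hypothetical speaker conditioning vector
zeros = [[0.0] * cond_dim for _ in range(feat_dim)]

# Identity initialisation (gamma = 1, beta = 0) leaves the features untouched.
unchanged = film(audio, speaker, zeros, [1.0] * feat_dim, zeros, [0.0] * feat_dim)

# A non-trivial setting: constant gain of 2, offset taken from speaker[0].
w_beta = [[1.0, 0.0, 0.0] for _ in range(feat_dim)]
calibrated = film(audio, speaker, zeros, [2.0] * feat_dim, w_beta, [0.0] * feat_dim)
```

Starting from the identity setting is a common choice for FiLM layers, because the modulation then begins as a no-op and only departs from it as training requires.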
📝 Abstract
For machines to show empathy, they must understand how human emotions change. However, research in multimodal emotion recognition often overlooks one problem: expressive traits vary significantly between individuals, so different people may express the same emotion differently. We see this in daily life: some people express "happiness" through their facial expressions and words, while others hide their happiness or express it through their actions. Both are expressions of "happiness", but such differences in emotional expression remain difficult for machines to distinguish. Current emotion recognition remains at a "static" level, using a single recognition model for all expressive styles. This simplification often degrades recognition results, especially in multi-turn dialogues. To address this problem, this paper introduces a novel Multi-Level Speaker-Adaptive Network (ML-SAN) that addresses the confusion caused by speaker identity information. ML-SAN does not simply attach a speaker ID after recognition; instead, it employs a three-stage adaptive process. First, Input-level Calibration uses Feature-wise Linear Modulation (FiLM) to map the raw audio and visual features into a speaker-independent neutral space. Then, Interaction-level Gating re-adjusts the trust level for each modality (e.g., voice or facial features) based on the speaker's identity information. Finally, Output-level Regularization maintains the consistency of speaker features in the latent space. Experiments on the MELD and IEMOCAP datasets show that ML-SAN achieves better results than existing methods, performs particularly well on challenging tail emotion categories, and better handles speaker diversity in real-world scenarios.
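The gating and regularization stages described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the softmax gate, the centroid-based consistency loss, the grouping labels, and every name and value here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_modalities(modality_feats, speaker_emb, gate_weights):
    """Fuse modalities with speaker-conditioned trust weights.

    Assumption: each modality's trust score is a dot product between a
    per-modality weight vector and the speaker embedding, normalised
    with a softmax so the trust weights sum to 1.
    """
    names = sorted(modality_feats)
    scores = [sum(w * s for w, s in zip(gate_weights[n], speaker_emb))
              for n in names]
    trust = dict(zip(names, softmax(scores)))
    dim = len(modality_feats[names[0]])
    fused = [sum(trust[n] * modality_feats[n][i] for n in names)
             for i in range(dim)]
    return fused, trust


def speaker_consistency_loss(latents, groups):
    """Mean squared distance of each latent vector to its group centroid.

    Assumption: `groups` assigns each utterance to a speaker group; the
    source does not say how such groups are obtained.
    """
    by_group = {}
    for z, g in zip(latents, groups):
        by_group.setdefault(g, []).append(z)
    total, count = 0.0, 0
    for vecs in by_group.values():
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        for z in vecs:
            total += sum((a - c) ** 2 for a, c in zip(z, centroid))
            count += 1
    return total / count


feats = {"audio": [1.0, 0.0], "visual": [0.0, 1.0]}
speaker_emb = [1.0, 0.0]  # hypothetical speaker embedding

# Zero gate weights give every modality equal trust.
fused_equal, trust_equal = gate_modalities(
    feats, speaker_emb, {"audio": [0.0, 0.0], "visual": [0.0, 0.0]})

# A positive audio weight shifts trust toward the voice features.
fused_audio, trust_audio = gate_modalities(
    feats, speaker_emb, {"audio": [2.0, 0.0], "visual": [0.0, 0.0]})

loss = speaker_consistency_loss(
    [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]], ["A", "A", "B"])
```

In a trained model the gate weights and latent vectors would be learned, and a consistency term of this kind would typically be added to the classification loss with a weighting coefficient.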
Problem

Research questions and friction points this paper is trying to address.

emotion recognition in conversations
speaker variability
multimodal emotion recognition
individual expressive traits
speaker-adaptive modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

speaker-adaptive
multimodal emotion recognition
Feature-wise Linear Modulation (FiLM)
modality gating
tail emotion categories
Kexue Wang
Joint Research Laboratory for Embodied Intelligence, Xinjiang University; Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University; School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China
Yinfeng Yu
Associate Professor, Xinjiang University
Embodied intelligence
Liejun Wang
Joint Research Laboratory for Embodied Intelligence, Xinjiang University; Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University; School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China