Angle-Optimized Partial Disentanglement for Multimodal Emotion Recognition in Conversation

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Multimodal Emotion Recognition (MER) methods overly rely on cross-modal shared features while neglecting modality-specific cues (e.g., micro-expressions, prosody, irony). Conversely, orthogonal-decomposition strategies in MER—imposing strict orthogonality constraints—artificially sever the complementary relationship between shared and specific features. To address this, we propose an angle-optimized feature learning framework that achieves *partial disentanglement* via adaptive angular modeling, thereby preserving both uniqueness and synergy. Our approach jointly refines features through cross-modal alignment and orthogonal projection, seamlessly integrating textual, acoustic, and visual representations within a unified end-to-end architecture. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmark datasets, with strong generalization capability. The proposed paradigm offers enhanced robustness and interpretability in multimodal feature disentanglement for emotion recognition.

📝 Abstract
Multimodal Emotion Recognition in Conversation (MERC) aims to enhance emotion understanding by integrating complementary cues from text, audio, and visual modalities. Existing MERC approaches predominantly focus on cross-modal shared features, often overlooking modality-specific features that capture subtle yet critical emotional cues such as micro-expressions, prosodic variations, and sarcasm. Although related work in multimodal emotion recognition (MER) has explored disentangling shared and modality-specific features, these methods typically employ rigid orthogonal constraints to achieve full disentanglement, which neglects the inherent complementarity between feature types and may limit recognition performance. To address these challenges, we propose Angle-Optimized Feature Learning (AO-FL), a framework tailored for MERC that achieves partial disentanglement of shared and specific features within each modality through adaptive angular optimization. Specifically, AO-FL aligns shared features across modalities to ensure semantic consistency, and within each modality it adaptively models the angular relationship between its shared and modality-specific features to preserve both distinctiveness and complementarity. An orthogonal projection refinement further removes redundancy in specific features and enriches shared features with contextual information, yielding more discriminative multimodal representations. Extensive experiments confirm the effectiveness of AO-FL for MERC, demonstrating superior performance over state-of-the-art approaches. Moreover, AO-FL can be seamlessly integrated with various unimodal feature extractors and extended to other multimodal fusion tasks, such as MER, thereby highlighting its strong generalization beyond MERC.
Problem

Research questions and friction points this paper is trying to address.

Enhances emotion recognition by integrating text, audio, and visual cues.
Addresses neglect of subtle modality-specific features like micro-expressions and sarcasm.
Overcomes rigid disentanglement constraints to preserve feature complementarity.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partial disentanglement via adaptive angular optimization
Aligns shared features and models angular relationships adaptively
Orthogonal projection refines features and enriches shared context
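The two core operations named above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `target_cos` stands in for the adaptively learned angle between shared and modality-specific features (partial rather than full disentanglement would set it away from 0, i.e., away from strict 90° orthogonality), and `orthogonal_refine` shows the redundancy-removal step as a plain vector projection.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def angular_penalty(shared, specific, target_cos=0.3):
    """Partial-disentanglement loss sketch: instead of forcing the
    shared/specific angle to exactly 90 degrees (cosine = 0), penalize
    deviation from a target cosine, keeping some complementary overlap.
    `target_cos` is a hypothetical stand-in for the adaptive angle."""
    return (cosine(shared, specific) - target_cos) ** 2

def orthogonal_refine(shared, specific):
    """Remove the component of `specific` lying along `shared`
    (redundancy removal via orthogonal projection)."""
    proj = (specific @ shared) / (shared @ shared) * shared
    return specific - proj

rng = np.random.default_rng(0)
shared = rng.normal(size=8)     # toy stand-in for a shared feature
specific = rng.normal(size=8)   # toy stand-in for a specific feature

refined = orthogonal_refine(shared, specific)
# after projection, the refined specific feature is orthogonal to shared
print(round(abs(cosine(shared, refined)), 6))  # → 0.0
```

In the paper's framework these operations act on learned per-modality representations inside an end-to-end network; the sketch only makes the geometry of "partial vs. full disentanglement" concrete.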
Xinyi Che
School of Computer Science, Sichuan University, Chengdu 610065, China
Wenbo Wang
Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China
Yuanbo Hou
Machine Learning Group, Engineering Science, University of Oxford, U.K.
Mingjie Xie
Beihang University
Remote Sensing Image Processing, Computer Vision, Deep Learning
Qijun Zhao
Professor of Computer Science, Sichuan University
Biometrics, 3D Vision, Object Detection and Recognition, Face Recognition, Fingerprint Recognition
Jian Guan
Group of Intelligent Signal Processing (GISP), College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China