Bimodal Connection Attention Fusion for Speech Emotion Recognition

📅 2025-03-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal sentiment recognition faces challenges including difficulty in modeling dynamic inter-modal interactions, strong noise interference, and limited performance of static fusion strategies. To address these issues, this paper proposes the Bimodal Connected Attention Fusion (BCAF) framework, which enables deep audio–text interaction through a three-level collaborative mechanism: (1) an Interaction Connection Network explicitly captures cross-modal dynamic dependencies; (2) a Bimodal Attention Network enhances semantic complementarity; and (3) a Correlation Attention Network models cross-modal statistical correlations via covariance estimation while suppressing noise. Departing from conventional unidirectional or static fusion paradigms, BCAF achieves state-of-the-art accuracy on both MELD and IEMOCAP benchmarks—outperforming the best prior methods by 2.1% and 1.8%, respectively. These results empirically validate the effectiveness of dynamic connection modeling and correlation-aware fusion for multimodal sentiment analysis.

Technology Category

Application Category

📝 Abstract
Multi-modal emotion recognition is challenging due to the difficulty of extracting features that capture subtle emotional differences. Understanding multi-modal interactions and connections is key to building effective bimodal speech emotion recognition systems. In this work, we propose Bimodal Connection Attention Fusion (BCAF) method, which includes three main modules: the interactive connection network, the bimodal attention network, and the correlative attention network. The interactive connection network uses an encoder-decoder architecture to model modality connections between audio and text while leveraging modality-specific features. The bimodal attention network enhances semantic complementation and exploits intra- and inter-modal interactions. The correlative attention network reduces cross-modal noise and captures correlations between audio and text. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed BCAF method outperforms existing state-of-the-art baselines.
Problem

Research questions and friction points this paper is trying to address.

Extracting subtle emotional differences in multi-modal data
Understanding interactions between audio and text modalities
Reducing cross-modal noise for accurate emotion recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encoder-decoder architecture models audio-text connections
Bimodal attention enhances semantic and modal interactions
Correlative attention reduces noise and captures correlations
🔎 Similar Papers
No similar papers found.