Multi-Modal Sentiment Analysis with Dynamic Attention Fusion

πŸ“… 2025-09-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Traditional unimodal sentiment analysis relies solely on textual input, neglecting paralinguistic cues such as prosody and vocal timbre, thereby limiting accurate modeling of genuine affective intent. To address this, we propose the Dynamic Attention Fusion (DAF) frameworkβ€”a lightweight, encoder-agnostic multimodal fusion approach that adaptively weights frozen text embeddings (from pretrained language models) and acoustic features without fine-tuning the encoders. Its core innovation is a parameter-efficient dynamic attention mechanism that explicitly models inter-modal uncertainty and complementarity. Evaluated on mainstream multimodal benchmarks (e.g., CMU-MOSEI, IEMOCAP), DAF consistently outperforms static fusion strategies and unimodal baselines: it achieves an average 3.2% absolute improvement in weighted F1-score and reduces prediction error by 18.7%. Ablation studies confirm that dynamic weight adaptation is critical for robustness across noisy and modality-imbalanced scenarios.

Technology Category

Application Category

πŸ“ Abstract
Traditional sentiment analysis has long been a unimodal task, relying solely on text. This approach overlooks non-verbal cues such as vocal tone and prosody that are essential for capturing true emotional intent. We introduce Dynamic Attention Fusion (DAF), a lightweight framework that combines frozen text embeddings from a pretrained language model with acoustic features from a speech encoder, using an adaptive attention mechanism to weight each modality per utterance. Without any finetuning of the underlying encoders, our proposed DAF model consistently outperforms both static fusion and unimodal baselines on a large multimodal benchmark. We report notable gains in F1-score and reductions in prediction error and perform a variety of ablation studies that support our hypothesis that the dynamic weighting strategy is crucial for modeling emotionally complex inputs. By effectively integrating verbal and non-verbal information, our approach offers a more robust foundation for sentiment prediction and carries broader impact for affective computing applications -- from emotion recognition and mental health assessment to more natural human computer interaction.
Problem

Research questions and friction points this paper is trying to address.

Integrating verbal and non-verbal cues for sentiment analysis
Overcoming limitations of unimodal text-based emotion recognition
Developing dynamic fusion of acoustic and text features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Attention Fusion for multimodal sentiment analysis
Combines text embeddings with acoustic features adaptively
Uses adaptive attention mechanism per utterance without finetuning
πŸ”Ž Similar Papers
No similar papers found.
S
Sadia Abdulhalim
Data Science and Artificial Intelligence, College of Computing and IT, University of Doha for Science and Technology, Doha, Qatar
M
Muaz Albaghdadi
Data Science and Artificial Intelligence, College of Computing and IT, University of Doha for Science and Technology, Doha, Qatar
Moshiur Farazi
Moshiur Farazi
University of Doha for Science and Technology, Australian National University
Computer VisionVision-Language ModelsApplied AI