AlignMamba-2: Enhancing Multimodal Fusion and Sentiment Analysis with Modality-Aware Mamba

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes AlignMamba-2, a novel framework addressing key challenges in multimodal sentiment analysis—namely, cross-modal alignment difficulty, modality heterogeneity, and computational inefficiency. AlignMamba-2 integrates a dual-alignment mechanism based on optimal transport and maximum mean discrepancy to enhance cross-modal consistency. It further introduces a modality-aware Mixture-of-Experts architecture that effectively combines modality-specific and shared experts to better model heterogeneous data. Leveraging the efficient Mamba backbone, the proposed method achieves state-of-the-art performance on four benchmark datasets—CMU-MOSI, CMU-MOSEI, NYU-Depth V2, and MVSA-Single—demonstrating significant improvements over existing approaches in both accuracy and inference efficiency.
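The modality-aware Mixture-of-Experts idea summarized above can be sketched as a per-modality routing between a modality-specific expert and a shared expert. The toy below is a minimal NumPy illustration, not the paper's implementation: the class name, the dense two-way gate, and the random linear experts are all our own illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class ModalityAwareMoE:
    """Toy sketch: one expert per modality plus one shared expert;
    a learned gate mixes the two outputs per token."""

    def __init__(self, dim, modalities, seed=0):
        rng = np.random.default_rng(seed)
        # Modality-specific experts (here: plain linear maps).
        self.specific = {m: rng.normal(scale=0.02, size=(dim, dim)) for m in modalities}
        # One expert shared across all modalities.
        self.shared = rng.normal(scale=0.02, size=(dim, dim))
        # Per-modality gate producing 2 mixing weights per token.
        self.gate = {m: rng.normal(scale=0.02, size=(dim, 2)) for m in modalities}

    def __call__(self, x, modality):
        # x: (tokens, dim) features from a single modality.
        w = softmax(x @ self.gate[modality], axis=-1)  # (tokens, 2)
        out_specific = x @ self.specific[modality]     # modality-specific path
        out_shared = x @ self.shared                   # modality-shared path
        return w[:, :1] * out_specific + w[:, 1:] * out_shared
```

In this sketch the shared expert sees tokens from every modality during training (capturing common structure), while each specific expert only ever processes its own modality, which is one plausible way to handle heterogeneity during fusion.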

📝 Abstract
In the era of large-scale pre-trained models, effectively adapting general knowledge to specific affective computing tasks remains a challenge, particularly with respect to computational efficiency and multimodal heterogeneity. While Transformer-based methods excel at modeling inter-modal dependencies, their quadratic computational complexity limits their use on long sequences. Mamba-based models have emerged as a computationally efficient alternative; however, their inherently sequential scanning mechanism struggles to capture the global, non-sequential relationships that are crucial for effective cross-modal alignment. To address these limitations, we propose AlignMamba-2, an effective and efficient framework for multimodal fusion and sentiment analysis. Our approach introduces a dual alignment strategy that regularizes the model with both an Optimal Transport distance and Maximum Mean Discrepancy, promoting geometric and statistical consistency between modalities without incurring any inference-time overhead. In addition, we design a Modality-Aware Mamba layer, which employs a Mixture-of-Experts architecture with modality-specific and modality-shared experts to explicitly handle data heterogeneity during fusion. Extensive experiments on four challenging benchmarks, spanning dynamic time-series tasks (CMU-MOSI and CMU-MOSEI) and static image-text tasks (NYU-Depth V2 and MVSA-Single), demonstrate that AlignMamba-2 establishes a new state of the art in both effectiveness and efficiency.
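The dual alignment strategy in the abstract is a training-only regularizer combining an Optimal Transport distance with Maximum Mean Discrepancy. A minimal NumPy sketch of such a loss is shown below; the function names, the Sinkhorn solver, the RBF kernel choice, and the cost normalization are our own illustrative assumptions, not the paper's actual objective.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise RBF kernel matrix between rows of x and rows of y.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd(x, y, sigma=1.0):
    # Biased estimator of squared Maximum Mean Discrepancy
    # (statistical alignment: matches feature distributions).
    return (gaussian_kernel(x, x, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

def sinkhorn_ot(x, y, eps=0.1, iters=100):
    # Entropic-regularized OT cost between uniform empirical measures
    # (geometric alignment: matches sample-level structure).
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # squared-distance cost
    C = C / (C.max() + 1e-9)  # normalize cost to [0, 1] for numerical stability
    K = np.exp(-C / eps)
    u = np.ones(len(x)) / len(x)
    v = np.ones(len(y)) / len(y)
    a, b = u.copy(), v.copy()
    for _ in range(iters):  # Sinkhorn fixed-point iterations
        a = u / (K @ b)
        b = v / (K.T @ a)
    P = a[:, None] * K * b[None, :]  # transport plan
    return (P * C).sum()

def dual_alignment_loss(feat_a, feat_b, lam_ot=1.0, lam_mmd=1.0):
    # Applied only during training, so inference cost is unchanged.
    return lam_ot * sinkhorn_ot(feat_a, feat_b) + lam_mmd * mmd(feat_a, feat_b)
```

Because both terms are computed on intermediate features and dropped at test time, the regularizer adds no inference-time overhead, consistent with the abstract's claim.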
Problem

Research questions and friction points this paper is trying to address.

multimodal fusion
sentiment analysis
computational efficiency
cross-modal alignment
modality heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-Aware Mamba
Multimodal Fusion
Optimal Transport
Mixture-of-Experts
Efficient Sequence Modeling