🤖 AI Summary
In multimodal aspect-based sentiment analysis (MABSA), conventional attention mechanisms suffer from quadratic computational complexity, limiting global contextual modeling and hindering fine-grained cross-modal alignment. To address these challenges, we propose DualKanbaFormer, a dual-path architecture. Its key contributions are: (1) Aspect-Driven Sparse Attention (ADSA), which balances computational efficiency with semantic focus on aspect-related tokens; (2) a hybrid module integrating the Selective State Space Model (Mamba) for long-range dependency capture and Kolmogorov–Arnold Networks (KANs) for enhanced nonlinear representation learning; and (3) Dynamic Tanh (DyT), used in place of conventional normalization, coupled with a multimodal gated fusion mechanism to improve inference stability and cross-modal consistency. Evaluated on two benchmark MABSA datasets, DualKanbaFormer achieves new state-of-the-art performance in aspect-sentiment triplet extraction accuracy and modality alignment quality.
📝 Abstract
Multimodal Aspect-based Sentiment Analysis (MABSA) enhances sentiment detection by integrating textual data with complementary modalities, such as images, to provide a more refined and comprehensive understanding of sentiment. However, conventional attention mechanisms, despite achieving notable benchmark results, are hindered by quadratic complexity, which limits their ability to fully capture global contextual dependencies and rich semantic information in both modalities. To address this limitation, we introduce DualKanbaFormer, a novel framework that leverages parallel Textual and Visual KanbaFormer modules for robust multimodal analysis. Our approach incorporates Aspect-Driven Sparse Attention (ADSA) to dynamically balance coarse-grained aggregation and fine-grained selection for aspect-focused precision, preserving both global context awareness and local precision in textual and visual representations. Additionally, we employ the Selective State Space Model (Mamba) to capture extensive global semantic information across both modalities. Furthermore, we replace traditional feed-forward networks and normalization layers with Kolmogorov–Arnold Networks (KANs) and Dynamic Tanh (DyT) to enhance non-linear expressivity and inference stability. To integrate textual and visual features effectively, we design a multimodal gated fusion layer that dynamically optimizes inter-modality interactions, significantly enhancing the model's efficacy on MABSA tasks. Comprehensive experiments on two publicly available datasets show that DualKanbaFormer consistently outperforms several state-of-the-art (SOTA) models.
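Two of the abstract's components, Dynamic Tanh (DyT) as a normalization replacement and the multimodal gated fusion layer, follow well-known patterns that can be illustrated concretely. The sketch below is a minimal NumPy toy, not the paper's implementation: the function names, the per-channel DyT parameterization `gamma * tanh(alpha * x) + beta`, and the sigmoid-gated convex combination of textual and visual features are common formulations assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dyt(x, alpha, gamma, beta):
    """Dynamic Tanh (DyT): a normalization-free layer that squashes
    activations with tanh(alpha * x), then applies a learnable
    per-channel scale (gamma) and shift (beta)."""
    return gamma * np.tanh(alpha * x) + beta

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(t_feat, v_feat, W, b):
    """Gated multimodal fusion: a sigmoid gate computed from the
    concatenated features decides, per dimension, how much of the
    textual vs. visual representation to keep."""
    g = sigmoid(np.concatenate([t_feat, v_feat], axis=-1) @ W + b)
    return g * t_feat + (1.0 - g) * v_feat

d = 8
t_feat = rng.standard_normal(d)          # toy textual KanbaFormer output
v_feat = rng.standard_normal(d)          # toy visual KanbaFormer output
W = rng.standard_normal((2 * d, d)) * 0.1  # gate projection (hypothetical)
b = np.zeros(d)

fused = gated_fusion(dyt(t_feat, 0.5, 1.0, 0.0),
                     dyt(v_feat, 0.5, 1.0, 0.0), W, b)
print(fused.shape)  # (8,)
```

Because DyT (with `gamma=1`, `beta=0`) bounds each stream to (-1, 1) and the gate is a convex combination, the fused vector stays in (-1, 1), which is the kind of bounded-activation stability the abstract attributes to DyT.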