AI Summary
To address the challenges of fine-grained semantic alignment and the limited comprehension capacity of small language models (SLMs) in multimodal aspect-based sentiment analysis (MABSA), this paper proposes LRSA, a novel collaborative framework. LRSA pioneers the integration of interpretable, LLM-generated rationales into SLM decision-making, enabling transparent and grounded predictions without end-to-end LLM training. It introduces a dual cross-attention mechanism to enhance bidirectional image-text feature interaction and alignment, overcoming inherent limitations of pure fine-tuning or prompt engineering. The framework achieves a favorable trade-off between interpretability and computational efficiency. Evaluated on three mainstream MABSA benchmarks, LRSA consistently outperforms state-of-the-art methods, yielding average F1-score improvements of 2.3–4.1 percentage points. Moreover, it demonstrates strong generalizability across diverse pre-trained vision-language backbones, validating both its effectiveness and architectural versatility.
Abstract
There has been growing interest in Multimodal Aspect-Based Sentiment Analysis (MABSA) in recent years. Existing methods predominantly rely on pre-trained small language models (SLMs) to collect aspect- and sentiment-related information from both image and text, with the aim of aligning the two modalities. However, SLMs possess limited capacity and knowledge, often resulting in inaccurate identification of meaning, aspects, sentiments, and their interconnections in textual and visual data. Large language models (LLMs), on the other hand, have shown exceptional capabilities in various tasks by effectively exploring fine-grained information in multimodal data. Yet some studies indicate that LLMs still fall short of fine-tuned small models in ABSA. Motivated by these findings, we propose a novel framework, termed LRSA, which combines the decision-making capabilities of SLMs with additional information provided by LLMs for MABSA. Specifically, we inject explanations generated by LLMs as rationales into SLMs and employ a dual cross-attention mechanism to enhance feature interaction and fusion, thereby augmenting the SLMs' ability to identify aspects and sentiments. We evaluated our method on two baseline models; extensive experiments demonstrate the superiority of our approach on three widely used benchmarks, indicating its generalizability and applicability to most pre-trained models for MABSA.
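The dual cross-attention described above can be illustrated schematically. The sketch below is an assumption about the general pattern, not the paper's actual implementation: it runs scaled dot-product attention in both directions (text queries attending over image features, and image queries attending over text features), omitting the learned projection matrices, multi-head splitting, residual connections, and normalization a real model would use. All function names and dimensions here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # scaled dot-product attention: queries come from one modality,
    # keys and values from the other (projections omitted for brevity)
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores) @ keys_values            # (n_q, d)

def dual_cross_attention(text_feats, image_feats):
    # text-to-image: each text token attends over image regions
    text_enriched = cross_attention(text_feats, image_feats)
    # image-to-text: each image region attends over text tokens
    image_enriched = cross_attention(image_feats, text_feats)
    return text_enriched, image_enriched

# toy features: 6 text tokens and 4 image regions, shared dim 64
rng = np.random.default_rng(0)
text = rng.standard_normal((6, 64))
image = rng.standard_normal((4, 64))
t_out, i_out = dual_cross_attention(text, image)
print(t_out.shape, i_out.shape)  # (6, 64) (4, 64)
```

Each output keeps the sequence length of its query modality while mixing in information from the other, which is what allows the fused features to be fed back into the SLM for aspect and sentiment prediction.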