Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis

πŸ“… 2025-04-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address two limitations of Multimodal Aspect-Sentiment Classification (MASC), namely weak fine-grained visual understanding and the disconnection between semantic and affective cognition, this paper proposes Chimera, a framework built on dual causal modeling grounded in cognitive science and aesthetics. Chimera jointly models semantic content and affective-cognitive resonance via three core components: (1) patch-word alignment between visual patches and textual tokens; (2) hierarchical region extraction (coarse- and fine-grained) followed by textual description generation; and (3) LLM-driven reasoning over sentiment causality and aesthetic impressions. Trained end-to-end, Chimera achieves state-of-the-art performance on standard MASC benchmarks and generalizes better than general-purpose large vision-language models such as GPT-4o. The code and dataset are publicly released.
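The patch-word alignment component can be sketched as generic scaled dot-product cross-attention between word embeddings and visual patch embeddings. This is an illustrative sketch only: the function name, feature shapes, and the choice of plain softmax attention are assumptions, not the paper's exact formulation.

```python
import numpy as np

def patch_word_alignment(patch_feats: np.ndarray, word_feats: np.ndarray):
    """Align textual tokens to visual patches via scaled dot-product
    cross-attention (a generic sketch; Chimera's actual alignment
    module may differ).

    patch_feats: (num_patches, d) visual patch embeddings
    word_feats:  (num_words, d) textual token embeddings
    Returns (attention weights, patch-informed word representations).
    """
    d = word_feats.shape[-1]
    # Similarity of every word to every patch, scaled for stability.
    scores = word_feats @ patch_feats.T / np.sqrt(d)  # (num_words, num_patches)
    # Numerically stable softmax over the patch axis.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each word is re-expressed as a weighted mixture of patch features.
    aligned = weights @ patch_feats  # (num_words, d)
    return weights, aligned
```

Each row of `weights` is a distribution over patches, so aspect words can be traced back to the image regions that most influenced them.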

πŸ“ Abstract
Multimodal aspect-based sentiment classification (MASC), which predicts sentiment polarity toward specific aspect targets (i.e., entities or attributes explicitly mentioned in text-image pairs), has emerged alongside the growth of user-generated multimodal content on social platforms. Despite extensive efforts and significant achievements in existing MASC, substantial gaps remain in understanding fine-grained visual content and the cognitive rationales derived from semantic content and impressions (cognitive interpretations of the emotions evoked by image content). In this study, we present Chimera: a cognitive and aesthetic sentiment causality understanding framework that derives fine-grained holistic features of aspects and infers the fundamental drivers of sentiment expression from both semantic perspectives and affective-cognitive resonance (the synergistic effect between emotional responses and cognitive interpretations). Specifically, the framework first incorporates visual patch features for patch-word alignment. It then extracts coarse-grained visual features (e.g., the overall image representation) and fine-grained visual regions (e.g., aspect-related regions) and translates them into corresponding textual descriptions (e.g., facial, aesthetic). Finally, we leverage the sentimental causes and impressions generated by a large language model (LLM) to enhance the model's awareness of sentimental cues evoked by semantic content and affective-cognitive resonance. Experimental results on standard MASC datasets demonstrate the effectiveness of the proposed model, which also exhibits greater flexibility on MASC than LLMs such as GPT-4o. We have publicly released the complete implementation and dataset at https://github.com/Xillv/Chimera
Problem

Research questions and friction points this paper is trying to address.

Understanding fine-grained visual content in sentiment analysis
Deriving cognitive rationales from semantic content and emotional impressions
Inferring the drivers of sentiment from affective-cognitive resonance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual patch-word alignment for fine-grained features
Coarse and fine-grained visual feature extraction
LLM-enhanced sentimental cause and impression analysis
πŸ”Ž Similar Papers
No similar papers found.