🤖 AI Summary
This work addresses the limited interpretability and weak fine-grained localization of existing approaches to Multimodal Sarcasm Target Identification (MSTI). It proposes GRASP, a framework built on Grounded Chain-of-Thought reasoning that explicitly anchors sarcasm-related image regions within the reasoning trajectory, so that the model articulates its rationale before predicting the sarcasm label and the textual and visual targets. The core contributions are MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues; a coordinate-aware weighted loss for supervised fine-tuning; and a dual-stage optimization strategy that follows supervised fine-tuning with Fine-Grained Target Policy Optimization. Experiments show that GRASP outperforms existing baselines on fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantifies the quality of the generated reasoning chains.
📝 Abstract
Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions within the reasoning trajectory and prompts the model to articulate rationales before predicting the final classification labels and sarcasm targets. Furthermore, we employ a dual-stage outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization. Extensive experiments demonstrate that GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. Our dataset and source code will be released on GitHub.
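The abstract does not define the coordinate-aware weighted loss, so the following is only a minimal sketch of one plausible reading: a token-level negative log-likelihood in which tokens that encode bounding-box coordinates are up-weighted relative to ordinary text tokens. The names `coordinate_mask`, `weighted_token_nll`, and `coord_weight`, the `<box>` markup, and the rule that numeric tokens count as coordinates are all assumptions made for illustration, not details taken from GRASP.

```python
import math


def coordinate_mask(tokens):
    """Mark tokens that look like coordinates (assumption: purely numeric strings)."""
    return [tok.isdigit() for tok in tokens]


def weighted_token_nll(gold_log_probs, is_coord, coord_weight=2.0):
    """Weighted average negative log-likelihood over a target sequence.

    gold_log_probs: log-probability the model assigns to each gold token.
    is_coord:       True where the gold token is a coordinate token.
    coord_weight:   hypothetical weight for coordinate tokens (others get 1.0).
    """
    if len(gold_log_probs) != len(is_coord):
        raise ValueError("gold_log_probs and is_coord must have the same length")
    weights = [coord_weight if flag else 1.0 for flag in is_coord]
    total = sum(w * -lp for w, lp in zip(weights, gold_log_probs))
    # Normalize by the weight mass so the loss scale stays comparable
    # across sequences with different numbers of coordinate tokens.
    return total / sum(weights)


# Hypothetical target sequence: a short rationale followed by one box.
tokens = ["sarcasm", "<box>", "12", "40", "</box>"]
mask = coordinate_mask(tokens)
log_probs = [math.log(0.5)] * len(tokens)
loss = weighted_token_nll(log_probs, mask, coord_weight=3.0)
```

With `coord_weight=1.0` the function reduces to the ordinary mean token-level loss, which makes the effect of the extra weight easy to isolate in an ablation.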