GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

📅 2026-04-10

🤖 AI Summary
This work addresses the limited interpretability and weak fine-grained localization of existing approaches to multimodal sarcasm target identification. It proposes a grounded chain-of-thought reasoning mechanism that explicitly integrates visual grounding with step-by-step inference, anchoring sarcasm-related image regions in the reasoning trajectory before the model predicts labels and targets. The core contributions are MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues; a coordinate-aware weighted loss used during supervised fine-tuning; and a two-stage optimization strategy that follows fine-tuning with Fine-Grained Target Policy Optimization. Experiments show that the proposed model outperforms existing baselines on fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of the generated reasoning chains.

📝 Abstract
Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions within the reasoning trajectory and prompts the model to articulate rationales before predicting the final classification labels and sarcasm targets. Furthermore, we employ a dual-stage outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization. Extensive experiments demonstrate that GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. Our dataset and source code will be released on GitHub.
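
The abstract does not define the coordinate-aware weighted loss. One plausible reading, sketched here purely as an assumption and not as the authors' formulation, is a token-level negative log-likelihood in which tokens that encode bounding-box coordinates receive a larger weight than ordinary text tokens. The function name `coordinate_weighted_nll`, the `is_coordinate` mask, and the default `coord_weight` of 2.0 are all hypothetical.

```python
from typing import Sequence


def coordinate_weighted_nll(
    target_log_probs: Sequence[float],
    is_coordinate: Sequence[bool],
    coord_weight: float = 2.0,
) -> float:
    """Weighted mean negative log-likelihood over one target sequence.

    target_log_probs: log-probability the model assigned to each gold token.
    is_coordinate:    True where the gold token encodes a box coordinate
                      (hypothetical mask; how it is built is not in the source).
    coord_weight:     weight for coordinate tokens; other tokens get 1.0
                      (hypothetical value, chosen only for illustration).
    """
    if len(target_log_probs) != len(is_coordinate):
        raise ValueError("log-prob and mask sequences must have equal length")
    if not target_log_probs:
        raise ValueError("sequence must contain at least one token")

    # Per-token weights: coordinate tokens count more than text tokens.
    weights = [coord_weight if flag else 1.0 for flag in is_coordinate]

    # Weighted sum of per-token negative log-likelihoods.
    weighted_nll = sum(-lp * w for lp, w in zip(target_log_probs, weights))

    # Normalize by total weight so the scale stays comparable across sequences.
    return weighted_nll / sum(weights)


if __name__ == "__main__":
    log_probs = [-1.0, -3.0]   # one text token, one coordinate token
    mask = [False, True]
    print(coordinate_weighted_nll(log_probs, mask))        # coordinate error up-weighted
    print(coordinate_weighted_nll(log_probs, mask, 1.0))   # plain mean NLL for comparison
```

Setting `coord_weight` to 1.0 recovers the ordinary mean negative log-likelihood, so the weight acts as a single knob for how strongly localization errors are penalized relative to errors in the rationale text.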

Problem

Research questions and friction points this paper is trying to address.

Multimodal Sarcasm Target Identification
Fine-grained Localization
Cross-modal Alignment
Interpretability
Sarcasm Detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Grounded Chain-of-Thought
Multimodal Sarcasm Target Identification
Dual-Stage Optimization
Visual Grounding
Fine-Grained Localization