Target-Augmented Shared Fusion-based Multimodal Sarcasm Explanation Generation

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for multimodal sarcasm explanation generation neglect the explicit identification of sarcasm targets (e.g., persons, events), hindering deep interpretation of sarcastic intent. This paper introduces the first target-aware framework, proposing a target-guided shared cross-modal fusion mechanism: (1) a target-aware encoder that injects structured target priors into vision-language models (e.g., ViLBERT, ALPRO); and (2) shared-parameterized cross-modal attention to jointly align image, text, and target representations, thereby modeling the ternary relationship among “target–modality–sarcastic intent.” Evaluated on the MORE+ dataset, our approach achieves average improvements of 3.3% in BLEU, ROUGE, and F1 scores over prior work. Human evaluation confirms statistically significant superiority over state-of-the-art methods. Empirical analysis further demonstrates that large language models (LLMs) fail to capture such fine-grained sarcasm even under zero-shot or one-shot settings.
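The shared-parameterized cross-modal fusion described above can be illustrated with a minimal single-head sketch: one set of attention projections is reused for both attention directions (text attends to image, image attends to text), and the fused tokens then attend to the sarcasm-target tokens. All names, dimensions, and the single-head formulation below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, context, Wq, Wk, Wv):
    # Scaled dot-product cross-attention: `query` tokens attend over `context` tokens.
    Q, K, V = query @ Wq, context @ Wk, context @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

d = 16                            # hidden size (illustrative)
text   = rng.normal(size=(5, d))  # 5 caption-token embeddings
image  = rng.normal(size=(7, d))  # 7 image-region embeddings
target = rng.normal(size=(2, d))  # 2 sarcasm-target-token embeddings

# Shared parameterization: ONE set of projections serves both attention
# directions, so text->image and image->text alignments share a subspace.
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

text_ctx  = cross_attend(text,  image, Wq, Wk, Wv)  # text attends to image
image_ctx = cross_attend(image, text,  Wq, Wk, Wv)  # image attends to text

# Target guidance: the fused multimodal tokens attend to the target prior,
# injecting "who/what is being ridiculed" into the representation.
fused  = np.concatenate([text_ctx, image_ctx], axis=0)
guided = cross_attend(fused, target, Wq, Wk, Wv)

print(guided.shape)  # (12, 16): 5 text + 7 image tokens, target-guided
```

In the actual model these guided representations would condition the explanation decoder; here the point is only that parameter sharing ties the two alignment directions together while the target tokens steer the fusion.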

📝 Abstract
Sarcasm is a linguistic phenomenon that intends to ridicule a target (e.g., an entity, event, or person) in an inherent way. Multimodal Sarcasm Explanation (MuSE) aims at revealing the intended irony in a sarcastic post through a natural-language explanation. Though important, existing systems have overlooked the significance of the target of sarcasm in generating explanations. In this paper, we propose a Target-aUgmented shaRed fusion-Based sarcasm explanatiOn model, aka TURBO. We design a novel shared-fusion mechanism to leverage the inter-modality relationships between an image and its caption. TURBO assumes the target of the sarcasm and guides the multimodal shared-fusion mechanism in learning the intricacies of the intended irony for explanation generation. We evaluate the proposed TURBO model on the MORE+ dataset. Comparison against multiple baselines and state-of-the-art models shows a performance improvement for TURBO by an average margin of +3.3%. Moreover, we explore LLMs in zero- and one-shot settings for our task and observe that LLM-generated explanations, though remarkable, often fail to capture the critical nuances of the sarcasm. Furthermore, we supplement our study with an extensive human evaluation of TURBO's generated explanations and find them to be comparatively better than those of other systems.
Problem

Research questions and friction points this paper is trying to address.

Generates sarcasm explanations using multimodal inputs
Incorporates target of sarcasm in explanation generation
Leverages image-caption relationships for improved performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shared-fusion mechanism
Target-augmented model
Multimodal sarcasm explanation
Palaash Goel
Indraprastha Institute of Information Technology Delhi, India
Dushyant Singh Chauhan
Indian Institute of Technology Patna, India
Md Shad Akhtar
IIIT Delhi
NLP, Conversational Dialog, Mental-Health, Misinformation, Code-mixed Languages