InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection

📅 2024-06-24
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing models for multi-modal sarcasm detection overly rely on superficial textual cues while neglecting fine-grained text–image interactions. To address this, the paper proposes InterCLIP-MEP: an interactive CLIP-based architecture that embeds cross-modal information directly into each encoder for deeper semantic alignment, paired with a dynamic dual-channel memory predictor that non-parametrically adapts to cross-modal knowledge from test samples during inference, improving generalization and robustness. Evaluated on the MMSD and MMSD2.0 benchmarks, the method achieves new state-of-the-art performance, improving accuracy and F1-score by 3.2% and 2.9%, respectively. This work is presented as the first to empirically validate a non-parametric test-time memory mechanism for multi-modal sarcasm detection.

📝 Abstract
Sarcasm in social media, often expressed through text-image combinations, poses challenges for sentiment analysis and intention mining. Current multi-modal sarcasm detection methods have been demonstrated to overly rely on spurious cues within the textual modality, revealing a limited ability to genuinely identify sarcasm through nuanced text-image interactions. To solve this problem, we propose InterCLIP-MEP, which introduces Interactive CLIP (InterCLIP) with an efficient training strategy to extract enriched text-image representations by embedding cross-modal information directly into each encoder. Additionally, we design a Memory-Enhanced Predictor (MEP) with a dynamic dual-channel memory that stores valuable test sample knowledge during inference, acting as a non-parametric classifier for robust sarcasm recognition. Experiments on two benchmarks demonstrate that InterCLIP-MEP achieves state-of-the-art performance, with significant accuracy and F1 score improvements on MMSD and MMSD2.0. Our code is available at https://github.com/CoderChen01/InterCLIP-MEP.
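The abstract describes the Memory-Enhanced Predictor (MEP) as a dynamic dual-channel memory that stores knowledge from valuable test samples during inference and acts as a non-parametric classifier. Below is a minimal sketch of how such a mechanism could work: one memory channel per class, populated with high-confidence test-time features, with later samples classified by cosine similarity against both channels. All names, thresholds, and the eviction policy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class MemoryEnhancedPredictor:
    """Sketch of a dual-channel non-parametric memory classifier.

    During inference, features of confidently classified samples are
    stored in the channel for their predicted class (0 = non-sarcastic,
    1 = sarcastic); subsequent samples are scored by average cosine
    similarity against each channel. Illustrative only.
    """

    def __init__(self, dim, capacity=32):
        self.memory = {0: [], 1: []}  # one feature list per class channel
        self.capacity = capacity
        self.dim = dim

    def _normalize(self, v):
        return v / (np.linalg.norm(v) + 1e-8)

    def update(self, feature, label, confidence, threshold=0.9):
        # Keep only high-confidence test samples; evict oldest when full
        if confidence >= threshold:
            channel = self.memory[label]
            channel.append(self._normalize(feature))
            if len(channel) > self.capacity:
                channel.pop(0)

    def predict(self, feature, fallback_logits):
        # Fall back to the parametric classifier while memory is empty
        if not self.memory[0] or not self.memory[1]:
            return int(np.argmax(fallback_logits))
        f = self._normalize(feature)
        scores = [np.mean([f @ m for m in self.memory[c]]) for c in (0, 1)]
        return int(np.argmax(scores))
```

The non-parametric step requires no gradient updates at test time, which is what would let such a predictor adapt to the test distribution without retraining.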
Problem

Research questions and friction points this paper is trying to address.

Detecting sarcasm in text-image social media posts
Overcoming over-reliance on spurious textual cues
Enhancing recognition through nuanced cross-modal interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive CLIP with cross-modal embedding
Dynamic dual-channel memory for inference
Non-parametric classifier using test knowledge
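"Interactive CLIP with cross-modal embedding" refers to injecting each modality's information directly into the other encoder. One common way to realize this, sketched here as an assumption rather than the paper's exact design, is to extend a text layer's attention keys and values with image representations so that every text token can attend to the image. Projections are omitted (identity) for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conditioned_self_attention(text_states, image_states):
    """Text self-attention whose key/value sequence is extended with
    image states, letting text tokens attend across modalities.
    Shapes: text (T, d), image (I, d) -> output (T, d). Sketch only."""
    scale = np.sqrt(text_states.shape[-1])
    # Concatenate the other modality into the key/value sequence
    kv = np.concatenate([text_states, image_states], axis=0)
    attn = softmax(text_states @ kv.T / scale)  # (T, T + I)
    return attn @ kv
```

Because each attention row is a convex combination over both text and image states, the resulting text representation is already cross-modally conditioned before any fusion head.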
Junjie Chen
Key Laboratory of Computer Application Technology, School of Computer and Information, Anhui Polytechnic University, Anhui, China
Subin Huang
Key Laboratory of Computer Application Technology, School of Computer and Information, Anhui Polytechnic University, Anhui, China