🤖 AI Summary
This study addresses the challenge that current vision-language models struggle to comprehend multimodal puns (expressions in which textual and visual elements jointly convey humorous or ambiguous meanings) and that no systematic benchmark exists for evaluating this ability. To close this gap, the authors present the first systematic investigation of multimodal pun understanding, introducing a novel generation pipeline and constructing MultiPun, a high-quality evaluation dataset featuring adversarial non-pun distractors. Building on this benchmark, they improve model performance through a combination of prompt engineering and fine-tuning. Experiments show that state-of-the-art models have limited ability to distinguish genuine multimodal puns from carefully designed distractors, whereas the proposed approach achieves an average F1 improvement of 16.5%, substantially advancing multimodal semantic comprehension.
📝 Abstract
Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements work together to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used for multimodal understanding and generation, their ability to understand puns has not been systematically studied, owing to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. We further propose both prompt-level and model-level strategies to enhance pun comprehension, yielding an average improvement of 16.5% in F1 score. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.
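For readers unfamiliar with the metric, the sketch below shows how F1 can be computed when the task is framed as a binary pun-vs-distractor decision, as described above. This is a minimal illustration only: the label encoding, the example predictions, and the use of scikit-learn are assumptions for demonstration, not the paper's actual evaluation protocol or code.

```python
# Minimal sketch: scoring pun-vs-distractor decisions with F1.
# Labels and predictions below are hypothetical, not drawn from MultiPun.
from sklearn.metrics import f1_score

# 1 = genuine multimodal pun, 0 = adversarial non-pun distractor
gold = [1, 1, 0, 0, 1, 0, 1, 0]
preds = [1, 0, 0, 1, 1, 0, 1, 0]  # hypothetical model outputs

# F1 balances precision (how many predicted puns are real) and recall
# (how many real puns are recovered), which matters when distractors
# are deliberately constructed to look like genuine puns.
print(f"F1 = {f1_score(gold, preds):.3f}")
```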