🤖 AI Summary
The absence of a benchmark dataset has hindered research on multimodal sarcasm detection in German. Method: This paper introduces MuSaG, the first German multimodal sarcasm detection dataset, comprising aligned text, audio, and video segments extracted from 33 minutes of German television programming, with each modality manually annotated. MuSaG enables systematic evaluation of both unimodal and multimodal sarcasm recognition. Contribution/Results: Benchmarking nine open-source and commercial models reveals that current systems achieve their highest performance on text alone, whereas human annotators rely far more on auditory cues—highlighting a gap between model behavior and human perception. MuSaG fills a longstanding data void in German multimodal sarcasm research and provides a foundational resource for developing models better aligned with realistic conversational settings.
📝 Abstract
Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.