🤖 AI Summary
The absence of a benchmark dataset has hindered research on multimodal sarcasm detection in German. Method: This paper introduces MuSaG, the first German multimodal sarcasm detection dataset, comprising aligned text, audio, and video segments extracted from 33 minutes of German television programming, with each modality manually annotated. MuSaG enables systematic evaluation of both unimodal and multimodal sarcasm recognition. Contribution/Results: Benchmarking nine open-source and commercial models reveals that current systems achieve their highest performance on text alone, whereas human annotators rely far more on auditory cues—highlighting a gap between model behavior and human perception. MuSaG fills a longstanding data void in German multimodal sarcasm research and provides a foundational resource for developing models better aligned with realistic conversational settings.
📝 Abstract
Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.