🤖 AI Summary
Multimodal sarcasm detection requires integrating cues across modalities and reasoning pragmatically over them, yet existing video-language models (VideoLMs) show significant limitations in interpreting such implicit intentions. Method: We introduce MUStReason, a diagnostic benchmark with systematic annotations of modality-specific cues and fine-grained reasoning steps, which exposes weaknesses in cross-modal perception and intent inference. We further propose PragCoT, a chain-of-thought prompting framework that decouples sarcasm detection into explicit “perception” and “pragmatic reasoning” stages, guiding models to look beyond literal meaning and infer non-literal intent. Contribution/Results: Quantitative evaluation and qualitative attribution analysis give an interpretable assessment of the models' reasoning processes. Experiments indicate that PragCoT improves VideoLMs’ pragmatic understanding, and the work provides both a benchmark and a method for multimodal pragmatic reasoning research.
📝 Abstract
Sarcasm is a type of irony whose interpretation requires distinguishing what is said from what is meant. Detecting sarcasm depends not only on the literal content of an utterance but also on non-verbal cues such as the speaker's tone of voice, facial expressions, and conversational context. However, current multimodal models struggle with complex tasks like sarcasm detection, which require identifying relevant cues across modalities and reasoning pragmatically over them to infer the speaker's intention. To explore these limitations in VideoLMs, we introduce MUStReason, a diagnostic benchmark enriched with annotations of modality-specific relevant cues and the underlying reasoning steps needed to identify sarcastic intent. In addition to benchmarking sarcasm classification performance in VideoLMs, we use MUStReason to quantitatively and qualitatively evaluate the generated reasoning by disentangling the problem into perception and reasoning. We also propose PragCoT, a framework that steers VideoLMs to focus on implied intentions over literal meaning, a property core to detecting sarcasm.
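The split into perception and reasoning can be illustrated with a minimal two-stage prompting sketch. This is not the authors' PragCoT implementation: the prompt wording, the function names, and the `model` callable (any function mapping a prompt string to a text response) are all hypothetical, and a real VideoLM would also receive the video and audio inputs, which are omitted here.

```python
from typing import Callable, Dict


def build_perception_prompt(utterance: str, context: str) -> str:
    """Stage 1 (hypothetical wording): ask for observable cues only, no verdict."""
    return (
        "Describe the cues relevant to the speaker's intent.\n"
        f"Conversational context: {context}\n"
        f"Utterance: {utterance}\n"
        "List the literal content, the speaker's tone of voice, "
        "and the facial expressions. Do not judge sarcasm yet."
    )


def build_reasoning_prompt(utterance: str, cues: str) -> str:
    """Stage 2 (hypothetical wording): reason over the cues about implied intent."""
    return (
        "Perceived cues:\n"
        f"{cues}\n"
        f"Utterance: {utterance}\n"
        "Compare what is said with what the cues imply is meant. "
        "Explain step by step, then finish with a line of the form "
        "'Answer: sarcastic' or 'Answer: not sarcastic'."
    )


def parse_label(response: str) -> str:
    """Read the verdict from the last 'Answer:' line, or return 'unknown'."""
    for line in reversed(response.strip().splitlines()):
        text = line.strip().lower()
        if text.startswith("answer:"):
            verdict = text[len("answer:"):].strip()
            # Check the negated form first, since it contains "sarcastic".
            if verdict.startswith("not"):
                return "not sarcastic"
            if verdict.startswith("sarcastic"):
                return "sarcastic"
    return "unknown"


def detect_sarcasm(
    model: Callable[[str], str], utterance: str, context: str
) -> Dict[str, str]:
    """Run perception first, then feed its output into the reasoning stage."""
    cues = model(build_perception_prompt(utterance, context))
    reasoning = model(build_reasoning_prompt(utterance, cues))
    return {"cues": cues, "reasoning": reasoning, "label": parse_label(reasoning)}
```

Keeping the two model calls separate means the intermediate `cues` and `reasoning` strings can each be inspected on their own, which mirrors the idea of evaluating perception and reasoning separately rather than only the final label.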