🤖 AI Summary
This study addresses spoken sarcasm understanding, a challenging cross-modal natural language understanding task, by systematically evaluating multimodal large language models (e.g., Qwen-Omni) in bilingual English–Chinese settings. To overcome the limitation of prior work, which focuses predominantly on textual or image–text sarcasm while neglecting the critical role of speech, the authors propose a collaborative gating fusion module to investigate effective audio–text, audio–visual, and trimodal joint modeling. Experimental results show that the audio-only modality achieves the strongest unimodal performance; that audio–text and audio–visual bimodal combinations outperform both unimodal and full trimodal fusion baselines; and that the model delivers competitive performance under zero-shot, few-shot, and LoRA-finetuned settings. This work provides empirical evidence of cross-lingual generalization and effective modality synergy in multimodal large models for spoken sarcasm comprehension.
📝 Abstract
Sarcasm detection remains a challenge in natural language understanding, as sarcastic intent often relies on subtle cross-modal cues spanning text, speech, and vision. While prior work has primarily focused on textual or visual-textual sarcasm, comprehensive audio-visual-textual sarcasm understanding remains underexplored. In this paper, we systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection on the English MUStARD++ and Chinese MCSD 1.0 datasets in zero-shot, few-shot, and LoRA fine-tuning settings. In addition to direct classification, we explore these models as feature encoders, integrating their representations through a collaborative gating fusion module. Experimental results show that audio-based models achieve the strongest unimodal performance, while audio-text and audio-vision combinations outperform both unimodal and trimodal models. Furthermore, MLLMs such as Qwen-Omni show competitive zero-shot and fine-tuned performance. Our findings highlight the potential of MLLMs for cross-lingual, audio-visual-textual sarcasm understanding.
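The abstract does not detail the collaborative gating fusion module, but a common gating-fusion pattern for combining two modality encoders can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, weights, and the convex-combination gating rule are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(audio_feat, text_feat, W_g, b_g):
    """Fuse two modality feature vectors with a learned element-wise gate.

    gate  = sigmoid(W_g @ [audio; text] + b_g)
    fused = gate * audio + (1 - gate) * text
    Each fused component is a convex combination of the two modalities.
    """
    concat = np.concatenate([audio_feat, text_feat])
    gate = sigmoid(W_g @ concat + b_g)
    return gate * audio_feat + (1.0 - gate) * text_feat

# Illustrative dimensions: 4-dim features, small random gate weights.
rng = np.random.default_rng(0)
d = 4
audio = rng.standard_normal(d)
text = rng.standard_normal(d)
W_g = rng.standard_normal((d, 2 * d)) * 0.1
b_g = np.zeros(d)

fused = gated_fusion(audio, text, W_g, b_g)
print(fused.shape)
```

In this pattern the gate lets the model weight speech cues more heavily on some dimensions and text cues on others, which is one plausible way the reported audio-text synergy could be realized; a trimodal variant would extend the gate to three feature streams.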