🤖 AI Summary
This study addresses spoken sarcasm understanding, a challenging cross-modal natural language understanding task, by systematically evaluating multimodal large language models (e.g., Qwen-Omni) in bilingual English–Chinese settings. To overcome the limitation of prior work, which focuses predominantly on textual or image–text sarcasm while neglecting the critical role of speech, the authors propose a collaborative gating fusion module to investigate effective audio–text, audio–visual, and trimodal joint modeling. Experimental results show that the audio-only modality achieves the strongest unimodal performance; that audio–text and audio–visual bimodal combinations outperform both unimodal and full trimodal fusion baselines; and that the model delivers competitive performance under zero-shot, few-shot, and LoRA-finetuned settings. This work provides empirical evidence of cross-lingual generalization and effective modality synergy in multimodal large models for spoken sarcasm comprehension.
📝 Abstract
Sarcasm detection remains a challenge in natural language understanding, as sarcastic intent often relies on subtle cross-modal cues spanning text, speech, and vision. While prior work has primarily focused on textual or visual-textual sarcasm, comprehensive audio-visual-textual sarcasm understanding remains underexplored. In this paper, we systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection on the English MUStARD++ and Chinese MCSD 1.0 datasets in zero-shot, few-shot, and LoRA fine-tuning settings. In addition to direct classification, we explore these models as feature encoders, integrating their representations through a collaborative gating fusion module. Experimental results show that audio-based models achieve the strongest unimodal performance, while audio-text and audio-vision combinations outperform both unimodal and trimodal models. Furthermore, MLLMs such as Qwen-Omni show competitive zero-shot and fine-tuned performance. Our findings highlight the potential of MLLMs for cross-lingual, audio-visual-textual sarcasm understanding.
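The abstract does not detail the collaborative gating fusion module, but a common gating-fusion pattern for combining two modality encoders can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, weights, and the convex-combination gating rule are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(audio_feat, text_feat, W_g, b_g):
    """Fuse two modality feature vectors with a learned element-wise gate.

    gate  = sigmoid(W_g @ [audio; text] + b_g)
    fused = gate * audio + (1 - gate) * text
    Each fused component is a convex combination of the two modalities.
    """
    concat = np.concatenate([audio_feat, text_feat])
    gate = sigmoid(W_g @ concat + b_g)
    return gate * audio_feat + (1.0 - gate) * text_feat

# Illustrative dimensions: 4-dim features, small random gate weights.
rng = np.random.default_rng(0)
d = 4
audio = rng.standard_normal(d)
text = rng.standard_normal(d)
W_g = rng.standard_normal((d, 2 * d)) * 0.1
b_g = np.zeros(d)

fused = gated_fusion(audio, text, W_g, b_g)
print(fused.shape)
```

In this pattern the gate lets the model weight speech cues more heavily on some dimensions and text cues on others, which is one plausible way the reported audio-text synergy could be realized; a trimodal variant would extend the gate to three feature streams.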