🤖 AI Summary
Weak generalization and shallow understanding plague large language models (LLMs) in multimodal sarcasm detection. To address this, we propose Commander-GPT: a modular, multi-agent collaboration framework inspired by military command-and-control theory. It establishes a hierarchical command structure that dynamically routes subtasks to specialized multimodal agents and fuses their outputs. The commander role ranges from a trained lightweight multimodal encoder (e.g., multimodal BERT), to small autoregressive language models (e.g., DeepSeek-VL), to large LLMs (Gemini Pro and GPT-4o) that perform task allocation and result aggregation in a zero-shot fashion. This work is the first to systematically apply the command-and-control paradigm to multimodal sarcasm identification, overcoming the cognitive limitations of monolithic models. Evaluated on the MMSD and MMSD 2.0 benchmarks, Commander-GPT achieves average F1 improvements of 4.4% and 11.7%, respectively, significantly outperforming state-of-the-art methods.
📝 Abstract
Multimodal sarcasm understanding is a high-order cognitive task. Although large language models (LLMs) have shown impressive performance on many downstream NLP tasks, growing evidence suggests that they struggle with sarcasm understanding. In this paper, we propose Commander-GPT, a modular decision-routing framework inspired by military command theory. Rather than relying on a single LLM's capability, Commander-GPT orchestrates a team of specialized LLM agents, each selectively assigned to a focused sub-task such as context modeling or sentiment analysis. Their outputs are then routed back to the commander, which integrates the information and performs the final sarcasm judgment. To coordinate these agents, we introduce three types of centralized commanders: (1) a trained lightweight encoder-based commander (e.g., multimodal BERT); (2) four small autoregressive language models serving as moderately capable commanders (e.g., DeepSeek-VL); and (3) two large LLM-based commanders (Gemini Pro and GPT-4o) that perform task routing, output aggregation, and sarcasm decision-making in a zero-shot fashion. We evaluate Commander-GPT on the MMSD and MMSD 2.0 benchmarks, comparing five prompting strategies. Experimental results show that our framework achieves 4.4% and 11.7% improvements in F1 score over state-of-the-art (SoTA) baselines on average, demonstrating its effectiveness.
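To make the commander-agent control flow concrete, here is a minimal sketch of the routing-and-aggregation loop the abstract describes. All names, the toy rule-based "agents", and the aggregation rule are illustrative assumptions, not the paper's actual models or prompts; in the real framework each agent would be an LLM call and the commander a trained encoder or a zero-shot LLM.

```python
# Hypothetical sketch of a Commander-GPT-style routing loop.
# The keyword-matching "agents" and the disagreement-based aggregation
# rule are illustrative stand-ins for the paper's LLM agents.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentReport:
    agent: str     # which specialist produced this report
    subtask: str   # the sub-task the commander assigned
    finding: str   # the agent's output, routed back to the commander


# Toy specialist "agents": each handles one focused sub-task.
def sentiment_agent(text: str) -> str:
    return "negative" if "great" in text and "rain" in text else "neutral"


def context_agent(text: str) -> str:
    return "incongruent" if "great" in text and "rain" in text else "consistent"


AGENTS: Dict[str, Callable[[str], str]] = {
    "sentiment_analysis": sentiment_agent,
    "context_modeling": context_agent,
}


def commander(sample: str) -> bool:
    """Route sub-tasks to agents, collect reports, make the final call."""
    reports: List[AgentReport] = [
        AgentReport(fn.__name__, task, fn(sample))
        for task, fn in AGENTS.items()
    ]
    # Aggregation step: flag sarcasm when surface sentiment and
    # contextual cues disagree (a common sarcasm signal).
    findings = {r.subtask: r.finding for r in reports}
    return (findings["sentiment_analysis"] == "negative"
            and findings["context_modeling"] == "incongruent")


print(commander("What great weather for a picnic, it's pouring rain."))  # True
```

The sketch shows the division of labor the abstract motivates: specialists see only their sub-task, and only the commander holds the fused view needed for the final sarcasm judgment.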