🤖 AI Summary
This work evaluates whether multimodal large language models (MLLMs) can comprehend video humor from visual cues alone. To this end, the authors introduce v-HUB, a visual-centric video humor understanding benchmark comprising minimally verbal short videos sourced from classic silent films and online resources, each paired with rich human annotations (captions, descriptions, and explanations). v-HUB supports three evaluation tasks: caption matching, humor explanation, and open-ended video question answering. Experiments span a diverse set of open-source and proprietary MLLMs, from specialized Video-LLMs to audio-capable OmniLLMs, and reveal a marked performance drop when humor must be understood without audio; incorporating audio consistently improves results. These findings expose the limitations of current MLLMs in visual-only humor understanding and highlight the promise of integrating richer modalities for complex video understanding tasks.
📝 Abstract
AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel visual-centric video humor understanding benchmark. v-HUB comprises a curated collection of minimally verbal short videos, sourced from classic silent films and online resources, and reflecting real-world scenarios where humor can be appreciated purely through visual cues. Each video clip is paired with rich annotations, including captions, descriptions, and explanations, supporting evaluation tasks like caption matching and humor explanation. To broaden its applicability, we further construct an open-ended video QA task, making it readily integrable into existing video understanding benchmarks. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. For example, all models exhibit a marked performance drop on caption matching when moving from text-based to video-based evaluation (without audio). Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the informativeness of sound and the promise of integrating richer modalities for complex video understanding tasks.