V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work evaluates the capability of multimodal large language models (MLLMs) to comprehend video humor from visual cues alone. To this end, we introduce v-HUB, a purely vision-centric video humor benchmark comprising classic silent films and contemporary short online videos, characterized by sparse verbal content yet rich human annotations. v-HUB supports three tasks: caption matching, humor explanation, and open-ended video question answering. It enables a systematic evaluation of state-of-the-art Video-LLMs and OmniLLMs on visual-only humor understanding, revealing a substantial performance degradation in audio-free settings; conversely, integrating audio significantly improves results. The experiments expose fundamental limitations of current MLLMs in transferring humor understanding across modalities, underscoring the need for fine-grained visual semantic modeling and effective multimodal fusion.

📝 Abstract
AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel visual-centric video humor understanding benchmark. v-HUB comprises a curated collection of minimally verbal short videos, sourced from classic silent films and online resources, and reflecting real-world scenarios where humor can be appreciated purely through visual cues. Each video clip is paired with rich annotations, including captions, descriptions, and explanations, supporting evaluation tasks like caption matching and humor explanation. To broaden its applicability, we further construct an open-ended video QA task, making it readily integrable into existing video understanding benchmarks. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. For example, all models exhibit a marked performance drop on caption matching when moving from text-based to video-based evaluation (without audio). Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the informativeness of sound and the promise of integrating richer modalities for complex video understanding tasks.
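As a rough illustration of how the caption-matching task might be scored, below is a minimal sketch of an accuracy loop. The `query_model` call and the per-example fields (`video`, `candidates`, `answer_idx`) are hypothetical placeholders for whatever MLLM interface and data format are used; they are not the benchmark's actual evaluation code.

```python
# Minimal, hypothetical sketch of scoring caption matching on a vision-only humor benchmark.
# query_model() stands in for the MLLM under evaluation; the JSONL field names are illustrative.
import json
import string


def query_model(video_path: str, prompt: str) -> str:
    """Placeholder for an MLLM call that takes a video plus a text prompt and returns text."""
    raise NotImplementedError("plug in the model under evaluation here")


def caption_matching_accuracy(examples_path: str) -> float:
    # One JSON object per line: {"video": ..., "candidates": [...], "answer_idx": ...}
    with open(examples_path) as f:
        examples = [json.loads(line) for line in f]

    correct = 0
    for ex in examples:
        # Present candidate captions as lettered options (A, B, C, ...).
        options = "\n".join(
            f"{string.ascii_uppercase[i]}. {cap}" for i, cap in enumerate(ex["candidates"])
        )
        prompt = (
            "Which caption best matches the humor in this video? "
            f"Answer with a single letter.\n{options}"
        )
        reply = query_model(ex["video"], prompt).strip().upper()
        predicted = string.ascii_uppercase.find(reply[0]) if reply else -1
        correct += int(predicted == ex["answer_idx"])

    return correct / len(examples)
```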
Problem

Research questions and friction points this paper is trying to address.

Benchmark evaluates MLLMs' visual humor understanding in videos
Assesses models' ability to interpret humor without audio cues
Measures performance gap between text-based and video-based humor comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces visual-centric video humor benchmark v-HUB
Evaluates multimodal models on minimally verbal video clips
Shows that incorporating audio improves video humor understanding
👥 Authors
Zhengpeng Shi (Shanghai Jiao Tong University)
Hengli Li (Institute for Artificial Intelligence, Peking University)
Yanpeng Zhao (University of Edinburgh)
Jianqun Zhou (Wuhan University)
Yuxuan Wang (Independent Researcher)
Qinrong Cui (Independent Researcher)
Wei Bi (HKUST)
Songchun Zhu (Beijing Institute for General Artificial Intelligence)
Bo Zhao (Shanghai Jiao Tong University)
Zilong Zheng (Beijing Institute for General Artificial Intelligence)