ViMU: Benchmarking Video Metaphorical Understanding

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of current video understanding models: they focus predominantly on literal content and struggle to interpret socially nuanced subtext such as metaphor and irony. To bridge this gap, the authors introduce ViMU, which they describe as the first benchmark designed to systematically evaluate models' ability to comprehend implicit meaning in videos. ViMU uses a hint-free design, so that no key evidence is disclosed to a model before it answers, and combines open-ended and multiple-choice questions to assess how well models infer implicit meaning while grounding their interpretations in multimodal evidence. The subtext it targets includes humor, irony, mockery, and criticism, and the authors note that such meanings can be interpreted very differently across cultural backgrounds and social groups. ViMU thus aims to provide a systematic measure of whether models can move beyond surface-level perception toward context-aware understanding.
📝 Abstract
Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: the content directly presented, and the subtext beneath it, that is, the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models on videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.
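As an illustration of the multiple-choice half of the evaluation described above, the following is a minimal sketch of how such items might be scored. The abstract does not give ViMU's data schema or metrics, so the `MultipleChoiceItem` fields, the `multiple_choice_accuracy` function, the demo items, and the `always_first` baseline are all hypothetical and assumed here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical item layout; not ViMU's actual schema."""
    video_id: str
    question: str            # hint-free: wording must not reveal key evidence
    options: Sequence[str]
    answer_index: int        # index of the correct option


def multiple_choice_accuracy(
    items: Sequence[MultipleChoiceItem],
    predict: Callable[[MultipleChoiceItem], int],
) -> float:
    """Fraction of items for which the predicted option index is correct."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(1 for item in items if predict(item) == item.answer_index)
    return correct / len(items)


# Made-up demo items (assumed for illustration, not taken from the benchmark).
demo_items = [
    MultipleChoiceItem(
        video_id="demo-001",
        question="What is the creator most likely implying?",
        options=["Sincere praise", "Ironic criticism", "Neutral reporting"],
        answer_index=1,
    ),
    MultipleChoiceItem(
        video_id="demo-002",
        question="What is the creator most likely implying?",
        options=["Playful humor", "A warning", "Neutral reporting"],
        answer_index=0,
    ),
]


def always_first(item: MultipleChoiceItem) -> int:
    """Trivial baseline that ignores the video and picks the first option."""
    return 0


baseline_accuracy = multiple_choice_accuracy(demo_items, always_first)
```

In this sketch both demo questions are worded identically and name no visual or audio detail, which is one way to read the hint-free requirement: the question alone gives a model nothing to go on. A real model would replace `always_first` with a function that watches the video identified by `video_id`; scoring the open-ended questions would need a separate procedure that the abstract does not describe.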
Problem

Research questions and friction points this paper is trying to address.

video understanding
metaphorical meaning
subtext
implicit meaning
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

video metaphorical understanding
subtext comprehension
multimodal reasoning
benchmark dataset
implicit meaning
🔎 Similar Papers