ViMU: Benchmarking Video Metaphorical Understanding

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of current video understanding models: they focus predominantly on literal content and struggle to interpret socially nuanced subtext such as metaphor and irony. To bridge this gap, the authors introduce ViMU, the first benchmark designed to systematically evaluate how well models comprehend implicit meaning in videos. ViMU uses a hint-free question design, in which no key evidence is disclosed to the model before it answers, and combines open-ended and multiple-choice formats to assess how well models infer implicit meaning and ground it in multimodal evidence. The benchmark targets subtext such as humor, irony, mockery, and criticism, whose interpretation can differ across cultural backgrounds and social groups. ViMU thereby offers a systematic way to measure whether models can move beyond surface-level perception toward context-aware understanding.

📝 Abstract

Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it, namely the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.

Problem

Research questions and friction points this paper is trying to address.

video understanding

metaphorical meaning

subtext

implicit meaning

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

video metaphorical understanding

subtext comprehension

multimodal reasoning