When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?

📅 2025-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (VLMs) exhibit significant deficiencies in understanding contradiction-based humor, such as “YES/BUT” juxtaposition cartoons, which demands comparative reasoning; this limits their capacity for human-like inference and cultural comprehension. To address this, we introduce YesBut (V2), a multilingual, multicultural cartoon benchmark with four progressively challenging tasks that systematically evaluate narrative understanding, from surface-level perception to cross-modal contrastive reasoning. We propose the first fine-grained narrative understanding framework tailored to contradiction-based humor, uncovering a structural deficit in VLMs’ contrastive reasoning (42.6% below human performance). Our method integrates social-knowledge injection, multimodal contrastive learning, and hallucination-aware key-element localization, yielding up to an 18.3% absolute accuracy gain on critical tasks and markedly improving robustness in identifying and reasoning about contradictory elements.

📝 Abstract
Understanding humor, particularly when it involves complex, contradictory narratives that require comparative reasoning, remains a significant challenge for large vision-language models (VLMs). This limitation hinders AI's ability to engage in human-like reasoning and cultural expression. In this paper, we investigate this challenge through an in-depth analysis of comics that juxtapose panels to create humor through contradiction. We introduce YesBut (V2), a novel benchmark of 1,262 comic images from diverse multilingual and multicultural contexts, featuring comprehensive annotations that capture various aspects of narrative understanding. Using this benchmark, we systematically evaluate a wide range of VLMs on four complementary tasks spanning surface content comprehension to deep narrative reasoning, with particular emphasis on comparative reasoning between contradictory elements. Our extensive experiments reveal that even the most advanced models significantly underperform humans, with common failures in visual perception, key-element identification, comparative analysis, and hallucination. We further investigate text-based training strategies and social-knowledge augmentation methods to enhance model performance. Our findings not only highlight critical weaknesses in VLMs' understanding of cultural and creative expression but also provide pathways toward context-aware models capable of deeper narrative understanding through comparative reasoning.
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' ability to understand contradictory humor in comics
Assessing comparative reasoning skills in large vision-language models
Identifying weaknesses in cultural and creative expression comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces YesBut benchmark for humor analysis
Evaluates VLMs on comparative reasoning tasks
Proposes text-based training and knowledge augmentation
👥 Authors

Tuo Liang, Case Western Reserve University (VLM, Visual Reasoning, Visual Hallucination)
Zhe Hu, Department of Computing, The Hong Kong Polytechnic University
Jing Li, Department of Computing, The Hong Kong Polytechnic University
Hao Zhang, Computer and Data Sciences Department, Case Western Reserve University
Yiren Lu, PhD Candidate, Case Western Reserve University (3D Vision, Spatial AI, Robotics)
Yunlai Zhou, Case Western Reserve University
Yiran Qiao, Case Western Reserve University
Disheng Liu, Case Western Reserve University (Computer Vision)
Jeirui Peng, Computer and Data Sciences Department, Case Western Reserve University
Jing Ma, Computer and Data Sciences Department, Case Western Reserve University
Yu Yin, Computer and Data Sciences Department, Case Western Reserve University