Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of large multimodal models (LMMs) in multimodal humor understanding and comic narrative sequence recognition. To this end, we introduce PixelHumor, the first benchmark dataset specifically designed for webcomics, comprising 2,800 multi-panel comics, and systematically evaluate LMMs' capacity to model cross-modal narrative logic and humor. Methodologically, we propose a dual-task evaluation protocol, panel ordering and humor understanding, grounded in human-annotated ground truth. Our key contributions are: (1) establishing the first multimodal narrative benchmark explicitly targeting humor understanding as an aspect of social intelligence, filling a critical gap in multimodal evaluation; and (2) empirically demonstrating that state-of-the-art LMMs achieve only 61% accuracy on panel ordering, significantly below human performance, revealing fundamental deficiencies in visual-linguistic alignment, temporal and causal reasoning, and cross-modal contextual integration.
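The panel-ordering task above can be scored straightforwardly against the human-annotated order. The paper itself does not spell out its scoring rule, so the sketch below is a hypothetical evaluation loop: it assumes exact-match accuracy (the metric behind the reported 61%) plus Kendall's tau as a softer partial-credit measure; both metric choices and all variable names are assumptions, not details from the paper.

```python
# Hypothetical panel-ordering evaluation sketch; exact-match scoring and
# Kendall's tau are assumed metrics, not confirmed details of PixelHumor.
from itertools import combinations

def exact_match(pred, gold):
    """1.0 if the predicted panel order equals the gold order, else 0.0."""
    return 1.0 if list(pred) == list(gold) else 0.0

def kendall_tau(pred, gold):
    """Kendall's tau between two permutations of the same panel ids
    (1.0 = identical order, -1.0 = fully reversed)."""
    pos = {panel: i for i, panel in enumerate(pred)}
    n = len(gold)
    # Count panel pairs whose relative order the prediction preserves.
    concordant = sum(
        1 for a, b in combinations(range(n), 2)
        if pos[gold[a]] < pos[gold[b]]
    )
    total = n * (n - 1) // 2
    return (2 * concordant - total) / total

# Toy data: two 4-panel comics, model orders vs. human-annotated orders.
predictions = [[0, 1, 2, 3], [1, 0, 2, 3]]
references  = [[0, 1, 2, 3], [0, 1, 2, 3]]

acc = sum(exact_match(p, g) for p, g in zip(predictions, references)) / len(references)
print(f"panel-ordering accuracy: {acc:.2f}")  # 0.50
```

Exact match is strict (one swapped panel scores zero), which is why a rank-correlation measure like Kendall's tau is often reported alongside it for sequencing tasks.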

📝 Abstract
Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs' ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models' integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LMMs' multimodal humor understanding in comics
Assessing narrative sequence recognition in visual-textual contexts
Identifying gaps in visual-textual integration for humor comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

PixelHumor benchmark dataset for multimodal humor
Evaluates visual-textual narrative sequencing in comics
Drives development of socially aware LMM interactions