🤖 AI Summary
This work addresses the limitations of large multimodal models (LMMs) in multimodal humor understanding and comic narrative sequence recognition. To this end, we introduce PixelHumor, the first benchmark dataset specifically designed for webcomics, comprising 2,800 multi-panel comics, and systematically evaluate LMMs' capacity to model cross-modal narrative logic and humor cognition. Methodologically, we propose a dual-task evaluation protocol, panel ordering and humor understanding, grounded in rigorously human-annotated ground truth. Our key contributions are: (1) establishing the first multimodal narrative benchmark explicitly targeting humor cognition as a facet of social intelligence, thereby filling a critical gap in multimodal evaluation; and (2) empirically demonstrating that state-of-the-art LMMs achieve only 61% accuracy on panel ordering, far below human performance, revealing fundamental deficiencies in visual–linguistic alignment, temporal causal reasoning, and cross-modal contextual integration.
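To make the panel-ordering task concrete, here is a minimal scoring sketch, assuming exact-match accuracy (a comic counts as correct only if the full predicted permutation matches the annotation); the function name and data layout are illustrative, not the released PixelHumor format or the paper's exact metric:

```python
from typing import Sequence

def panel_order_accuracy(
    predictions: Sequence[Sequence[int]],
    ground_truth: Sequence[Sequence[int]],
) -> float:
    """Exact-match accuracy over comics: a prediction is correct only
    when the predicted panel permutation equals the annotated order."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must align one-to-one")
    correct = sum(
        list(pred) == list(gold)
        for pred, gold in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)

# Hypothetical example: three comics, two orders predicted correctly.
preds = [[0, 1, 2, 3], [2, 0, 1], [1, 0, 2, 3]]
gold  = [[0, 1, 2, 3], [2, 0, 1], [0, 1, 2, 3]]
print(f"panel-ordering accuracy: {panel_order_accuracy(preds, gold):.2%}")
```

Under this all-or-nothing criterion, partially correct orderings earn no credit, which makes the reported 61% ceiling for top models a strict measure of narrative-sequence understanding.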
📝 Abstract
Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs' ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models' integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.