🤖 AI Summary
This work addresses the limitations of large multimodal models (LMMs) in multimodal humor understanding and comic narrative sequence recognition. To this end, we introduce PixelHumor, the first benchmark dataset specifically designed for webcomics, comprising 2,800 multi-panel comics, and systematically evaluate LMMs' capacity to model cross-modal narrative logic and humor cognition. Methodologically, we propose a dual-task evaluation protocol, panel ordering and humor understanding, grounded in rigorously human-annotated ground truth. Our key contributions are: (1) establishing the first multimodal narrative benchmark explicitly targeting humor cognition as a facet of social intelligence, thereby filling a critical gap in multimodal evaluation; and (2) empirically demonstrating that state-of-the-art LMMs achieve only 61% accuracy on panel ordering, far below human performance, revealing fundamental deficiencies in visual–linguistic alignment, temporal causal reasoning, and cross-modal contextual integration.
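To make the panel-ordering task concrete, here is a minimal scoring sketch, assuming exact-match accuracy (a comic counts as correct only if the full predicted permutation matches the annotation); the function name and data layout are illustrative, not the released PixelHumor format or the paper's exact metric:

```python
from typing import Sequence

def panel_order_accuracy(
    predictions: Sequence[Sequence[int]],
    ground_truth: Sequence[Sequence[int]],
) -> float:
    """Exact-match accuracy over comics: a prediction is correct only
    when the predicted panel permutation equals the annotated order."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must align one-to-one")
    correct = sum(
        list(pred) == list(gold)
        for pred, gold in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)

# Hypothetical example: three comics, two orders predicted correctly.
preds = [[0, 1, 2, 3], [2, 0, 1], [1, 0, 2, 3]]
gold  = [[0, 1, 2, 3], [2, 0, 1], [0, 1, 2, 3]]
print(f"panel-ordering accuracy: {panel_order_accuracy(preds, gold):.2%}")
```

Under this all-or-nothing criterion, partially correct orderings earn no credit, which makes the reported 61% ceiling for top models a strict measure of narrative-sequence understanding.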
📝 Abstract
Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs' ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models' integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.