Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

📅 2026-04-16

🤖 AI Summary

This work addresses the opacity of existing approaches to humor understanding by proposing Incongruity-Resolution Supervision (IRS), a framework grounded in incongruity-resolution theory. It decomposes humor comprehension into three supervisable components: incongruity modeling, resolution modeling, and preference alignment. The framework guides multimodal large language models (7B, 32B, and 72B parameters) through structured reasoning traces that lead from visual perception to humorous interpretation. By integrating vision–language alignment with human preference optimization, the method outperforms strong open and closed multimodal baselines on the New Yorker Cartoon Caption Contest (NYCC) in caption matching and ranking, with the largest model approaching expert-level performance on ranking and showing zero-shot transfer to external benchmarks.

📝 Abstract

Humor is one of the few cognitive tasks where getting the reasoning right matters as much as getting the answer right. While recent work evaluates humor understanding on benchmarks such as the New Yorker Cartoon Caption Contest (NYCC), it largely treats the task as black-box prediction, overlooking the structured reasoning processes underlying humor comprehension. We introduce IRS (Incongruity-Resolution Supervision), a framework that decomposes humor understanding into three components: incongruity modeling, which identifies mismatches in the visual scene; resolution modeling, which constructs coherent reinterpretations of these mismatches; and preference alignment, which evaluates candidate interpretations under human judgments. Grounded in incongruity-resolution theory and expert captionist practice, IRS supervises the intermediate reasoning process through structured traces that make the path from visual perception to humorous interpretation explicit and learnable. Across 7B, 32B, and 72B models on NYCC, IRS outperforms strong open and closed multimodal baselines on caption matching and ranking tasks, with our largest model approaching expert-level performance on ranking. Zero-shot transfer to external benchmarks shows that IRS learns generalizable reasoning patterns. Our results suggest that supervising reasoning structure, rather than scale alone, is key for reasoning-centric tasks.
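
The three-component decomposition described above can be pictured as a small data structure. The following is only an illustrative sketch, not the authors' implementation: the `ReasoningTrace` class, its field names, the `rank_captions` and `format_trace` helpers, the example captions, and the numeric preference scores are all hypothetical and chosen purely to show how a structured trace might connect an incongruity, its resolution, and a preference judgment.

```python
from dataclasses import dataclass


@dataclass
class ReasoningTrace:
    """Hypothetical structured trace for one candidate caption."""
    caption: str        # candidate caption being interpreted
    incongruity: str    # mismatch identified in the visual scene
    resolution: str     # reinterpretation that makes the mismatch coherent
    preference: float   # assumed scalar stand-in for human preference


def rank_captions(traces):
    """Order captions from most to least preferred (assumed scoring)."""
    ordered = sorted(traces, key=lambda t: t.preference, reverse=True)
    return [t.caption for t in ordered]


def format_trace(trace):
    """Render a trace as text, e.g. as a supervision target."""
    return "\n".join([
        f"Incongruity: {trace.incongruity}",
        f"Resolution: {trace.resolution}",
        f"Preference: {trace.preference:.2f}",
    ])


# Made-up examples for illustration only; not taken from NYCC data.
example_traces = [
    ReasoningTrace(
        caption="I said I wanted a corner office.",
        incongruity="An office desk sits on a window ledge outside the building.",
        resolution="The request for a 'corner office' was taken literally.",
        preference=0.8,
    ),
    ReasoningTrace(
        caption="Nice weather today.",
        incongruity="An office desk sits on a window ledge outside the building.",
        resolution="The speaker ignores the danger and makes small talk.",
        preference=0.3,
    ),
]
```

Under these assumptions, caption matching would amount to taking the first element of `rank_captions(example_traces)`, and ranking would use the full ordered list.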

Problem

Research questions and friction points this paper is trying to address.

multimodal humor understanding

incongruity-resolution

reasoning process

cartoon captioning

structured supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Incongruity-Resolution

Multimodal Humor Understanding

Reasoning Supervision