On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation

📅 2026-02-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge that existing large language models struggle to balance creativity and interpretability in multimodal humorous image-caption generation, often resulting in captions lacking both humor and diversity. To overcome this limitation, the authors propose HOMER, a novel framework that integrates the General Theory of Verbal Humor (GTVH) into a multi-agent large language model collaboration mechanism for the first time. HOMER achieves high-quality humorous caption generation through conflict script extraction, retrieval-augmented hierarchical imagination tree construction, and conditional caption generation. Experimental results on two New Yorker cartoon datasets demonstrate that HOMER significantly outperforms state-of-the-art methods, achieving consistent improvements in both humor quality and caption diversity.

📝 Abstract

Humor is a common yet intricate form of human expression in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs). It is typically framed as funny caption generation for images, which requires visual understanding, humor reasoning, creative imagination, and more. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, the General Theory of Verbal Humor (GTVH). To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) a conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) a retrieval-augmented hierarchical imaginator that identifies key humor targets and expands their creative space through diverse associations structured as imagination trees; and (3) a caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmark datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.
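
The three roles described in the abstract can be pictured as a small pipeline. The following is only a minimal sketch under stated assumptions, not the authors' implementation: every name in it (`ImaginationNode`, `extract_conflicting_scripts`, `build_imagination_tree`, `generate_captions`, `run_pipeline`, `toy_llm`), the prompt wording, and the `depth` and `branching` parameters are hypothetical. The LLM is stubbed with a deterministic function, retrieval is reduced to a list of example strings, and the humor target is passed in by hand rather than identified by the imaginator.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in for any LLM call: prompt in, text out.
LLM = Callable[[str], str]


@dataclass
class ImaginationNode:
    """One association in an imagination tree (illustrative structure only)."""
    concept: str
    children: List["ImaginationNode"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Return the concepts at the leaves, left to right."""
        if not self.children:
            return [self.concept]
        return [leaf for child in self.children for leaf in child.leaves()]


def extract_conflicting_scripts(llm: LLM, scene: str) -> Tuple[str, str]:
    """Role 1 (assumed form): ask for two opposed scripts separated by '|'."""
    reply = llm(f"Name two opposed scripts in this scene, separated by '|': {scene}")
    first, _, second = reply.partition("|")
    return first.strip(), second.strip()


def build_imagination_tree(llm: LLM, target: str, retrieved: List[str],
                           depth: int = 2, branching: int = 2) -> ImaginationNode:
    """Role 2 (assumed form): recursively expand a target into associations."""
    node = ImaginationNode(target)
    if depth == 0:
        return node
    reply = llm(f"List {branching} associations for '{target}', comma-separated. "
                f"Retrieved humor examples: {retrieved}")
    concepts = [c.strip() for c in reply.split(",") if c.strip()][:branching]
    for concept in concepts:
        node.children.append(
            build_imagination_tree(llm, concept, retrieved, depth - 1, branching))
    return node


def generate_captions(llm: LLM, scripts: Tuple[str, str],
                      tree: ImaginationNode) -> List[str]:
    """Role 3 (assumed form): one caption per leaf, conditioned on the scripts."""
    return [llm(f"Write a caption opposing '{scripts[0]}' and '{scripts[1]}' "
                f"using the idea '{leaf}'") for leaf in tree.leaves()]


def run_pipeline(llm: LLM, scene: str, target: str):
    scripts = extract_conflicting_scripts(llm, scene)
    tree = build_imagination_tree(llm, target, retrieved=["example joke"])
    return scripts, tree, generate_captions(llm, scripts, tree)


def toy_llm(prompt: str) -> str:
    """Deterministic stub so the sketch runs without a real model."""
    if prompt.startswith("Name two"):
        return "office meeting | jungle survival"
    if prompt.startswith("List"):
        return "vines, spreadsheets"
    return "caption for: " + prompt
```

Calling `run_pipeline(toy_llm, "a lion presenting slides", "lion")` returns the script pair, the tree, and one caption per leaf. Generating one caption per leaf is just one simple way to tie caption diversity to the breadth of the tree; the paper may condition generation differently.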

Problem

Research questions and friction points this paper is trying to address.

humor caption generation

multi-modal humor

large language models

creativity

script opposition

Innovation

Methods, ideas, or system contributions that make the work stand out.

conflicting-script

humor generation

multi-role LLM collaboration