Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty existing models have in simultaneously achieving cultural appropriateness, image relevance, and humor quality when generating humorous image captions. The paper introduces a new task, culture-aware humorous image captioning, establishes a six-dimensional evaluation framework, and proposes a staged alignment framework. The framework combines judge-based GRPO with a Degradation-aware Prototype Repulsion Constraint, and transfers from high-resource supervision under the Western cultural context to the Eastern cultural context using only a small amount of supervision. Experimental results show stronger overall performance than baseline approaches under the proposed evaluation framework, with particularly large gains in contextual fit and a better balance between image relevance and humor under cultural constraints.

📝 Abstract

Recent multimodal large language models have shown promising ability in generating humorous captions for images, yet they still lack stable control over explicit cultural context, making it difficult to jointly maintain image relevance, contextual appropriateness, and humor quality under a specified cultural background. To address this limitation, we introduce a new multimodal generation task, culture-aware humorous captioning, which requires a model to generate a humorous caption conditioned on both an input image and a target cultural context. Captions generated under different cultural contexts are not expected to share the same surface form, but should remain grounded in similar visual situations or humorous rationales. To support this task, we establish a six-dimensional evaluation framework covering image relevance, contextual fit, semantic richness, reasonableness, humor, and creativity. We further propose a staged alignment framework that first initializes the model with high-resource supervision under the Western cultural context, then performs multi-dimensional preference alignment via judge-based GRPO with a Degradation-aware Prototype Repulsion Constraint to mitigate reward hacking in open-ended generation, and finally adapts the model to the Eastern cultural context with a small amount of supervision. Experimental results show that our method achieves stronger overall performance under the proposed evaluation framework, with particularly large gains in contextual fit and a better balance between image relevance and humor under cultural constraints.
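
To make the judge-based GRPO step more concrete, the sketch below shows how six per-dimension judge scores might be collapsed into a scalar reward and then turned into group-relative advantages, which is the standard GRPO normalization within a group of sampled captions. The abstract does not specify how the dimensions are combined, so the equal-weight mean, the function names, and the example scores are all assumptions made for illustration. The Degradation-aware Prototype Repulsion Constraint is not modeled here, since its form is not described in this summary.

```python
from statistics import mean, pstdev

# The six evaluation dimensions named in the abstract.
DIMENSIONS = (
    "image_relevance",
    "contextual_fit",
    "semantic_richness",
    "reasonableness",
    "humor",
    "creativity",
)


def aggregate_reward(scores, weights=None):
    """Collapse per-dimension judge scores into one scalar.

    ASSUMPTION: a weighted mean with equal weights by default. The paper's
    actual aggregation rule is not given in the abstract.
    """
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: z-score each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for three captions sampled for the same
# (image, cultural context) pair. The numbers are made up.
group_scores = [
    dict(zip(DIMENSIONS, (4, 2, 3, 4, 3, 3))),
    dict(zip(DIMENSIONS, (4, 4, 3, 4, 4, 3))),
    dict(zip(DIMENSIONS, (2, 3, 3, 3, 5, 4))),
]
rewards = [aggregate_reward(s) for s in group_scores]
advantages = group_relative_advantages(rewards)
```

Captions whose aggregated reward is above the group mean receive a positive advantage and are reinforced; those below the mean are pushed down. Because the advantage is computed relative to the group, no separate value model is needed.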

Problem

Research questions and friction points this paper is trying to address.

culture-aware

humorous captioning

multimodal humor generation

cultural context

image relevance

Innovation

Methods, ideas, or system contributions that make the work stand out.

culture-aware humorous captioning

multimodal humor generation

preference alignment