🤖 AI Summary
Multimodal humor understanding requires integrating metaphorical, sociocultural, and commonsense knowledge—yet automatically identifying and selecting the right knowledge remains challenging. This paper introduces an unsupervised multimodal humor explanation framework: grounded in the information bottleneck principle, it elicits cross-modal latent knowledge from CLIP and large language models and iteratively refines it; knowledge distillation and counterfactual reasoning enable self-guided knowledge selection and semantic alignment—without human annotations. The method outperforms both supervised and zero-shot baselines across three benchmarks, empirically validating knowledge refinement as a route to better humor explainability. The core contribution is the application of the information bottleneck to unsupervised humor explanation, enabling autonomous cross-modal knowledge filtering and interpretable explanation generation.
📝 Abstract
Humor is prevalent in online communication and often relies on more than one modality (e.g., cartoons and memes). Interpreting humor in multimodal settings requires drawing on diverse types of knowledge, including metaphorical, sociocultural, and commonsense knowledge. However, identifying the most useful knowledge remains an open question. We introduce a method, inspired by the information bottleneck principle, that elicits relevant world knowledge from vision and language models and iteratively refines it to generate an explanation of the humor in an unsupervised manner. Our experiments on three datasets confirm the advantage of our method over a range of baselines. Our method can further be adapted to additional tasks that benefit from eliciting and conditioning on relevant world knowledge, opening new research avenues in this direction.
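The abstract's core mechanism—scoring candidate knowledge by an information-bottleneck trade-off and iteratively keeping the best candidates—can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `Candidate` type, the `relevance`/`cost` proxies for the mutual-information terms, and all numbers are hypothetical stand-ins for what the paper derives from CLIP and LLM signals.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str         # a piece of elicited world knowledge
    relevance: float  # proxy for I(knowledge; explanation) — assumed given
    cost: float       # proxy for I(knowledge; raw input) — assumed given

def ib_score(c: Candidate, beta: float) -> float:
    # IB-style trade-off: favor knowledge that is predictive of the
    # explanation while compressing away input-specific detail.
    return c.relevance - beta * c.cost

def refine(candidates, beta=0.5, rounds=3, keep=2):
    # Iteratively keep the top-scoring candidates. In the paper, refinement
    # is driven by model feedback; here, by fixed scores for illustration.
    pool = list(candidates)
    for _ in range(rounds):
        pool.sort(key=lambda c: ib_score(c, beta), reverse=True)
        pool = pool[:max(keep, 1)]
    return [c.text for c in pool]

# Hypothetical candidates for a cartoon explanation task:
cands = [
    Candidate("irony between caption and scene", relevance=0.9, cost=0.2),
    Candidate("verbatim caption transcription", relevance=0.4, cost=0.9),
    Candidate("cultural reference in the drawing", relevance=0.7, cost=0.3),
]
selected = refine(cands, beta=0.5)
```

With `beta=0.5`, the low-relevance, high-cost transcription candidate is filtered out, mirroring how the bottleneck discards knowledge that merely restates the input.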