🤖 AI Summary
Evaluating and enhancing large language models' (LLMs) humor comprehension remains challenging due to the lack of rigorous, bias-free benchmarks—especially in culturally nuanced domains such as Japanese creative wordplay games (e.g., Oogiri).
Method: We introduce Oogiri-Master, the first large-scale, popularity-debiased benchmark for humor understanding, paired with Oogiri-Corpus—a human-annotated dataset in which each prompt elicits ~100 responses, independently blind-rated by ~100 annotators. We propose a high-capacity, unbiased humor annotation paradigm; quantify associations between linguistic features (e.g., text length, ambiguity, incongruity resolution) and funniness; and design interpretable, objective funniness prediction metrics. Funniness scores are obtained via crowdsourced blind evaluation, robust aggregation, and feature-based regression modeling; LLMs are then evaluated with zero-shot and insight-augmented prompting strategies.
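The aggregation-plus-regression pipeline described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code: the trimmed-mean aggregator and the single-feature least-squares fit are assumptions standing in for whatever robust aggregation and feature-based regression the authors use, and all function names are hypothetical.

```python
def robust_funniness(ratings, trim=0.1):
    """Aggregate ~100 independent blind ratings into one funniness score.

    A trimmed mean is one robust choice (hypothetical here): because judges
    never see each other's ratings, outliers reflect taste, not popularity,
    and trimming keeps a few extreme judges from dominating the score.
    """
    k = int(len(ratings) * trim)          # number of ratings to drop per tail
    kept = sorted(ratings)[k:len(ratings) - k] if k else sorted(ratings)
    return sum(kept) / len(kept)

def fit_feature_regression(feature_values, funniness_scores):
    """Least-squares fit of aggregated funniness on one linguistic feature.

    Illustrative single-feature version (e.g., an ambiguity score); the
    paper's model presumably regresses on several features jointly.
    Returns (slope, intercept) for predicting funniness from the feature.
    """
    n = len(feature_values)
    mx = sum(feature_values) / n
    my = sum(funniness_scores) / n
    sxx = sum((x - mx) ** 2 for x in feature_values)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(feature_values, funniness_scores))
    slope = sxy / sxx
    return slope, my - slope * mx
```

The slope of such a fit is what makes the resulting funniness metric interpretable: it states directly how much each unit of a linguistic feature moves the predicted human rating.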
Results: Our approach achieves state-of-the-art performance on Oogiri-Master, approaching human-level accuracy; significantly improves the quality of generated humorous responses; and establishes the first reproducible, comparable quantitative evaluation standard for humor understanding in LLMs.
📝 Abstract
Humor is a salient testbed for human-like creative thinking in large language models (LLMs). We study humor through the Japanese creative response game Oogiri, in which participants produce witty responses to a given prompt, and ask the following research question: What makes such responses funny to humans? Previous work has offered only limited reliable means of answering this question: existing datasets contain few candidate responses per prompt, expose popularity signals during rating, and lack objective, comparable metrics for funniness. We therefore introduce Oogiri-Master and Oogiri-Corpus, a benchmark and dataset designed to enable rigorous evaluation of humor understanding in LLMs. Each prompt is paired with approximately 100 diverse candidate responses, and funniness is rated independently by approximately 100 human judges without access to others' ratings, reducing popularity bias and enabling robust aggregation. Using Oogiri-Corpus, we conduct a quantitative analysis of the linguistic factors associated with funniness, such as text length, ambiguity, and incongruity resolution, and derive objective metrics for predicting human judgments. We then benchmark a range of LLMs and human baselines on Oogiri-Master, demonstrating that state-of-the-art models approach human performance and that insight-augmented prompting further improves model performance. Our results provide a principled basis for evaluating and advancing humor understanding in LLMs.