🤖 AI Summary
This study evaluates the reasoning capabilities of large language models (LLMs) in non-STEM domains, specifically complex humor understanding and explanation, where cultural context, linguistic ambiguity, and informal inference pose significant challenges. Method: We introduce HumorBench, the first systematic benchmark for humor reasoning, built on an expert-annotated cartoon-caption dataset. It quantifies LLM performance along three dimensions: humor element identification, conceptual association modeling, and hypothesis generation and verification. Contribution/Results: HumorBench is the first framework to bring culturally grounded, informal reasoning tasks, such as pun interpretation and contextual inference, into standardized LLM evaluation. Empirical analysis shows that state-of-the-art models' strengths in STEM reasoning generalize well to humor understanding. Moreover, chain-of-thought prompting yields heterogeneous improvements across architectures, indicating task- and model-specific efficacy. These findings establish a novel paradigm and empirical foundation for assessing cross-domain reasoning generalization in LLMs.
📝 Abstract
We present HumorBench, a benchmark designed to evaluate large language models' (LLMs') ability to reason about and explain sophisticated humor in cartoon captions. As reasoning models increasingly saturate existing benchmarks in mathematics and science, novel and challenging evaluations of model intelligence beyond STEM domains are essential. Reasoning is fundamental to text-based humor comprehension: a model must identify connections between concepts in the cartoon and caption and external cultural references, wordplay, and other comedic mechanisms. HumorBench comprises approximately 300 unique cartoon-caption pairs from the New Yorker Caption Contest and Cartoonstock.com, each with an expert-annotated evaluation rubric identifying the joke's essential elements. LLMs are evaluated on their humor explanations and their ability to identify these joke elements. To perform well on this task, models must form and test hypotheses about associations between concepts, potentially backtracking from initial interpretations to arrive at the most plausible explanation. Our extensive benchmarking of current SOTA models reveals three key insights: (1) LLM progress on STEM reasoning transfers effectively to humor comprehension; (2) models trained exclusively on STEM reasoning data still perform well on HumorBench, demonstrating strong transferability of reasoning abilities; and (3) test-time scaling, i.e., increasing thinking-token budgets, yields mixed results across models on humor reasoning.
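The rubric-based evaluation described above (scoring a model's explanation against expert-annotated joke elements) can be sketched as follows. This is a minimal, hypothetical illustration: the `RubricItem` structure, the keyword-matching scorer, and the sample rubric are assumptions for exposition, not the paper's actual grading procedure, which would plausibly use a stronger judge than surface keyword matching.

```python
# Hypothetical sketch of rubric-based scoring: each cartoon-caption pair carries
# expert-annotated "joke elements", and an explanation is scored by the fraction
# of elements it covers. The keyword matcher is an illustrative stand-in.
from dataclasses import dataclass


@dataclass
class RubricItem:
    element: str          # essential joke element, e.g. "pun on 'bark'"
    keywords: list[str]   # surface cues that count as covering the element


def score_explanation(explanation: str, rubric: list[RubricItem]) -> float:
    """Return the fraction of rubric elements the explanation covers."""
    text = explanation.lower()
    covered = sum(
        any(kw.lower() in text for kw in item.keywords) for item in rubric
    )
    return covered / len(rubric) if rubric else 0.0


# Toy example: a (made-up) caption whose joke hinges on two annotated elements.
rubric = [
    RubricItem("wordplay on 'bark'", ["bark", "pun"]),
    RubricItem("dog behaving like a broker", ["broker", "stock", "trading"]),
]
explanation = "The caption puns on 'bark' while the dog acts as a stockbroker."
print(score_explanation(explanation, rubric))  # covers both elements -> 1.0
```

A real grader would need to handle paraphrase (an explanation can cover an element without using any anticipated keyword), which is why element coverage is more naturally judged by a model or human rater than by string matching.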