🤖 AI Summary
This work systematically evaluates the capability of modern vision-language models (VLMs) to solve rebus puzzles, a challenging multimodal abstract reasoning task that requires joint image understanding, spatial-relation reasoning, phonological and cultural punning, and visual-metaphor interpretation. To this end, the authors introduce the first manually annotated, structurally diverse benchmark of English rebus puzzles and conduct zero-shot and few-shot evaluations of leading open-source VLMs, including BLIP-2, LLaVA, and Qwen-VL. The results reveal fundamental limitations in symbolic reasoning and lateral thinking: while VLMs achieve moderate accuracy on simple icon substitutions, their performance drops below 15% on spatially dependent configurations (e.g., "head over heels") and phonological puns, far below human-level competence. Through fine-grained error attribution, the study turns the benchmark into a diagnostic tool and analytical framework for assessing multimodal abstract reasoning in VLMs.
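In practice, a zero-shot evaluation like the one described reduces to prompting a VLM with the puzzle image plus a solving instruction and comparing its answer to the gold phrase. Below is a minimal sketch using the Hugging Face `transformers` LLaVA checkpoint `llava-hf/llava-1.5-7b-hf`; the prompt wording, the file `rebus_001.png`, and the exact-match scoring are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal zero-shot rebus probe (illustrative sketch, not the paper's exact setup).
# Assumes: pip install transformers torch pillow, and a local image "rebus_001.png".
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # one of several open VLMs one could probe

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# LLaVA-1.5 chat format: the <image> token marks where visual features are injected.
prompt = (
    "USER: <image>\nThis image is a rebus puzzle encoding a common English phrase. "
    "Answer with the phrase only.\nASSISTANT:"
)

image = Image.open("rebus_001.png")  # hypothetical benchmark item
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# The decoded string includes the prompt, so keep only the model's reply.
answer = (
    processor.decode(output_ids[0], skip_special_tokens=True)
    .split("ASSISTANT:")[-1]
    .strip()
)

# Naive exact-match scoring against a gold label; real evaluations typically
# normalize case/punctuation or allow fuzzy matches.
gold = "head over heels"
print(answer, "| correct:", answer.lower().strip(".") == gold)
```

A few-shot variant would prepend worked example pairs (image plus solution) to the prompt; accuracy is then aggregated per puzzle category (icon substitution, spatial configuration, phonological pun) to produce the category-level breakdown the summary reports.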
📝 Abstract
Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multimodal abstraction, symbolic reasoning, and a grasp of cultural, phonetic, and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-crafted, annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially dependent cues (e.g., "head" placed over "heels"). We analyze the performance of different VLMs, and our findings reveal that while they exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and the interpretation of visual metaphors.