🤖 AI Summary
This work systematically evaluates the capability of modern vision-language models (VLMs) to solve rebus puzzles, a challenging multimodal abstract reasoning task that requires joint image understanding, spatial-relation reasoning, phonological and cultural punning, and visual-metaphor interpretation. To this end, the authors introduce the first manually annotated, structurally diverse benchmark of English rebus puzzles and conduct zero-shot and few-shot evaluations of leading open-source VLMs, including BLIP-2, LLaVA, and Qwen-VL. The results reveal fundamental limitations in symbolic reasoning and lateral thinking: while VLMs achieve moderate accuracy on simple icon substitutions, their performance drops below 15% on spatially dependent configurations (e.g., "head over heels") and phonological puns, far below human-level competence. Through fine-grained error attribution, the study turns the benchmark into a diagnostic tool and analytical framework for assessing multimodal abstract reasoning in VLMs.
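In practice, a zero-shot evaluation like the one described reduces to prompting a VLM with the puzzle image plus a solving instruction and comparing its answer to the gold phrase. Below is a minimal sketch using the Hugging Face `transformers` LLaVA checkpoint `llava-hf/llava-1.5-7b-hf`; the prompt wording, the file `rebus_001.png`, and the exact-match scoring are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal zero-shot rebus probe (illustrative sketch, not the paper's exact setup).
# Assumes: pip install transformers torch pillow, and a local image "rebus_001.png".
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # one of several open VLMs one could probe

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# LLaVA-1.5 chat format: the <image> token marks where visual features are injected.
prompt = (
    "USER: <image>\nThis image is a rebus puzzle encoding a common English phrase. "
    "Answer with the phrase only.\nASSISTANT:"
)

image = Image.open("rebus_001.png")  # hypothetical benchmark item
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# The decoded string includes the prompt, so keep only the model's reply.
answer = (
    processor.decode(output_ids[0], skip_special_tokens=True)
    .split("ASSISTANT:")[-1]
    .strip()
)

# Naive exact-match scoring against a gold label; real evaluations typically
# normalize case/punctuation or allow fuzzy matches.
gold = "head over heels"
print(answer, "| correct:", answer.lower().strip(".") == gold)
```

A few-shot variant would prepend worked example pairs (image plus solution) to the prompt; accuracy is then aggregated per puzzle category (icon substitution, spatial configuration, phonological pun) to produce the category-level breakdown the summary reports.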
📝 Abstract
Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multimodal abstraction, symbolic reasoning, and a grasp of cultural, phonetic, and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-crafted, annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially dependent cues (e.g., "head" placed over "heels"). We analyze the performance of different VLMs, and our findings reveal that while they exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and the interpretation of visual metaphors.