Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large vision-language models struggle with multi-step cognitive reasoning tasks that require integrating visual cues, linguistic priors, and abstract mappings. To address this limitation, this work proposes RebusBench, the first benchmark designed to systematically evaluate the neurosymbolic cognitive reasoning capabilities of vision-language models under implicit visual conditions. RebusBench comprises 1,164 rebus images that demand semantic composition through the fusion of perceptual input and world knowledge. Experimental results reveal that state-of-the-art models, including Qwen, InternVL, and LLaVA, achieve less than 10% exact-match accuracy and below 20% semantic accuracy on this benchmark. Moreover, neither scaling model size nor in-context learning yields significant performance gains, underscoring a fundamental deficiency in current architectures' ability to perform multimodal abstract mapping.
📝 Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
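The two reported metrics can be illustrated with a minimal sketch. Note the paper's exact normalization and semantic-matching procedure are not specified on this page, so the `normalize` and `semantic_match` helpers below are hypothetical stand-ins (token-overlap Jaccard in place of whatever embedding- or judge-based criterion the benchmark actually uses):

```python
def normalize(text: str) -> str:
    """Lowercase and strip punctuation before comparison (assumed normalization)."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def exact_match(prediction: str, answer: str) -> bool:
    """Exact Match: normalized prediction must equal the gold answer exactly."""
    return normalize(prediction) == normalize(answer)

def semantic_match(prediction: str, answer: str, threshold: float = 0.5) -> bool:
    """Illustrative semantic accuracy via token-overlap (Jaccard) similarity.
    The benchmark's actual semantic criterion may differ."""
    p = set(normalize(prediction).split())
    a = set(normalize(answer).split())
    if not p or not a:
        return False
    return len(p & a) / len(p | a) >= threshold

def accuracy(predictions, answers, matcher) -> float:
    """Fraction of (prediction, answer) pairs accepted by `matcher`."""
    return sum(matcher(p, g) for p, g in zip(predictions, answers)) / len(answers)
```

Under these definitions, a prediction like "a piece of cake" would fail Exact Match against the gold answer "piece of cake" but pass the looser semantic check, which is consistent with the abstract's gap between the sub-10% Exact Match and sub-20% semantic accuracy figures.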
Problem

Research questions and friction points this paper is trying to address.

visual reasoning
rebus puzzle
cognitive gap
neurosymbolic reasoning
implicit meaning
Innovation

Methods, ideas, or system contributions that make the work stand out.

rebus reasoning
cognitive visual reasoning
neurosymbolic integration
vision-language models
implicit semantic understanding