Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical gap in current vision-language models, which, despite strong performance on standard benchmarks, often rely on superficial recognition and lack the capacity for deep reasoning over implicit visual cues. To evaluate abstract reasoning and cross-lingual image-to-phrase inference, the authors introduce a novel benchmark based on visual riddles, comprising 1,343 unstructured, multilingual puzzles that emphasize hypothesis generation and revision, non-literal mappings, distractors, and contextual relationships. Leveraging an open-ended, human-aligned evaluation protocol augmented with lightweight assistance mechanisms, the study systematically assesses the reasoning processes of state-of-the-art multilingual vision-language models. Experimental results reveal a striking limitation: even the best models achieve only 60.27% accuracy, underscoring their substantial shortcomings in abstract and cross-lingual reasoning tasks and highlighting the benchmark’s rigor and necessity.

📝 Abstract
Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks, yet often rely on surface-level recognition rather than deeper reasoning. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts in ways that are difficult to solve via literal grounding, OCR-heavy shortcuts, or simple retrieval-style matching. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding. Eye-Q contains 1,343 puzzles in which a model observes a conceptually dense scene with a brief description and must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention, abstraction, and associative inference. The benchmark spans English, Persian, Arabic, and cross-lingual puzzles. We evaluate state-of-the-art VLMs using an open-ended, human-aligned protocol that probes hypothesis formation and revision under lightweight assistance. Results reveal substantial performance gaps, especially on abstract and cross-lingual puzzles, highlighting limitations in current models' ability to construct and search over appropriate conceptual representations for flexible image-to-phrase inference; maximum accuracy reaches only 60.27%.
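For intuition about how such an open-ended image-to-phrase protocol can be scored, the sketch below loops over puzzles, queries a VLM, allows one retry with a hint, and computes accuracy. It is only an illustration: the puzzle schema, the `query_vlm` call, and the hint text are assumptions, and simple normalized string matching stands in for the paper's human-aligned judging and its actual lightweight-assistance mechanism.

```python
# Minimal sketch of an open-ended evaluation loop in the spirit of Eye-Q.
# All names here (query_vlm, puzzle fields, hint text) are hypothetical
# stand-ins, not the authors' implementation.
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip accents/punctuation for loose answer matching."""
    text = unicodedata.normalize("NFKD", text).casefold().strip()
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())


def evaluate(puzzles, query_vlm, max_attempts=2):
    """Score open-ended answers; puzzles are assumed to be dicts with
    'image', 'description', and 'answer' keys."""
    correct = 0
    for puzzle in puzzles:
        prompt = (
            "Solve this visual word puzzle. "
            f"Scene description: {puzzle['description']} "
            "Answer with a single word or short phrase."
        )
        solved = False
        for _ in range(max_attempts):
            prediction = query_vlm(image=puzzle["image"], prompt=prompt)
            if normalize(prediction) == normalize(puzzle["answer"]):
                solved = True
                break
            # Placeholder for lightweight assistance: nudge the model to
            # revise its hypothesis before the next attempt.
            prompt += " Hint: the answer maps non-literally to the visual cues."
        correct += solved
    return correct / len(puzzles)  # overall accuracy, as reported in the paper
```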
Problem

Research questions and friction points this paper is trying to address.

visual word puzzles
image-to-phrase reasoning
vision-language models
multilingual benchmark
conceptual representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual word puzzles
image-to-phrase reasoning
multilingual benchmark
conceptual abstraction
vision-language models
🔎 Similar Papers
No similar papers found.
Ali Najar
Computer Engineering Department, Sharif University of Technology, Iran
Alireza Mirrokni
Computer Engineering Department, Sharif University of Technology, Iran
Arshia Izadyari
Computer Engineering Department, Sharif University of Technology, Iran
Sadegh Mohammadian
Computer Engineering Department, Sharif University of Technology, Iran
Amir Homayoon Sharifizade
Computer Engineering Department, Sharif University of Technology, Iran
Asal Meskin
Computer Engineering Department, Sharif University of Technology, Iran
Mobin Bagherian
Computer Engineering Department, Sharif University of Technology, Iran
Ehsaneddin Asgari
Scientist at QCRI, UC Berkeley PhD Alum., Prev@ Helmholtz Center, MIT-CSAIL, MIT-BCS, LMU, EPFL, SUT
Natural Language Processing · Bioinformatics · Deep Learning · Digital Humanities · Machine Learning