Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical gap in current vision-language models, which, despite strong performance on standard benchmarks, often rely on superficial recognition and lack the capacity for deep reasoning over implicit visual cues. To evaluate abstract reasoning and cross-lingual image-to-phrase inference, the authors introduce a novel benchmark based on visual riddles, comprising 1,343 unstructured, multilingual puzzles that emphasize hypothesis generation and revision, non-literal mappings, distractors, and contextual relationships. Leveraging an open-ended, human-aligned evaluation protocol augmented with lightweight assistance mechanisms, the study systematically assesses the reasoning processes of state-of-the-art multilingual vision-language models. Experimental results reveal a striking limitation: even the best models achieve only 60.27% accuracy, underscoring their substantial shortcomings in abstract and cross-lingual reasoning tasks and highlighting the benchmark’s rigor and necessity.

📝 Abstract
Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks, yet often rely on surface-level recognition rather than deeper reasoning. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts in ways that are difficult to solve via literal grounding, OCR-heavy shortcuts, or simple retrieval-style matching. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding. Eye-Q contains 1,343 puzzles in which a model observes a conceptually dense scene with a brief description and must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention, abstraction, and associative inference. The benchmark spans English, Persian, Arabic, and cross-lingual puzzles. We evaluate state-of-the-art VLMs using an open-ended, human-aligned protocol that probes hypothesis formation and revision under lightweight assistance. Results reveal substantial performance gaps, especially on abstract and cross-lingual puzzles, highlighting limitations in current models' ability to construct and search over appropriate conceptual representations for flexible image-to-phrase inference; maximum accuracy reaches only 60.27%.
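For intuition about how such an open-ended image-to-phrase protocol can be scored, the sketch below loops over puzzles, queries a VLM, allows one retry with a hint, and computes accuracy. It is only an illustration: the puzzle schema, the `query_vlm` call, and the hint text are assumptions, and simple normalized string matching stands in for the paper's human-aligned judging and its actual lightweight-assistance mechanism.

```python
# Minimal sketch of an open-ended evaluation loop in the spirit of Eye-Q.
# All names here (query_vlm, puzzle fields, hint text) are hypothetical
# stand-ins, not the authors' implementation.
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip accents/punctuation for loose answer matching."""
    text = unicodedata.normalize("NFKD", text).casefold().strip()
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())


def evaluate(puzzles, query_vlm, max_attempts=2):
    """Score open-ended answers; puzzles are assumed to be dicts with
    'image', 'description', and 'answer' keys."""
    correct = 0
    for puzzle in puzzles:
        prompt = (
            "Solve this visual word puzzle. "
            f"Scene description: {puzzle['description']} "
            "Answer with a single word or short phrase."
        )
        solved = False
        for _ in range(max_attempts):
            prediction = query_vlm(image=puzzle["image"], prompt=prompt)
            if normalize(prediction) == normalize(puzzle["answer"]):
                solved = True
                break
            # Placeholder for lightweight assistance: nudge the model to
            # revise its hypothesis before the next attempt.
            prompt += " Hint: the answer maps non-literally to the visual cues."
        correct += solved
    return correct / len(puzzles)  # overall accuracy, as reported in the paper
```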
Problem

Research questions and friction points this paper is trying to address.

visual word puzzles
image-to-phrase reasoning
vision-language models
multilingual benchmark
conceptual representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual word puzzles
image-to-phrase reasoning
multilingual benchmark
conceptual abstraction
vision-language models
🔎 Similar Papers
No similar papers found.
Ali Najar
Computer Engineering Department, Sharif University of Technology, Iran
Alireza Mirrokni
Computer Engineering Department, Sharif University of Technology, Iran
Arshia Izadyari
Computer Engineering Department, Sharif University of Technology, Iran
Sadegh Mohammadian
Computer Engineering Department, Sharif University of Technology, Iran
Amir Homayoon Sharifizade
Computer Engineering Department, Sharif University of Technology, Iran
Asal Meskin
Computer Engineering Department, Sharif University of Technology, Iran
Mobin Bagherian
Computer Engineering Department, Sharif University of Technology, Iran
Ehsaneddin Asgari
Scientist at QCRI, UC Berkeley PhD Alum., Prev@ Helmholtz Center, MIT-CSAIL, MIT-BCS, LMU, EPFL, SUT
Natural Language Processing · Bioinformatics · Deep Learning · Digital Humanities · Machine Learning