🤖 AI Summary
Large vision-language models (LVLMs) exhibit substantial weaknesses in fundamental visual reasoning tasks—including symmetry detection, geometric transformations, and spatial reasoning—with even the strongest evaluated model (GPT-5) achieving only 51.1% average accuracy across 25 task categories, significantly below human performance.
Method: We introduce a programmatically generated synthetic visual reasoning environment that provides verifiable ground-truth annotations and covers diagrams, geometric puzzles, and graphical reasoning tasks. Within this environment, we apply Reinforcement Learning with Verifiable Rewards (RLVR), a fine-tuning paradigm that leverages task-specific, programmatically certifiable rewards to optimize LVLM behavior.
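The core idea behind "programmatically certifiable rewards" can be sketched as follows. This is an illustrative minimal example, not the paper's actual implementation: because every puzzle is generated by a program, its exact solution is known, so the reward is a simple comparison against ground truth rather than a learned reward model.

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 iff the model's final answer matches the
    certified solution (after trivial normalization), else 0.0.

    Hypothetical sketch of an RLVR-style reward; the real system's
    normalization and matching rules are task-specific.
    """
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0
```

Such a reward is noise-free and unhackable in the usual reward-model sense, which is what makes large-scale RL fine-tuning on synthetic tasks practical.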
Contribution/Results: RLVR substantially improves in-domain performance and—critically—demonstrates strong zero-shot generalization to out-of-distribution benchmarks. This work provides the first systematic diagnosis of structural deficits in LVLMs’ core visual reasoning capabilities and establishes a methodology for verifiable, generalizable visual reasoning, accompanied by open-source infrastructure and high-quality synthetic data.
📝 Abstract
We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
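To make the procedural-generation idea concrete, here is a minimal sketch of one Sphinx-style task family (the function name and task design are illustrative assumptions, not the benchmark's actual code): a symmetry-detection puzzle whose label is correct by construction, so no human annotation is needed and evaluation is exact.

```python
import random

def make_symmetry_puzzle(n=5, symmetric=None, seed=None):
    """Generate an n x n binary grid and a verifiable label for
    'is the grid left-right mirror-symmetric?'.

    Hypothetical illustration of procedural generation with certified
    ground truth: the label is computed directly from the generated
    grid, so it is correct by construction.
    """
    rng = random.Random(seed)
    if symmetric is None:
        symmetric = rng.random() < 0.5
    grid = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]
    if symmetric:
        # Enforce mirror symmetry by reflecting the left half onto the right.
        for row in grid:
            for j in range(n // 2):
                row[n - 1 - j] = row[j]
    # Ground-truth label derived from the grid itself, never guessed.
    label = all(row == row[::-1] for row in grid)
    return grid, label
```

Scaling this pattern across motifs, tiles, charts, and geometric primitives yields both an exactly scorable benchmark and an unbounded supply of RLVR training data.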