🤖 AI Summary
Large vision-language models (LVLMs) exhibit substantial weaknesses in fundamental visual reasoning tasks—including symmetry detection, geometric transformations, and spatial reasoning—with even the strongest evaluated model (GPT-5) achieving only 51.1% average accuracy across 25 task categories, significantly below human performance.
Method: We introduce a programmatically generated synthetic visual reasoning environment that provides verifiable ground-truth annotations and covers diagrams, geometric puzzles, and graphical reasoning tasks. Within this environment, we apply Reinforcement Learning with Verifiable Rewards (RLVR), a fine-tuning paradigm that leverages task-specific, programmatically certifiable rewards to optimize LVLM behavior.
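The core idea behind "programmatically certifiable rewards" can be sketched as follows. This is an illustrative minimal example, not the paper's actual implementation: because every puzzle is generated by a program, its exact solution is known, so the reward is a simple comparison against ground truth rather than a learned reward model.

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 iff the model's final answer matches the
    certified solution (after trivial normalization), else 0.0.

    Hypothetical sketch of an RLVR-style reward; the real system's
    normalization and matching rules are task-specific.
    """
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0
```

Such a reward is noise-free and unhackable in the usual reward-model sense, which is what makes large-scale RL fine-tuning on synthetic tasks practical.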
Contribution/Results: RLVR substantially improves in-domain performance and—critically—demonstrates strong zero-shot generalization to out-of-distribution benchmarks. This work provides the first systematic diagnosis of structural deficits in LVLMs’ core visual reasoning capabilities and establishes a methodology for verifiable, generalizable visual reasoning, accompanied by open-source infrastructure and high-quality synthetic data.
📝 Abstract
We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
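To make the procedural-generation idea concrete, here is a minimal sketch of one Sphinx-style task family (the function name and task design are illustrative assumptions, not the benchmark's actual code): a symmetry-detection puzzle whose label is correct by construction, so no human annotation is needed and evaluation is exact.

```python
import random

def make_symmetry_puzzle(n=5, symmetric=None, seed=None):
    """Generate an n x n binary grid and a verifiable label for
    'is the grid left-right mirror-symmetric?'.

    Hypothetical illustration of procedural generation with certified
    ground truth: the label is computed directly from the generated
    grid, so it is correct by construction.
    """
    rng = random.Random(seed)
    if symmetric is None:
        symmetric = rng.random() < 0.5
    grid = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]
    if symmetric:
        # Enforce mirror symmetry by reflecting the left half onto the right.
        for row in grid:
            for j in range(n // 2):
                row[n - 1 - j] = row[j]
    # Ground-truth label derived from the grid itself, never guessed.
    label = all(row == row[::-1] for row in grid)
    return grid, label
```

Scaling this pattern across motifs, tiles, charts, and geometric primitives yields both an exactly scorable benchmark and an unbounded supply of RLVR training data.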