SPHINX: A Synthetic Environment for Visual Perception and Reasoning

📅 2025-11-25
🤖 AI Summary
Large vision-language models (LVLMs) exhibit substantial weaknesses in fundamental visual reasoning tasks, including symmetry detection, geometric transformations, and spatial reasoning: even state-of-the-art GPT-5 reaches only 51.1% average accuracy across 25 task categories, well below human performance. Method: The paper introduces Sphinx, a programmatically generated synthetic visual reasoning environment with verifiable ground-truth annotations, covering diagrams, geometric puzzles, and graphical reasoning tasks. Within this environment, the authors apply reinforcement learning with verifiable rewards (RLVR), a fine-tuning paradigm that optimizes LVLM behavior against task-specific, programmatically certifiable rewards. Contribution/Results: RLVR substantially improves in-domain performance and, critically, generalizes zero-shot to out-of-distribution benchmarks. The work offers a systematic diagnosis of structural deficits in LVLMs' core visual reasoning capabilities and establishes a methodology for verifiable, generalizable visual reasoning, accompanied by open-source infrastructure and high-quality synthetic data.

📝 Abstract
We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
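The abstract's key mechanism is procedural generation paired with verifiable ground truth: every puzzle's label is known by construction, so answers can be scored exactly. A minimal sketch of the idea for one task family (symmetry detection); the puzzle type, grid size, and function names here are illustrative assumptions, not Sphinx's actual API:

```python
import random

def make_symmetry_puzzle(size=6, symmetric=None, rng=None):
    """Generate a binary grid whose ground-truth label
    ("is it left-right symmetric?") is known by construction."""
    rng = rng or random.Random()
    if symmetric is None:
        symmetric = rng.random() < 0.5
    half = [[rng.randint(0, 1) for _ in range(size // 2)] for _ in range(size)]
    if symmetric:
        # Mirror each half-row so the grid is symmetric by construction.
        grid = [row + row[::-1] for row in half]
    else:
        grid = [row + [rng.randint(0, 1) for _ in range(size // 2)] for row in half]
        # Re-roll in the unlikely event the random grid is symmetric anyway.
        while all(row == row[::-1] for row in grid):
            grid = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
    return grid, symmetric

def check_answer(grid, answer):
    """Programmatic verifier: recompute symmetry and compare to the answer."""
    truth = all(row == row[::-1] for row in grid)
    return answer == truth
```

Because the label is fixed at generation time, this supports both precise large-scale evaluation and the verifiable rewards used later for RLVR fine-tuning.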
Problem

Research questions and friction points this paper is trying to address.

Sphinx procedurally creates synthetic puzzles for evaluating visual perception
It benchmarks 25 visual reasoning task types, such as symmetry detection and geometric transformations
The environment targets multimodal reasoning gaps in current LVLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Procedurally generates puzzles with verifiable ground-truth solutions
Covers 25 visual reasoning task types for evaluation
Uses reinforcement learning with verifiable rewards for improvement
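The RLVR bullet above hinges on rewards that can be certified programmatically rather than estimated by a learned reward model. A minimal sketch of such a reward function, assuming (purely for illustration, not the paper's actual format) that model outputs end with an `Answer: <value>` line:

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward for RLVR-style fine-tuning: 1.0 iff the model's
    final answer matches the programmatically known ground truth.
    The 'Answer:' convention is an illustrative assumption."""
    match = re.search(r"answer\s*:\s*(\S+)", model_output, re.IGNORECASE)
    if not match:
        return 0.0  # Unparseable output earns no reward.
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0
```

A binary, certifiable reward of this kind can plug directly into standard policy-gradient fine-tuning loops, with no reward model to train or to reward-hack.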
Md Tanvirul Alam
Rochester Institute of Technology, Rochester, NY, USA
Saksham Aggarwal
Rochester Institute of Technology, Rochester, NY, USA
Justin Yang Chae
University of Washington, Seattle, WA, USA
Nidhi Rastogi
Assistant Professor, Rochester Institute of Technology, NY
Cybersecurity · Artificial Intelligence · Autonomous Vehicles · Graph Analytics · Applied Machine Learning