ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

📅 2026-03-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current visual generative models exhibit limited capabilities in physical, causal, and complex spatial reasoning, yet existing evaluation methods often rely on superficial metrics or fragmented benchmarks that fail to accurately assess true reasoning proficiency. To address this gap, this work proposes a unified evaluation framework for visual generative reasoning, featuring cross-modal (image–video) task design, dual-track assessment of both generation processes and outputs, an evidence-driven automatic scoring mechanism, and fine-grained analysis grounded in cognitive dimensions. Experiments across more than twenty state-of-the-art models reveal substantial deficiencies even in the most advanced systems, thereby demonstrating the effectiveness and necessity of the proposed framework as a “stress test” for next-generation intelligent visual models.

📝 Abstract

Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at https://vincenthancoder.github.io/ViGoR-Bench/

Problem

Research questions and friction points this paper is trying to address.

visual generative models

zero-shot visual reasoning

reasoning evaluation

performance mirage

cognitive reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual generative reasoning

zero-shot evaluation

cross-modal benchmark

process-aware assessment

cognitive diagnostic analysis

🔎 Similar Papers

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

2024-06-14, arXiv.org, Citations: 15

Have Large Vision-Language Models Mastered Art History?

2024-09-05, arXiv.org, Citations: 2

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

2024-02-09, European Conference on Computer Vision, Citations: 29