MMGR: Multi-Modal Generative Reasoning

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Current video foundation models achieve high perceptual quality but show severe deficiencies in reasoning, particularly in physical plausibility, logical consistency, and spatiotemporal coherence, when deployed as generative world simulators. Standard evaluation metrics (e.g., FVD) overemphasize perceptual fidelity while neglecting causal validity and constraint satisfaction. Method: We introduce the first multimodal reasoning benchmark tailored to generative world models, built on a five-dimensional evaluation framework covering physical, logical, 2D spatial, 3D spatial, and temporal reasoning. The benchmark integrates abstract reasoning (ARC-AGI, Sudoku), embodied navigation, and physics commonsense tasks, alongside a novel video-image joint correctness metric and global consistency criteria. Results: Experiments expose fundamental limitations of state-of-the-art models, including near-zero performance on abstract reasoning (below 10% accuracy) and long-horizon planning, confirming a reliance on superficial statistics rather than causal modeling or state-consistent simulation.
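
The joint correctness metric is not spelled out here, but the summary implies strict scoring across modalities. Below is a minimal sketch of one plausible reading, in which a sample earns credit only if every per-sample check passes; all field and function names are hypothetical, not from the paper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SampleChecks:
    # Per-sample correctness checks; field names are illustrative, not from the paper.
    video_correct: bool        # generated video satisfies the task constraint
    image_correct: bool        # image output (e.g., the final frame) matches the target
    globally_consistent: bool  # no state contradictions across the clip

def joint_correct(s: SampleChecks) -> bool:
    # Joint correctness: credit only when ALL checks pass, so a visually
    # plausible but causally wrong generation scores zero.
    return s.video_correct and s.image_correct and s.globally_consistent

def joint_accuracy(samples: List[SampleChecks]) -> float:
    # Benchmark-level accuracy under the strict AND aggregation.
    return sum(joint_correct(s) for s in samples) / len(samples) if samples else 0.0
```

The AND aggregation is what separates such a metric from FVD-style scores: partial perceptual plausibility cannot compensate for a causal or consistency failure.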

📝 Abstract
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Fréchet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing substantial performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations of current models, including overreliance on perceptual data, weak global state consistency, and training objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
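
The ARC-AGI domain makes the strictness of "holistic correctness" concrete: ARC tasks are scored by exact match of the discrete output grid, so near-misses earn nothing. A short sketch, under the assumption that the predicted grid has already been parsed out of the generated image or frame (the parsing step is ours, not the paper's):

```python
import numpy as np

def arc_correct(pred_grid: np.ndarray, target_grid: np.ndarray) -> bool:
    # ARC-AGI scoring is exact match: every cell of the predicted output grid
    # must equal the target. For a generative model, pred_grid would first be
    # parsed from the generated frame (parsing step assumed, not shown).
    return pred_grid.shape == target_grid.shape and bool((pred_grid == target_grid).all())

# Example: an off-by-one-cell prediction scores 0, not "mostly right".
target = np.array([[1, 0], [0, 1]])
pred = np.array([[1, 0], [0, 2]])
assert arc_correct(target, target) is True
assert arc_correct(pred, target) is False
```

This all-or-nothing grid scoring is consistent with the reported below-10-percent ARC-AGI accuracy: models that produce visually plausible but cell-inexact grids receive no credit.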
Problem

Research questions and friction points this paper is trying to address.

Evaluates video models' reasoning on physical, logical, and spatial constraints
Benchmarks models across abstract, embodied, and commonsense reasoning domains
Reveals performance gaps in abstract reasoning and long-horizon spatial planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal evaluation framework covering five reasoning abilities (see the score-vector sketch after this list)
Fine-grained metrics for holistic video and image correctness
Unified diagnostic benchmark for generative world models
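
Read as a data structure, the five-ability framework amounts to a per-model score vector with one entry per reasoning dimension. A minimal sketch with assumed key names and an assumed unweighted aggregation (the paper's actual aggregation rule is not given here):

```python
from typing import TypedDict

class MMGRScores(TypedDict):
    # One accuracy per MMGR reasoning dimension; key names are assumptions.
    physical: float
    logical: float
    spatial_2d: float
    spatial_3d: float
    temporal: float

def mean_score(scores: MMGRScores) -> float:
    # Unweighted mean over the five dimensions (aggregation rule assumed).
    return sum(scores.values()) / len(scores)
```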