🤖 AI Summary
Current visual generative models show significant limitations in spatial reasoning, state persistence, long-horizon consistency, and causal understanding, which limits their ability to produce structurally coherent, causally plausible content. This work argues for a shift from appearance-based synthesis toward intelligent visual generation and introduces a five-level taxonomy of generative capability, from atomic generation to world modeling, that emphasizes grounding in structure, dynamics, domain knowledge, and causality. It analyzes key technical drivers, including unified understanding-and-generation models, flow matching, improved visual representations, post-training optimization, and synthetic data distillation. It also argues that current evaluations over-rely on perceptual quality metrics while missing structural, temporal, and causal failures, and it offers a capability-centered roadmap for understanding, evaluating, and advancing next-generation intelligent visual generation systems.
📝 Abstract
Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.