Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

📅 2026-04-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current visual generative models remain limited in spatial reasoning, state persistence, long-horizon consistency, and causal understanding, which hinders their ability to produce structurally coherent content. This work argues for a shift from appearance-based synthesis toward intelligent visual generation and introduces a five-level taxonomy of generative capability (Atomic, Conditional, In-Context, Agentic, and World-Modeling Generation) that emphasizes structure, dynamics, domain knowledge, and causality. It analyzes the key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, and synthetic data distillation. It also offers a capability-centered evaluation lens, showing that current evaluations emphasize perceptual quality while missing structural, temporal, and causal failures, and it outlines a roadmap for next-generation intelligent visual generation systems.
📝 Abstract
Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
persistent state
long-horizon consistency
causal understanding
intelligent visual generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intelligent Visual Generation
World-Modeling Generation
Agentic Generation
Causal Understanding
Capability-Centered Evaluation
Authors

Keming Wu
Ph.D. Student, Tsinghua University
Computer Vision, Vision Language Models, Generative AI

Zuhao Yang
Nanyang Technological University
video understanding, video generation

Kaichen Zhang
Nanyang Technological University
VLMs, Computer Vision, Multi-modality

Shizun Wang
National University of Singapore
Computer vision, Machine learning

Haowei Zhu
Tsinghua University
Computer Vision

Sicong Leng
Nanyang Technological University
Multi-modal Learning

Zhongyu Yang
Nanyang Technological University

Qijie Wang
School of Software, Tsinghua University

Sudong Wang
Hong Kong University of Science and Technology (Guangzhou)

Ziting Wang
StepFun

Zili Wang
StepFun LLM Researcher & M-A-P
Large Language Models, Code Intelligence

Hui Zhang
Fudan University
AIGC, Computer Vision, Anomaly Detection

Haonan Wang
PhD Student, School of Computing, National University of Singapore
Machine Learning, Generative AI, Data-Centric AI, Data Mining

Hang Zhou
Baidu Inc.
Computer Vision, Audio Processing, Multimodal Learning

Yifan Pu
Tsinghua University

Xingxuan Li
MiroMind, Nanyang Technological University, DAMO Alibaba Group
natural language processing, large language models, knowledge grounding

Fangneng Zhan
MIT
Neural Rendering, Generative Models

Bo Li
Nanyang Technological University

Lidong Bing
MiroMind, Alibaba DAMO, Tencent, CMU, CUHK
Natural Language Processing, Large Language Models, Large Multimodal Models

Yuxin Song
Baidu
Computer Vision, Vision-Language Model, Generative Model, Video Understanding

Ziwei Liu
Associate Professor, Nanyang Technological University
Computer Vision, Machine Learning, Computer Graphics

Wenhu Chen
Assistant Professor at University of Waterloo
Natural Language Processing, Artificial Intelligence, Deep Learning

Jingdong Wang
Baidu

Xinchao Wang
National University of Singapore
Machine Learning, AI, Computer Vision, Image Processing, Natural Language Processing

Xiaojuan Qi
Assistant Professor, The University of Hong Kong
3D Vision, Deep learning, Artificial Intelligence, Medical Image Analysis