Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

📅 2026-04-24
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current AI agents lack a unified, effective way to model environment dynamics, which limits their reliability in physical, digital, social, and scientific settings. This work proposes a "levels × laws" taxonomy that crosses three capability levels, Predictor (L1), Simulator (L2), and Evolver (L3), with four governing-law regimes: physical, digital, social, and scientific. Synthesizing over 400 works, the paper unifies world modeling concepts across research communities, proposes decision-centric evaluation principles, and provides a minimal reproducible evaluation package. By summarizing more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery, it analyzes methods, failure modes, and evaluation practices for each level-regime pair, and outlines a roadmap toward world models that can simulate, and ultimately reshape, the environments in which agents operate.

📝 Abstract
As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term "world model" carries different meanings across research communities. We introduce a "levels × laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.
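
The three capability levels in the abstract can be read as nested interfaces. The sketch below is only an illustration of that reading, not the paper's implementation: the class names, the scalar state, the single `gain` parameter, the clamping "law", and the update rule are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Predictor:
    """L1 (illustrative): a one-step local transition operator s' = s + gain * a."""
    gain: float = 1.0  # hypothetical single model parameter

    def step(self, state: float, action: float) -> float:
        return state + self.gain * action


@dataclass
class Simulator:
    """L2 (illustrative): composes one-step predictions into a multi-step,
    action-conditioned rollout and enforces a domain law after every step."""
    predictor: Predictor
    law: Callable[[float], float]  # maps a state onto the set the law allows

    def rollout(self, state: float, actions: List[float]) -> List[float]:
        trajectory = [state]
        for action in actions:
            state = self.law(self.predictor.step(state, action))
            trajectory.append(state)
        return trajectory


@dataclass
class Evolver:
    """L3 (illustrative): revises its own model when a prediction
    disagrees with newly observed evidence."""
    simulator: Simulator
    tolerance: float = 1e-6
    learning_rate: float = 1.0  # assumed step size for the toy update

    def observe(self, state: float, action: float, observed_next: float) -> bool:
        predicted = self.simulator.predictor.step(state, action)
        error = observed_next - predicted
        if abs(error) <= self.tolerance or action == 0:
            return False  # prediction held, or no information about gain
        # Toy revision rule: move gain toward the value that explains the evidence.
        self.simulator.predictor.gain += self.learning_rate * error / action
        return True


# Hypothetical "physical" law: positions are confined to the interval [0, 10].
def clamp(s: float) -> float:
    return min(max(s, 0.0), 10.0)


sim = Simulator(Predictor(gain=1.0), law=clamp)
print(sim.rollout(0.0, [4.0, 4.0, 4.0]))  # third step is clamped by the law

evolver = Evolver(sim)
revised = evolver.observe(state=0.0, action=1.0, observed_next=2.0)
print(revised, sim.predictor.gain)
```

In this toy setting the Simulator adds nothing but composition and constraint enforcement on top of the Predictor, and the Evolver adds nothing but a revision trigger on top of the Simulator, which mirrors the ordering of the levels described above.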
Problem

Research questions and friction points this paper is trying to address.

world modeling
environment dynamics
AI agents
predictive models
governing laws
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic World Modeling
Levels × Laws Taxonomy
Predictive Environment Models
Model-Based Reinforcement Learning
Autonomous Model Revision
Meng Chu
Hong Kong University of Science and Technology
Xuan Billy Zhang
National University of Singapore
Kevin Qinghong Lin
University of Oxford; National U. of Singapore
Vision and Language, Video Understanding, AI Agent
Lingdong Kong
National University of Singapore
Computer Vision, Deep Learning
Jize Zhang
Assistant Professor, The Hong Kong University of Science and Technology (HKUST)
Uncertainty Quantification, Storm Surge, Surrogate Modeling, Coastal Hazards
Teng Tu
National University of Singapore
Weijian Ma
National University of Singapore
Ziqi Huang
Ph.D. Student, MMLab@NTU, Nanyang Technological University
Computer Vision
Senqiao Yang
The Chinese University of Hong Kong
Wei Huang
The University of Hong Kong | NVIDIA
Efficient Deep Learning, Large Language Model, Wearable AI, Reinforcement Learning
Yeying Jin
Tencent | National University of Singapore
Computer Vision, AIGC, GenAI, MLLM, VLM
Zhefan Rao
Hong Kong University of Science and Technology
Jinhui Ye
Hong Kong University of Science and Technology
Xinyu Lin
National University of Singapore
recommendation
Xichen Zhang
The Hong Kong University of Science and Technology
Qisheng Hu
Nanyang Technological University
Shuai Yang
The Hong Kong University of Science and Technology, Guangzhou
Computer Vision, Generative Models, Efficient Deep Learning
Leyang Shen
National University of Singapore
Wei Chow
Zhejiang University
vision-language, generative AI
Yifei Dong
KTH Royal Institute of Technology
Robotic manipulation
Fengyi Wu
Unknown affiliation
Quanyu Long
Nanyang Technological University
TLNLP
Bin Xia
The Chinese University of Hong Kong
AIGC, LLM, Image Restoration, Model Compression
Shaozuo Yu
CUHK
Computer Science, Computer Vision, Natural Language Processing
Mingkang Zhu
The Chinese University of Hong Kong
Machine Learning, Large Language Model, Post Training