🤖 AI Summary
Problem: Autonomous robotic manipulation critically requires world models that capture the physical mechanisms and dynamics of the environment; however, existing definitions of world models remain ambiguous and their capability boundaries ill-defined, hindering both generality and practical deployment.
Method: This work adopts a functional perspective to systematically characterize the core capabilities of world models for robotic manipulation, proposing a unified, task-driven framework that integrates state representation learning, dynamics modeling, sequence prediction, closed-loop control, and model-based reinforcement learning.
Contribution: For the first time, we formally identify the essential components and functional roles of world models within the perception–prediction–control loop. Moving beyond conventional static representations, we emphasize dynamics modeling, causal reasoning, and online planning as critical capabilities. We further provide a clear capability taxonomy and a systematic construction methodology to support the development of generalizable, scalable world models for robotics.
📝 Abstract
Autonomous agents are increasingly expected to operate in complex, dynamic, and uncertain environments, performing tasks such as manipulation, navigation, and decision-making. Achieving these capabilities requires agents to understand the underlying mechanisms and dynamics of the world, moving beyond reactive control or simple replication of observed states. This motivates the development of world models: internal representations that encode environmental states, capture dynamics, and support prediction, planning, and reasoning. Despite growing interest, the definition, scope, architectures, and essential capabilities of world models remain ambiguous. In this survey, rather than prescribing a fixed definition and limiting our scope to methods explicitly labeled as world models, we review methods in robotic manipulation that exhibit the core capabilities of world models. We analyze their roles across perception, prediction, and control, identify key challenges and solutions, and distill the core components, capabilities, and functions that a fully realized world model should possess. Building on this analysis, we aim to motivate further development toward generalizable and practical world models for robotics.
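To make the roles of encoding states, capturing dynamics, and planning through prediction concrete, here is a minimal toy sketch of a closed perception–prediction–control loop. It is not a method from the survey. It assumes a hypothetical 1-D environment with linear dynamics (the coefficients 0.9 and 0.5 are arbitrary), an identity encoder standing in for state representation learning, a least-squares fit standing in for a learned dynamics model, and random-shooting model predictive control as the planner; `env_step`, `encode`, `fit_dynamics`, `plan`, and `run` are illustrative names.

```python
import random

# Toy illustration with hypothetical names; not the survey's method.
# Hypothetical "true" environment, unknown to the agent: x' = 0.9*x + 0.5*u + noise.
TRUE_A, TRUE_B = 0.9, 0.5


def env_step(x, u, rng):
    """Advance the toy 1-D environment by one step with small Gaussian noise."""
    return TRUE_A * x + TRUE_B * u + rng.gauss(0.0, 0.01)


def encode(observation):
    """Perception: map an observation to a state (identity in this toy)."""
    return observation


def fit_dynamics(transitions):
    """Prediction: fit x' ~ a*x + b*u by least squares via the 2x2 normal equations."""
    sxx = sum(x * x for x, _, _ in transitions)
    sxu = sum(x * u for x, u, _ in transitions)
    suu = sum(u * u for _, u, _ in transitions)
    sxy = sum(x * y for x, _, y in transitions)
    suy = sum(u * y for _, u, y in transitions)
    det = sxx * suu - sxu * sxu
    a = (sxy * suu - sxu * suy) / det  # Cramer's rule
    b = (sxx * suy - sxu * sxy) / det
    return a, b


def plan(state, model, goal, rng, horizon=3, samples=1000):
    """Control: random-shooting MPC; return the first action of the best imagined rollout."""
    a, b = model
    best_cost, best_action = float("inf"), 0.0
    for _ in range(samples):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        x, cost = state, 0.0
        for u in actions:
            x = a * x + b * u  # imagined step inside the learned model
            cost += (x - goal) ** 2
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]
    return best_action


def run(goal=2.0, steps=30, seed=0):
    rng = random.Random(seed)
    # 1) Collect exploratory transitions with random actions, then fit the model.
    transitions, x = [], 0.0
    for _ in range(200):
        u = rng.uniform(-1.0, 1.0)
        x_next = env_step(x, u, rng)
        transitions.append((encode(x), u, encode(x_next)))
        x = x_next
    model = fit_dynamics(transitions)
    # 2) Closed loop: perceive, plan in the model, act in the environment, repeat.
    x = 0.0
    for _ in range(steps):
        u = plan(encode(x), model, goal, rng)
        x = env_step(x, u, rng)
    return model, x


model, final_x = run()
print(f"learned a={model[0]:.3f}, b={model[1]:.3f}, final x={final_x:.3f}")
```

Replanning from the freshly observed state at every step is what closes the loop: errors in the learned model or in the sampled plan are corrected at the next step instead of accumulating. World models for real manipulation would typically replace each toy component with learned, high-dimensional counterparts such as neural encoders and latent dynamics.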