🤖 AI Summary
Training robotic policies in the real world is costly and difficult to scale, while existing generative simulation approaches struggle with long-horizon tasks due to limited logical coherence and inadequate handling of dynamic physical uncertainties. This work proposes AGT-World, a novel framework that, for the first time, models the task space as an affordance-based structured graph, enabling hierarchical decomposition of complex objectives. By integrating atomic action primitives, vision-language model reasoning, and geometric verification into a hybrid feedback mechanism, AGT-World establishes a self-evolving closed loop that autonomously refines policies and generates interactive simulation environments. The approach significantly improves task success rates and generalization capabilities, thereby supporting scalable embodied intelligence learning.
📝 Abstract
Training robotic policies directly in the real world is expensive and difficult to scale. Although generative simulation enables large-scale data synthesis, current approaches often fail to generate logically coherent long-horizon tasks and, because they execute open-loop, struggle with dynamic physical uncertainties. To address these challenges, we propose Affordance-Graphed Task Worlds (AGT-World), a unified framework that autonomously constructs interactive simulated environments and corresponding robot task policies from real-world observations. Unlike methods that rely on random proposals or static replication, AGT-World formalizes the task space as a structured graph, enabling precise, hierarchical decomposition of complex goals into theoretically grounded atomic primitives. Furthermore, we introduce a Self-Evolution mechanism that autonomously refines policies using hybrid feedback, which combines Vision-Language Model reasoning with geometric verification. Extensive experiments demonstrate that our method significantly improves success rates and generalization, achieving a self-improving cycle of proposal, execution, and correction for scalable robot learning.
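
To make the idea of decomposing a goal over a structured task graph concrete, the following is a minimal illustrative sketch and not the AGT-World implementation. It assumes the graph is a directed acyclic graph whose nodes are atomic primitives and whose edges are prerequisites. The `Primitive` class, the `decompose` function, and the mug-and-drawer task are all hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Primitive:
    """Hypothetical atomic primitive: an action applied to a target object."""
    name: str
    target: str


def decompose(task_graph):
    """Order primitives so that every prerequisite comes before its dependent.

    task_graph maps each primitive to the set of primitives it depends on.
    Raises graphlib.CycleError if the prerequisites are circular.
    """
    return list(TopologicalSorter(task_graph).static_order())


# Assumed example task: "put the mug in the drawer".
OPEN = Primitive("open", "drawer")
GRASP = Primitive("grasp", "mug")
PLACE = Primitive("place", "mug")
CLOSE = Primitive("close", "drawer")

GRAPH = {
    OPEN: set(),
    GRASP: set(),
    PLACE: {OPEN, GRASP},  # the drawer must be open and the mug held
    CLOSE: {PLACE},        # close only after the mug is inside
}

plan = decompose(GRAPH)
for step in plan:
    print(step.name, step.target)
```

In this sketch the relative order of `open` and `grasp` is not fixed by the graph, since neither depends on the other. A closed-loop system of the kind described above would additionally check each executed step and revise the plan on failure, which is omitted here.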