Coding Agent Is Good As World Simulator

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video world models often produce physically implausible results because they lack explicit physical constraints, particularly for contact stability, shape fidelity, and motion consistency. This work proposes an executable simulation framework based on multi-agent collaboration: it translates natural language instructions into structured scene plans and iteratively refines executable physics simulation code through four specialized agents (planner, code generator, visual reviewer, and physics analyzer). By replacing implicit video generation with explicit, interpretable code, the method aims for both high fidelity to user instructions and consistency with physical constraints. The reported experiments show the framework outperforming advanced video generation models in physical accuracy, instruction following, and visual quality, with applications demonstrated in driving simulation and embodied robotics tasks.

📝 Abstract

World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are often not physically plausible, exhibiting unstable contacts, distorted shapes, or inconsistent motion. In this paper, we present an agentic framework that constructs physics-based world models through executable simulation code. The framework coordinates planning, code generation, visual review, and physics analysis agents. The planning agent converts the natural language prompt into a structured scene plan, the code agent implements it as executable simulation code, the visual review agent provides visual feedback, and the physics analysis agent checks physical consistency. The code is iteratively revised based on this feedback until the simulation matches the prompt requirements and physical constraints. Experimental results show that our framework outperforms advanced video-based models in physical accuracy, instruction fidelity, and visual quality, and that it can be applied to various scenarios, including driving simulation and embodied robot tasks.
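The plan, generate, review, and revise loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (ScenePlan, SimCode, plan, run, visual_review, physics_analysis, revise, build_world) is hypothetical, the "agents" are hand-written rules rather than language models, and the "generated code" is reduced to two tunable parameters of a one-dimensional bouncing-ball simulation. The first draft is deliberately flawed so that both reviewers have something to report.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class ScenePlan:
    """Hypothetical structured plan a planning agent might emit."""
    prompt: str
    gravity: float = -9.81      # m/s^2
    drop_height: float = 2.0    # m
    frame_height: float = 2.5   # m, top of the rendered view (assumed)
    steps: int = 400
    dt: float = 0.005           # s


@dataclass(frozen=True)
class SimCode:
    """Stand-in for generated simulation code: just two parameters."""
    restitution: float
    clamp_ground: bool


def plan(prompt: str) -> ScenePlan:
    # A real planner would parse the prompt; this stub returns defaults.
    return ScenePlan(prompt=prompt)


def run(scene: ScenePlan, code: SimCode) -> List[float]:
    """Semi-implicit Euler integration of a ball dropped onto a floor at y = 0."""
    y, v = scene.drop_height, 0.0
    heights = []
    for _ in range(scene.steps):
        v += scene.gravity * scene.dt
        y += v * scene.dt
        if y < 0.0 and v < 0.0:          # contact with the floor
            v = -code.restitution * v
            if code.clamp_ground:
                y = 0.0
        heights.append(y)
    return heights


def visual_review(scene: ScenePlan, heights: List[float]) -> List[str]:
    # Toy proxy for visual feedback: the ball must stay inside the frame.
    return ["out_of_frame"] if max(heights) > scene.frame_height else []


def physics_analysis(scene: ScenePlan, heights: List[float]) -> List[str]:
    issues = []
    if min(heights) < -1e-6:                  # ball sank into the floor
        issues.append("penetration")
    if max(heights) > scene.drop_height:      # bounced higher than it was dropped
        issues.append("energy_gain")
    return issues


def revise(code: SimCode, feedback: List[str]) -> SimCode:
    # Rule-based stand-in for the code agent's revision step.
    if "penetration" in feedback:
        code = replace(code, clamp_ground=True)
    if "energy_gain" in feedback or "out_of_frame" in feedback:
        code = replace(code, restitution=0.8)
    return code


def build_world(prompt: str, max_rounds: int = 5) -> Tuple[SimCode, int]:
    scene = plan(prompt)
    code = SimCode(restitution=1.3, clamp_ground=False)  # flawed first draft
    for round_idx in range(1, max_rounds + 1):
        heights = run(scene, code)
        feedback = visual_review(scene, heights) + physics_analysis(scene, heights)
        if not feedback:
            return code, round_idx
        code = revise(code, feedback)
    raise RuntimeError("no consistent simulation found within max_rounds")


if __name__ == "__main__":
    final_code, rounds = build_world("a ball bounces on the floor")
    print(final_code, rounds)
```

The point of the sketch is the control flow: the simulation is rerun after each revision, and the loop stops only when neither reviewer reports an issue. In the framework described above, the reviewers would inspect rendered output and simulation state rather than a list of heights.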

Problem

Research questions and friction points this paper is trying to address.

world models

physical plausibility

video-based simulation

physics constraints

dynamic consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic framework

physics-based world model

executable simulation code