🤖 AI Summary
Current text-to-video models are limited to 2D representations and exhibit weak interactivity, failing to support the spatiotemporal environmental modeling required for robotic applications. This paper introduces a language-guided editable 4D world simulator that integrates text-to-video generation, neural radiance fields (NeRF), and multi-view consistency optimization to enable object-level manipulation, trajectory-guided video synthesis, and feature-field distillation. Its key contribution is real-time, language-instructed scene editing, performed without full re-synthesis, that preserves dynamic consistency across viewpoints. Experiments demonstrate that the system achieves high visual fidelity while significantly improving spatiotemporal controllability, editing efficiency, and suitability for robot simulation tasks.
📝 Abstract
World models that support controllable and editable spatiotemporal environments are valuable for robotics, enabling scalable training data, reproducible evaluation, and flexible task design. While recent text-to-video models generate realistic dynamics, they are constrained to 2D views and offer limited interaction. We introduce MorphoSim, a language-guided framework that generates 4D scenes with multi-view consistency and object-level controls. From natural language instructions, MorphoSim produces dynamic environments where objects can be directed, recolored, or removed, and scenes can be observed from arbitrary viewpoints. The framework integrates trajectory-guided generation with feature field distillation, allowing edits to be applied interactively without full re-generation. Experiments show that MorphoSim maintains high scene fidelity while enabling controllability and editability. The code is available at https://github.com/eric-ai-lab/Morph4D.
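To make the feature-field editing idea concrete, the sketch below shows one common pattern behind language-driven object edits: a distilled per-point feature field is compared against a text-query embedding, and points whose features are similar to the query are selected for an edit such as recoloring. This is a minimal, self-contained illustration under assumed inputs (toy feature vectors standing in for distilled CLIP-style embeddings), not MorphoSim's actual implementation; the function name, threshold, and data shapes are hypothetical.

```python
import numpy as np

def edit_by_feature_similarity(point_features, point_colors, query_embedding,
                               new_color, threshold=0.8):
    """Recolor scene points whose distilled features match a text query.

    point_features: (N, D) per-point features distilled into the field.
    point_colors:   (N, 3) RGB colors of the points.
    query_embedding:(D,)   embedding of the language query (assumed given).
    Returns (edited_colors, selection_mask).
    """
    # Cosine similarity between each point feature and the query embedding.
    feats = point_features / np.linalg.norm(point_features, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    sim = feats @ q

    # Points above the similarity threshold are treated as the queried object.
    mask = sim >= threshold

    # Apply the edit (recolor) only to the selected points; no re-generation.
    edited = point_colors.copy()
    edited[mask] = new_color
    return edited, mask

# Toy example: two points aligned with the query direction, one orthogonal.
query = np.array([1.0, 0.0, 0.0])
features = np.array([[1.0, 0.05, 0.0],   # matches the query
                     [0.0, 1.0, 0.0],    # different object
                     [0.9, 0.1, 0.0]])   # matches the query
colors = np.zeros((3, 3))
edited, mask = edit_by_feature_similarity(features, colors, query,
                                          new_color=np.array([1.0, 0.0, 0.0]))
```

Because the edit touches only the selected points' attributes, operations like recoloring or removal can be applied interactively without re-running the generative model, which is the efficiency property the abstract highlights.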