🤖 AI Summary
Generated 3D worlds are typically static, monolithic assets that lack editability and physical interactivity, which limits their use in immersive content creation and embodied intelligence tasks. This work proposes WorldAct, an automatic conversion framework driven by a multimodal agent: it decomposes a scene to identify manipulable objects, reconstructs geometrically aligned object-level meshes, and restores the remaining background with 3D inpainting. The result is an object-centric, editable, and physically interactive environment in place of the original monolithic output. By balancing global scene consistency with local object manipulability, the method supports object-level editing, collision-aware physical interaction, and embodied task execution, broadening the interactive applications of generative 3D worlds.
📝 Abstract
Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments, yet their outputs are typically static monolithic assets with limited editability and physical interaction. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To tackle this challenge, we present WorldAct, a framework that converts static generated 3D worlds into editable and interaction-ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object-level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object-level editing, collision-aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.
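To make the idea of an object-centric, collision-aware scene more concrete, the following is a minimal sketch and not WorldAct's implementation. It assumes the converted scene can be viewed as a restored background plus a set of separately movable objects. All names here (`AABB`, `SceneObject`, `EditableScene`, `try_move`) are hypothetical, and axis-aligned bounding boxes stand in for the reconstructed object-level meshes, which a real system would hand to a physics engine.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class AABB:
    """Axis-aligned bounding box; a crude stand-in for an object mesh."""
    lo: Vec3
    hi: Vec3

    def translated(self, delta: Vec3) -> "AABB":
        # Shift both corners by the same offset.
        return AABB(
            tuple(a + d for a, d in zip(self.lo, delta)),
            tuple(a + d for a, d in zip(self.hi, delta)),
        )

    def overlaps(self, other: "AABB") -> bool:
        # Two boxes overlap only if their intervals overlap on every axis.
        return all(
            self.lo[i] < other.hi[i] and other.lo[i] < self.hi[i]
            for i in range(3)
        )


@dataclass
class SceneObject:
    name: str
    bounds: AABB


@dataclass
class EditableScene:
    background: str  # placeholder for the inpainted background asset
    objects: Dict[str, SceneObject] = field(default_factory=dict)

    def add(self, obj: SceneObject) -> None:
        self.objects[obj.name] = obj

    def try_move(self, name: str, delta: Vec3) -> bool:
        """Move an object by delta unless it would hit another object."""
        moved = self.objects[name].bounds.translated(delta)
        for other_name, other in self.objects.items():
            if other_name != name and moved.overlaps(other.bounds):
                return False  # reject the edit; the scene is unchanged
        self.objects[name].bounds = moved
        return True


scene = EditableScene(background="inpainted_room")
scene.add(SceneObject("mug", AABB((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))))
scene.add(SceneObject("book", AABB((2.0, 0.0, 0.0), (3.0, 1.0, 1.0))))

print(scene.try_move("mug", (1.5, 0.0, 0.0)))  # False: would intersect the book
print(scene.try_move("mug", (0.0, 0.0, 2.0)))  # True: lifted clear of the book
```

Keeping each object as its own entry, separate from the background, is what makes this kind of per-object edit possible; a monolithic asset offers no such handle.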