🤖 AI Summary
This work addresses the fundamental trade-off in existing 3D world modeling between static scene generation and dynamic controllability: static methods lack agent interaction capabilities, while controllable-entity models operate in otherwise uncontrollable environments and support only a narrow, closed set of actions. To bridge this gap, we propose AniX, the first unified framework that jointly handles static world generation and controllable entity modeling: a natural-language-driven, end-to-end approach enabling arbitrary user-specified agents to perform diverse, long-horizon, semantically grounded actions (e.g., walking, object interaction, free exploration) within arbitrary 3D Gaussian Splatting (3DGS) scenes. Our method leverages a pretrained video generator, integrates 3DGS scene encoding with text-instruction alignment, and introduces motion-augmented conditional autoregressive modeling. Experiments demonstrate significant improvements over prior methods in visual fidelity, agent identity consistency, action controllability, and temporal coherence, with strong generalization to unseen agents and scenes.
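As a rough illustration, the pipeline can be read as an autoregressive rollout in which each clip is conditioned on the encoded 3DGS scene, the character, the current language instruction, and the previously generated clip. The sketch below is a minimal paraphrase of that loop under these assumptions; every name in it (`encode_scene`, `encode_text`, `generate_clip`, `rollout`) is a hypothetical placeholder for illustration, not the paper's actual API.

```python
# Minimal sketch of the conditional autoregressive rollout described above.
# Every name here is a hypothetical placeholder for illustration; the paper's
# actual modules and interfaces may differ.
from typing import Any, Callable, List, Optional

Clip = Any  # stands in for a short generated video clip


def rollout(
    encode_scene: Callable[[Any], Any],    # 3DGS scene -> scene condition tokens
    encode_text: Callable[[str], Any],     # instruction -> aligned text tokens
    generate_clip: Callable[..., Clip],    # pretrained video generator backbone
    scene_3dgs: Any,
    character: Any,
    instructions: List[str],
) -> List[Clip]:
    """Autoregressively generate one clip per instruction, conditioning each
    step on the scene, the character, and the previously generated clip."""
    scene_tokens = encode_scene(scene_3dgs)  # encode the scene once up front
    clips: List[Clip] = []
    for instruction in instructions:
        text_tokens = encode_text(instruction)
        # Motion-augmented conditioning: feed the latest clip back in so the
        # generator can preserve motion dynamics and character identity.
        prev: Optional[Clip] = clips[-1] if clips else None
        clips.append(
            generate_clip(
                scene=scene_tokens,
                character=character,
                instruction=text_tokens,
                context=prev,
            )
        )
    return clips
```

Conditioning each step on only the most recent clip keeps memory bounded over long horizons; whether the actual model uses a single-clip or longer context window is not specified here, so the single-clip choice above is an assumption of the sketch.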
📝 Abstract
Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce AniX, which leverages the realism and structural grounding of static world generation while extending controllable-entity models to support user-specified characters capable of performing open-ended actions. Users can provide a 3DGS scene and a character, then direct the character through natural language to perform diverse behaviors, from basic locomotion to object-centric interactions, while freely exploring the environment. We formulate this task as conditional autoregressive video generation: AniX synthesizes temporally coherent video clips that preserve visual fidelity with the provided scene and character. Built upon a pre-trained video generator, our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long-horizon coherence.
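Read as a generative model, the conditional autoregressive formulation admits the standard clip-level factorization below. The notation is assumed for illustration rather than taken from the paper: $S$ denotes the provided 3DGS scene, $C$ the character, $a_t$ the instruction at step $t$, and $v_t$ the clip generated at step $t$.

```latex
% Hedged sketch: a standard conditional autoregressive factorization,
% with notation assumed for illustration (not taken from the paper).
\[
  p\left(v_{1:T} \mid S, C, a_{1:T}\right)
    = \prod_{t=1}^{T} p_\theta\!\left(v_t \mid v_{<t},\, S,\, C,\, a_t\right)
\]
```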