🤖 AI Summary
Existing world models for embodied intelligence perform poorly on long-horizon hybrid tasks, such as those interleaving navigation and manipulation, because they entangle world dynamics with ego dynamics. This work proposes a world-ego modeling paradigm that decouples the two, defining the boundary between them from three complementary perspectives: motion, semantics, and intention. It introduces a unified World-Ego Model (WEM) that couples an implicitly disentangled world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. The work formalizes this decoupling paradigm, analyzes three disentanglement strategies (post-, pre-, and full disentanglement), and establishes HTEWorld, the first benchmark designed for long-horizon hybrid world modeling. On HTEWorld, which comprises 125K video clips and 300 multi-turn evaluation trajectories, WEM achieves state-of-the-art performance while remaining competitive on manipulation-only benchmarks.
📝 Abstract
World models are widely explored in embodied intelligence, yet they typically predict the distinct evolutions of the world and the ego within a single stream, where the world captures persistent, instruction-agnostic scene regularities and the ego captures robot-centric, instruction-conditioned dynamics. This world-ego entanglement degrades performance in long-horizon embodied scenarios, particularly in hybrid tasks that interleave navigation and manipulation. In this paper, we introduce \emph{World-Ego Modeling}, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives (motion-, semantic-, and intention-based views) and analyze three disentanglement strategies: post-, pre-, and full disentanglement. We then instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicitly separated world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.