AI Summary
Existing trajectory-driven video generation methods struggle to model multi-object interactions in complex robotic manipulation tasks, often producing entangled features in occluded regions and degraded visual fidelity. To address this, we propose RoboMaster, a novel framework that, for the first time, decomposes manipulation into three sequential phases: pre-interaction, interaction, and post-interaction. Within each phase, motion representations are dynamically disentangled according to the dominant agent (the robotic arm in the pre- and post-interaction phases, the manipulated object during interaction), thereby mitigating multi-object feature coupling. We further introduce an appearance- and shape-aware implicit representation to ensure semantic consistency across phases. Built on a video diffusion model, RoboMaster integrates phase-wise trajectory conditioning and is fine-tuned on the Bridge V2 dataset. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches on both Bridge V2 and real-world robotic scenarios, with simultaneous gains in trajectory controllability, visual fidelity, and motion plausibility.
Abstract
Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods focus primarily on individual object motion and struggle to capture the multi-object interactions that are crucial in complex robotic manipulation. This limitation stems from the entanglement of multiple objects' features in overlapping regions, which degrades visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the dominant object, namely the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby mitigating the multi-object feature fusion that afflicts prior work during interaction. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild evaluation, demonstrate that our method outperforms existing approaches and establishes new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
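To make the stage-wise conditioning concrete, the following minimal PyTorch sketch illustrates the idea under our own assumptions: the function name `build_condition`, the phase boundaries `t_grasp` and `t_release`, and the tensor shapes are hypothetical illustrations, not the paper's actual implementation or API.

```python
# Minimal sketch (our illustration, not RoboMaster's actual API) of
# phase-wise dominant-object conditioning: each frame's condition pairs
# the trajectory point with the latent of whichever subject dominates
# the current sub-stage.
import torch

def build_condition(traj_xy: torch.Tensor,   # (T, 2) trajectory point per frame
                    arm_feat: torch.Tensor,  # (C,) robotic-arm latent
                    obj_feat: torch.Tensor,  # (C,) manipulated-object latent
                    t_grasp: int,            # first frame of the interaction stage
                    t_release: int           # first frame of post-interaction
                    ) -> torch.Tensor:
    """Return a (T, 2 + C) per-frame condition tensor."""
    T = traj_xy.shape[0]
    feats = torch.stack([
        obj_feat if t_grasp <= t < t_release else arm_feat  # object dominates interaction,
        for t in range(T)                                   # arm dominates pre/post phases
    ])
    return torch.cat([traj_xy, feats], dim=-1)
```

The sketch only shows how the dominant subject's latent is swapped at the sub-stage boundaries; in the actual framework, such a per-frame condition would be injected into the video diffusion backbone as the trajectory control signal.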