Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

πŸ“… 2025-06-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing trajectory-driven video generation methods struggle to model multi-object interactions in complex robotic manipulation tasks, often leading to entangled features in occluded regions and degraded visual fidelity. To address this, we propose RoboMasterβ€”a novel framework that, for the first time, decomposes manipulation into three sequential phases: pre-interaction, interaction, and post-interaction. Within each phase, motion representations are dynamically disentangled according to the dominant agent (robotic arm vs. manipulated object), thereby mitigating multi-object feature coupling. We further introduce an appearance- and shape-aware implicit representation to ensure semantic consistency across phases. Built upon a video diffusion model, RoboMaster integrates phase-wise trajectory conditioning and is fine-tuned on the Bridge V2 dataset. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches on both Bridge V2 and real-world robotic scenarios, achieving simultaneous improvements in trajectory controllability, visual fidelity, and motion plausibility.

πŸ“ Abstract
Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture the multi-object interaction crucial in complex robotic manipulation. This limitation arises from multi-feature entanglement in overlapping regions, which leads to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the features of the dominant agent, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby avoiding the multi-object feature fusion that degrades prior work during interaction. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild evaluation, demonstrate that our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
Problem

Research questions and friction points this paper is trying to address.

Modeling multi-object interaction in robotic manipulation videos
Mitigating multi-feature entanglement in overlapping regions
Ensuring subject semantic consistency in generated videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative trajectory control for multi-object interaction
Decomposing interaction into three distinct sub-stages
Appearance- and shape-aware latent representations
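The phase decomposition and dominant-agent selection described above can be sketched as follows. This is a minimal illustration based only on the abstract: the function names, the per-frame contact flag, and the feature-selection heuristic are assumptions for clarity, not the paper's actual implementation.

```python
def split_phases(contact):
    """Split frame indices into pre-interaction, interaction, and
    post-interaction segments, given a per-frame contact flag
    (True while the arm is manipulating the object)."""
    n = len(contact)
    try:
        start = contact.index(True)
    except ValueError:
        # Arm never touches the object: the whole clip is pre-interaction.
        return list(range(n)), [], []
    end = n - 1 - contact[::-1].index(True)  # last frame with contact
    pre = list(range(0, start))
    inter = list(range(start, end + 1))
    post = list(range(end + 1, n))
    return pre, inter, post


def dominant_features(arm_feat, obj_feat, contact):
    """Per frame, condition on the dominant agent's feature:
    the robotic arm before/after interaction, the manipulated
    object during interaction (hypothetical selection rule)."""
    pre, inter, post = split_phases(contact)
    feats = [None] * len(contact)
    for t in pre + post:
        feats[t] = arm_feat[t]
    for t in inter:
        feats[t] = obj_feat[t]
    return feats
```

For example, with `contact = [False, False, True, True, False]`, the clip splits into pre `[0, 1]`, interaction `[2, 3]`, and post `[4]`, so the conditioning signal switches from arm features to object features and back, which is the disentanglement the paper attributes to its collaborative trajectory formulation.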
πŸ”Ž Similar Papers
No similar papers found.