🤖 AI Summary
In robot imitation learning, direct mapping from visual observations to actions remains challenging due to the inherent ambiguity and indirectness of visual-to-motor correspondence.
Method: This paper proposes a novel “motion-before-action” paradigm: first inferring a sequence of future object poses from visual input, then generating manipulation actions conditioned on this predicted motion trajectory. The authors introduce MBA (Motion Before Action), a plug-and-play dual-diffusion module that decouples object motion representation learning from action policy modeling, realized as a cascaded diffusion framework: one diffusion process for vision-driven pose forecasting, followed by a second for action generation conditioned on the forecast.
Contribution/Results: The method significantly improves performance on manipulation tasks—including grasping, pushing, and pulling—in both simulation and real-robot experiments. It is compatible with existing diffusion-based policies, exhibits strong generalization across objects and scenes, and offers flexible deployment due to its modular architecture.
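The cascaded inference described above can be sketched in code. The snippet below is a toy illustration, not the paper's actual architecture: the learned denoising networks are replaced by random linear maps, and all names (`denoise`, `cascaded_inference`, the dimensions) are hypothetical. It only shows the two-stage data flow: stage 1 denoises a future object-pose trajectory conditioned on the observation; stage 2 denoises an action trajectory conditioned on the observation plus the predicted poses.

```python
# Toy sketch of MBA-style cascaded diffusion inference (illustrative only;
# real denoisers would be trained networks with a proper noise schedule).
import numpy as np

rng = np.random.default_rng(0)
HORIZON, POSE_DIM, ACT_DIM, OBS_DIM, STEPS = 8, 7, 7, 16, 10

# Stand-in "denoisers": random linear maps playing the role of learned nets.
W_pose = rng.normal(scale=0.1, size=(OBS_DIM + POSE_DIM, POSE_DIM))
W_act = rng.normal(scale=0.1, size=(OBS_DIM + POSE_DIM + ACT_DIM, ACT_DIM))

def denoise(x, cond, W):
    """One toy denoising step: predict noise from (cond, x), subtract a bit."""
    eps_hat = np.concatenate([cond, x]) @ W
    return x - 0.1 * eps_hat

def cascaded_inference(obs):
    # Stage 1: diffuse an object-pose trajectory, conditioned on observation.
    poses = rng.normal(size=(HORIZON, POSE_DIM))
    for _ in range(STEPS):
        poses = np.stack([denoise(p, obs, W_pose) for p in poses])
    # Stage 2: diffuse actions, conditioned on observation + predicted pose.
    acts = rng.normal(size=(HORIZON, ACT_DIM))
    for _ in range(STEPS):
        acts = np.stack([denoise(a, np.concatenate([obs, p]), W_act)
                         for a, p in zip(acts, poses)])
    return poses, acts

obs = rng.normal(size=OBS_DIM)
poses, actions = cascaded_inference(obs)
print(poses.shape, actions.shape)  # (8, 7) (8, 7)
```

Because the action head only consumes the predicted pose sequence as a conditioning signal, the motion stage can in principle be bolted onto any diffusion-based policy, which is the "plug-and-play" property the summary describes.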
📝 Abstract
Inferring object motion representations from observations enhances the performance of robotic manipulation tasks. This paper introduces a new paradigm for robot imitation learning that generates action sequences by reasoning about object motion from visual observations. We propose MBA (Motion Before Action), a novel module that employs two cascaded diffusion processes for object motion generation and robot action generation under object motion guidance. MBA first predicts the future pose sequence of the object based on observations, then uses this sequence as a condition to guide robot action generation. Designed as a plug-and-play component, MBA can be flexibly integrated into existing robotic manipulation policies with diffusion action heads. Extensive experiments in both simulated and real-world environments demonstrate that our approach substantially improves the performance of existing policies across a wide range of manipulation tasks. Project page: https://selen-suyue.github.io/MBApage/