🤖 AI Summary
To address the weak adaptability and non-robust path planning of domestic robots under data-scarce conditions, this paper proposes an affordance-aware conditional flow matching framework. Methodologically: (1) it unifies visual affordance modeling and trajectory generation within the flow matching paradigm; (2) it employs parameter-efficient prompt tuning, in which learnable text prompts are prepended to a frozen vision model, decoupling perception from action generation; (3) it applies conditional flow matching to multi-task robotic manipulation, yielding stable training, fast inference, and strong generalization. Evaluated on a real-world Activities of Daily Living (ADL) dataset with 10 tasks, the prompt tuning method is competitive with, and in some cases better than, other fine-tuning protocols across data scales. Compared with alternative behavior cloning methods, flow matching trains and evaluates more stably and runs noticeably faster at inference. Its generalization performance is comparable to that of diffusion policy and marginally better in most cases.
📝 Abstract
We present a framework for assistive robot manipulation that addresses two fundamental challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort; second, effectively learning robot action trajectories by grounding them in the visual affordance model. We tackle the first challenge with a parameter-efficient prompt tuning method that prepends learnable text prompts to the frozen vision model to predict manipulation affordances in multi-task scenarios. We then learn robot action trajectories, guided by these affordances, using a supervised flow matching method. Flow matching represents a robot visuomotor policy as a conditional process that flows random waypoints to desired robot action trajectories. Finally, we introduce a real-world dataset with 10 tasks across Activities of Daily Living to test our framework. Our extensive evaluation shows that the proposed prompt tuning method for learning manipulation affordances achieves competitive performance, and even outperforms some other fine-tuning protocols across data scales, while remaining parameter-efficient. Learning multi-task robot action trajectories with flow matching yields consistently more favorable results than some alternative behavior cloning methods on several robot manipulation benchmarks. The benefits include more stable training and evaluation and noticeably faster inference, while generalization remains comparable to that of diffusion policy, with flow matching performing marginally better in most cases. Our framework seamlessly unifies affordance learning and action generation with flow matching for robot manipulation.
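To make the idea of "flowing random waypoints to desired robot action trajectories" concrete, the following is a minimal, standard-library-only sketch of flow matching on a flattened waypoint vector. It is not the paper's implementation. It assumes a straight-line path between a noise sample and a demonstration, which is one common choice and is not confirmed by the abstract. It also omits the affordance and image conditioning, and it replaces the learned velocity network with a hypothetical oracle (`make_oracle_velocity`) that knows the single target trajectory. All function names and the example waypoints are illustrative.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the straight-line path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def cfm_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target)) / len(target)


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_velocity(x1):
    """Hypothetical stand-in for a trained network: the exact velocity toward one fixed x1."""
    def velocity_fn(x, t):
        # For the straight-line path, the velocity at (x, t) that reaches x1 at t = 1.
        return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]
    return velocity_fn


rng = random.Random(0)
# Hypothetical demonstration: four 2-D waypoints flattened into one vector.
demo_trajectory = [0.0, 0.0, 0.1, 0.2, 0.3, 0.3, 0.5, 0.4]
# Random waypoints drawn from a standard normal, the starting point of the flow.
noise = [rng.gauss(0.0, 1.0) for _ in demo_trajectory]
generated = euler_sample(make_oracle_velocity(demo_trajectory), noise, num_steps=10)
```

In training, a network would be asked to predict `target_velocity(noise, demo_trajectory)` at the point `interpolate(noise, demo_trajectory, t)` for a randomly drawn `t`, with `cfm_loss` as the objective. At inference, `euler_sample` would be called with that trained network in place of the oracle, and the small, fixed number of integration steps is one plausible reason for the faster inference the abstract reports.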