🤖 AI Summary
Existing object-pushing methods typically rely on multi-step push plans built from predefined pushing primitives with narrow applicability, which limits their efficiency and versatility. This work proposes a unified pushing policy that incorporates a lightweight visual prompting mechanism into a flow matching policy to generate reactive, multimodal pushing actions. Because the visual prompt can be specified by a high-level planner, the same policy can be reused as a low-level primitive across a wide range of planning problems. In experiments, the policy outperforms existing baselines and serves as a low-level primitive within a vision-language model (VLM)-guided planning framework that solves table-cleaning tasks efficiently.
📝 Abstract
As one of the simplest non-prehensile manipulation skills, pushing has been widely studied as an effective means to rearrange objects. Existing approaches, however, typically rely on multi-step push plans composed of pre-defined pushing primitives with limited application scopes, which restrict their efficiency and versatility across different scenarios. In this work, we propose a unified pushing policy that incorporates a lightweight prompting mechanism into a flow matching policy to guide the generation of reactive, multimodal pushing actions. The visual prompt can be specified by a high-level planner, enabling the reuse of the pushing policy across a wide range of planning problems. Experimental results demonstrate that the proposed unified pushing policy not only outperforms existing baselines but also effectively serves as a low-level primitive within a VLM-guided planning framework to solve table-cleaning tasks efficiently.
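To make the idea of prompt-conditioned flow matching concrete, the following is a minimal sketch of how an action could be sampled, not the authors' implementation. It assumes the prompt is reduced to a 2D goal point and replaces the learned velocity network with a closed-form stand-in; `toy_velocity`, `sample_push_action`, and the prompt encoding are all hypothetical names and choices introduced only for illustration. Sampling starts from Gaussian noise at t = 0 and integrates the velocity field to t = 1 with Euler steps.

```python
import random


def toy_velocity(action, t, goal):
    """Hypothetical stand-in for a learned velocity network v(a, t | prompt).

    For a straight-line probability path that ends at a single target, the
    exact velocity is (goal - a) / (1 - t). A trained policy would also
    condition on observations and could represent several action modes;
    both are omitted in this toy version.
    """
    return [(g - a) / (1.0 - t) for a, g in zip(action, goal)]


def sample_push_action(goal, num_steps=10, seed=0):
    """Integrate the velocity field from noise (t=0) to an action (t=1)."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in goal]  # a_0 drawn from N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        velocity = toy_velocity(action, t, goal)
        action = [a + dt * v for a, v in zip(action, velocity)]
    return action


# Assumption: the planner's prompt is encoded as a 2D push target.
prompt_goal = [0.3, -0.1]
push_action = sample_push_action(prompt_goal)
```

With this toy velocity field every noise sample is transported onto the prompted goal, so changing only the prompt changes the generated action. A learned field would instead map noise to a distribution over valid pushes, which is where the multimodality described in the abstract would come from.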