🤖 AI Summary
To address the poor generalization of visuomotor policies in bimanual manipulation and error accumulation in long-horizon tasks, this paper proposes SViP, a framework that integrates visuomotor policies into task and motion planning (TAMP). SViP uses a semantic scene-graph monitor to segment demonstrations, combines TAMP-informed constraint modeling with object-centric parameterized motion primitives, and trains a switching-condition generator, enabling policy learning without object pose estimation. Evaluated on real-world bimanual manipulation with only 20 demonstrations, SViP generalizes to out-of-distribution initial states and unseen tasks, reliably executing complex long-horizon manipulation sequences and surpassing state-of-the-art generative imitation learning approaches in both robustness and compositional generalization.
📝 Abstract
Imitation learning (IL), particularly when leveraging high-dimensional visual inputs for policy training, has proven intuitive and effective in complex bimanual manipulation tasks. Nonetheless, the generalization capability of visuomotor policies remains limited, especially when only small demonstration datasets are available. Accumulated errors in visuomotor policies significantly hinder their ability to complete long-horizon tasks. To address these limitations, we propose SViP, a framework that seamlessly integrates visuomotor policies into task and motion planning (TAMP). SViP partitions human demonstrations into bimanual and unimanual operations using a semantic scene graph monitor. Continuous decision variables from the key scene graph are employed to train a switching condition generator. This generator produces parameterized scripted primitives that ensure reliable performance even when encountering out-of-distribution observations. Using only 20 real-world demonstrations, we show that SViP enables visuomotor policies to generalize across out-of-distribution initial conditions without requiring object pose estimators. For previously unseen tasks, SViP automatically discovers effective solutions to achieve the goal, leveraging constraint modeling in the TAMP formalism. In real-world experiments, SViP outperforms state-of-the-art generative IL methods, indicating wider applicability to more complex tasks. Project website: https://sites.google.com/view/svip-bimanual
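To make the scene-graph-monitoring idea concrete, here is a minimal, hypothetical sketch of how symbolic relations between consecutive scene graphs could be compared to detect a switching point between policy segments. All names (`Relation`, `diff_relations`, `should_switch`) are illustrative assumptions, not SViP's actual API.

```python
# Hypothetical sketch: detect a policy/primitive switching point by diffing
# the symbolic relations of two consecutive semantic scene graphs.
# This is an illustration of the general idea, not the paper's implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A symbolic edge in the scene graph, e.g. (cup, on, table)."""
    subject: str
    predicate: str
    obj: str


def diff_relations(prev: set, curr: set) -> tuple:
    """Return (added, removed) relations between two scene graphs."""
    return curr - prev, prev - curr


def should_switch(prev: set, curr: set) -> bool:
    """Signal a segment boundary whenever any symbolic relation changes."""
    added, removed = diff_relations(prev, curr)
    return bool(added or removed)


g0 = {Relation("cup", "on", "table")}
g1 = {Relation("cup", "in", "left_gripper")}
print(should_switch(g0, g1))  # a grasp changed the graph, so switch
```

In a fuller system, the detected boundary would index into the key scene graph whose continuous decision variables condition the switching-condition generator; here only the symbolic trigger is shown.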