ManiLong-Shot: Interaction-Aware One-Shot Imitation Learning for Long-Horizon Manipulation

📅 2025-12-18
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing one-shot imitation learning (OSIL) methods struggle to generalize to long-horizon dexterous manipulation tasks. This paper proposes an interaction-aware primitive decomposition paradigm: it decomposes tasks into reusable manipulation primitives delimited by physical interaction events; combines vision-language models (VLMs) with state-based heuristics for dual-path high-level planning; and generates end-effector poses via cross-view critical-region detection and spatial correspondence learning. By decoupling task execution from continuous trajectory imitation, the method generalizes from short-horizon training tasks to unseen long-horizon tasks given a single demonstration. In simulation, training on only 10 short tasks yields a 22.8% relative improvement over the SOTA across 20 unseen long-horizon tasks. On a real robot, it successfully executes three categories of complex, long-horizon dexterous manipulations.

📝 Abstract
One-shot imitation learning (OSIL) offers a promising way to teach robots new skills without large-scale data collection. However, current OSIL methods are primarily limited to short-horizon tasks, which restricts their applicability to complex, long-horizon manipulations. To address this limitation, we propose ManiLong-Shot, a novel framework that enables effective OSIL for long-horizon prehensile manipulation tasks. ManiLong-Shot structures long-horizon tasks around physical interaction events, reframing the problem as sequencing interaction-aware primitives instead of directly imitating continuous trajectories. This primitive decomposition can be driven by high-level reasoning from a vision-language model (VLM) or by rule-based heuristics derived from robot state changes. For each primitive, ManiLong-Shot predicts invariant regions critical to the interaction, establishes correspondences between the demonstration and the current observation, and computes the target end-effector pose, enabling effective task execution. Extensive simulation experiments show that ManiLong-Shot, trained on only 10 short-horizon tasks, generalizes to 20 unseen long-horizon tasks across three difficulty levels via one-shot imitation, achieving a 22.8% relative improvement over the SOTA. Additionally, real-robot experiments validate ManiLong-Shot's ability to robustly execute three long-horizon manipulation tasks via OSIL, confirming its practical applicability.
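The "rule-based heuristics derived from robot state changes" lend themselves to a compact illustration: interaction events such as gripper open/close or contact onset naturally split a demonstration into primitives. The sketch below is a minimal guess at such a rule set in Python (numpy only); the function name, the choice of signals, and the rules themselves are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def segment_by_state_changes(gripper_state, contact):
    """Heuristic path of the dual-path planner (hypothetical rule set,
    not the paper's actual heuristics): split the demonstration at
    frames where the gripper opens/closes or contact begins/ends,
    since these mark physical interaction events.

    gripper_state: (T,) binary array, 1 = gripper closed
    contact:       (T,) binary array, 1 = in contact with an object
    Returns a list of (start, end) frame-index pairs, one per primitive.
    """
    events = np.flatnonzero(
        (np.diff(gripper_state) != 0) | (np.diff(contact) != 0)) + 1
    bounds = np.concatenate(([0], events, [len(gripper_state)]))
    return list(zip(bounds[:-1], bounds[1:]))

# Toy demo: approach, grasp-and-transport, release.
gripper = np.array([0, 0, 1, 1, 1, 1, 0, 0])
contact = np.array([0, 0, 1, 1, 1, 1, 0, 0])
print(segment_by_state_changes(gripper, contact))
# [(0, 2), (2, 6), (6, 8)]
```

Each (start, end) segment would then be handed to the low-level module that detects interaction-critical regions and computes the target pose for that primitive.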
Problem

Research questions and friction points this paper is trying to address.

How to enable one-shot imitation learning for long-horizon manipulation tasks.
How to decompose tasks into interaction-aware primitives instead of imitating continuous trajectories.
How to generalize to unseen tasks from minimal training data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes long tasks into interaction-aware primitive sequences.
Uses vision-language models to guide primitive decomposition.
Predicts invariant regions and correspondences for pose computation (see the sketch below).
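For the last point, a standard way to turn matched invariant-region points into a target end-effector pose is an orthogonal-Procrustes (Kabsch) fit: estimate the rigid transform aligning the demonstration points to the current observation, then apply it to the demonstrated pose. The paper does not spell out its pose solver here, so treat this numpy sketch as one plausible realization; `rigid_transform` and the toy data are illustrative only.

```python
import numpy as np

def rigid_transform(demo_pts, live_pts):
    """Least-squares rigid transform (Kabsch) mapping demo points to
    live points. demo_pts, live_pts: (N, 3) arrays of matched 3D points
    sampled from the predicted invariant regions."""
    cd, cl = demo_pts.mean(axis=0), live_pts.mean(axis=0)
    H = (demo_pts - cd).T @ (live_pts - cl)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cl - R @ cd
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T                                       # 4x4 homogeneous transform

# Usage: retarget the demonstrated grasp pose to the current scene.
demo_pts = np.random.rand(8, 3)                    # toy matched region points
live_pts = demo_pts + np.array([0.1, 0.0, 0.05])   # scene shifted by 10/5 cm
T = rigid_transform(demo_pts, live_pts)
demo_ee_pose = np.eye(4)                           # demonstrated pose (toy)
target_ee_pose = T @ demo_ee_pose                  # pose to execute
```

The design choice this illustrates: because the transform is computed per primitive from correspondences, execution never needs the demonstration's continuous trajectory, only its pose at each interaction event.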
Zixuan Chen
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Chongkai Gao
School of Computing, National University of Singapore, Singapore
Lin Shao
School of Computing, National University of Singapore, Singapore
Jieqi Shi
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Intelligence Science and Technology, Nanjing University, Suzhou, China
Jing Huo
Nanjing University
Machine Learning · Computer Vision
Yang Gao
School of Network Security and Information Technology, YiLi Normal University, Xinjiang, China; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Intelligence Science and Technology, Nanjing University, Suzhou, China