🤖 AI Summary
Problem: End-to-end learning for embodied manipulation suffers from a “data explosion” problem: covering new tasks and environments requires prohibitively large amounts of task-specific demonstration data.
Method: This paper proposes a dynamic, atomic-skill-oriented construction paradigm. It uses vision-language planning (VLP) to decompose high-level tasks into subtasks, abstracts those subtasks into reusable atomic skill definitions, and builds each skill through data collection and vision-language-action (VLA) model fine-tuning. A three-wheel update strategy repeats these steps so that the skill library expands as new tasks arrive.
Contribution/Results: This work shifts the focus of embodied manipulation learning from monolithic end-to-end task policies to composable, reusable atomic skills. Real-world experiments indicate that the approach reduces the data required for new tasks while maintaining high manipulation performance, and that it adapts efficiently to new tasks as the skill library grows.
📝 Abstract
Embodied manipulation is a fundamental ability in the realm of embodied artificial intelligence. Although current embodied manipulation models show some generalization in specific settings, they struggle in new environments and tasks due to the complexity and diversity of real-world scenarios. The traditional end-to-end approach to data collection and training leads to significant data demands, which we call "data explosion". To address this issue, we introduce a three-wheeled data-driven method to build an atomic skill library. First, we divide tasks into subtasks using Vision-Language Planning (VLP). Then, atomic skill definitions are formed by abstracting the subtasks. Finally, an atomic skill library is constructed via data collection and Vision-Language-Action (VLA) fine-tuning. As the atomic skill library expands dynamically with the three-wheel update strategy, the range of tasks it can cover grows naturally. In this way, our method shifts the focus from end-to-end tasks to atomic skills, significantly reducing data costs while maintaining high performance and enabling efficient adaptation to new tasks. Extensive experiments in real-world settings demonstrate the effectiveness and efficiency of our approach.
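The three steps described in the abstract (decompose, abstract, then collect data and fine-tune) can be illustrated with a toy sketch. This is not the authors' implementation: `SkillLibrary`, `toy_planner`, and `toy_abstractor` are hypothetical names, the planner is a hard-coded lookup standing in for a VLP model, the abstraction rule (take the leading verb) is an assumed heuristic, and adding a name to a set stands in for data collection and VLA fine-tuning. It only shows why the cost of a new task falls as the library grows.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    """Toy atomic skill library; a set of names stands in for fine-tuned VLA skills."""
    skills: set = field(default_factory=set)

    def update(self, task, planner, abstractor):
        """Run one pass of the three steps and return the skills that had to be added."""
        subtasks = planner(task)                     # step 1: VLP-style decomposition (stubbed)
        needed = [abstractor(s) for s in subtasks]   # step 2: abstract subtasks into atomic skills
        missing = []
        for skill in needed:
            if skill not in self.skills and skill not in missing:
                missing.append(skill)
        # step 3: placeholder for data collection and VLA fine-tuning of the missing skills
        self.skills.update(missing)
        return missing


def toy_planner(task):
    """Hypothetical stand-in for a VLP model: a fixed task-to-subtasks lookup."""
    plans = {
        "put the cup on the shelf": ["pick up the cup", "place the cup on the shelf"],
        "put the apple in the bowl": ["pick up the apple", "place the apple in the bowl"],
        "pour water into the cup": [
            "pick up the bottle",
            "pour water into the cup",
            "place the bottle on the table",
        ],
    }
    return plans[task]


def toy_abstractor(subtask):
    """Assumed heuristic: the leading verb of a subtask names its atomic skill."""
    return subtask.split()[0]


library = SkillLibrary()
new_1 = library.update("put the cup on the shelf", toy_planner, toy_abstractor)
new_2 = library.update("put the apple in the bowl", toy_planner, toy_abstractor)
new_3 = library.update("pour water into the cup", toy_planner, toy_abstractor)
print(new_1, new_2, new_3)
```

In this toy run the first task requires two new skills, the second requires none because it reuses them, and the third requires only one additional skill.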