Learning Semantic Atomic Skills for Multi-Task Robotic Manipulation

📅 2025-12-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In multi-task robotic imitation learning, suboptimal demonstrations, trajectory noise, and behavioral multimodality hinder generalization, while existing skill-based approaches suffer from semantic fragmentation and poor cross-task reusability due to fixed segmentation or reliance on environment-specific priors. To address these challenges, this paper proposes a semantic-consistent, temporally coherent, variable-length atomic skill modeling paradigm. Key contributions include: (1) the first semantic atomic skill library built via gripper-state keyframe detection jointly annotated with vision-language models (VLMs); and (2) a novel action generation module incorporating "key-pose imagination" to jointly model long-horizon goal-directed reasoning and fine-grained motion control. End-to-end skill composition is achieved via contrastive learning and skill embedding representation. Experiments in simulation and on real robotic platforms demonstrate significant improvements in robustness, cross-task generalization, and long-sequence skill chaining capability.
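The gripper-state keyframe detection described above can be illustrated with a minimal sketch: treat timesteps where the binary gripper command flips as skill boundaries, yielding variable-length segments. The function name, the binary-state assumption, and the `min_len` filter are illustrative choices, not details from the paper.

```python
import numpy as np

def segment_by_gripper_state(gripper_states, min_len=2):
    """Split a demonstration into variable-length segments at gripper
    open/close transitions (simplified keyframe detection sketch).

    gripper_states: sequence of binary gripper commands per timestep.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    states = np.asarray(gripper_states)
    # Keyframes are the timesteps where the gripper state changes.
    change_points = np.where(states[1:] != states[:-1])[0] + 1
    boundaries = [0, *change_points.tolist(), len(states)]
    # Keep only segments long enough to be a meaningful atomic skill.
    return [(s, e) for s, e in zip(boundaries[:-1], boundaries[1:])
            if e - s >= min_len]
```

For example, a demonstration whose gripper closes at step 3 and reopens at step 5 splits into three candidate skill segments (reach, grasp, retract).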

๐Ÿ“ Abstract
While imitation learning has shown impressive results in single-task robot manipulation, scaling it to multi-task settings remains a fundamental challenge due to issues such as suboptimal demonstrations, trajectory noise, and behavioral multi-modality. Existing skill-based methods attempt to address this by decomposing actions into reusable abstractions, but they often rely on fixed-length segmentation or environmental priors that limit semantic consistency and cross-task generalization. In this work, we propose AtomSkill, a novel multi-task imitation learning framework that learns and leverages a structured Atomic Skill Space for composable robot manipulation. Our approach is built on two key technical contributions. First, we construct a Semantically Grounded Atomic Skill Library by partitioning demonstrations into variable-length skills using gripper-state keyframe detection and vision-language model annotation. A contrastive learning objective ensures the resulting skill embeddings are both semantically consistent and temporally coherent. Second, we propose an Action Generation module with Keypose Imagination, which jointly predicts a skill's long-horizon terminal keypose and its immediate action sequence. This enables the policy to reason about overarching motion goals and fine-grained control simultaneously, facilitating robust skill chaining. Extensive experiments in simulated and real-world environments show that AtomSkill consistently outperforms state-of-the-art methods across diverse manipulation tasks.
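The abstract's contrastive objective for skill embeddings can be sketched with a standard InfoNCE-style loss, which is a common stand-in for this kind of objective (the paper's exact formulation may differ): embeddings of segments with the same semantic label are pulled together, while embeddings of different skills are pushed apart.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss over skill embeddings (a generic
    sketch, not the paper's exact objective).

    anchors, positives: (N, D) batches of L2-normalised embeddings,
    where row i of `positives` is the positive pair for row i of
    `anchors`; all other rows act as negatives.
    """
    logits = anchors @ positives.T / temperature      # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matched pair on the diagonal as the target.
    return -np.mean(np.diag(log_probs))
```

Minimising this loss makes embeddings of the same atomic skill semantically consistent across tasks, which is what enables reuse and chaining downstream.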
Problem

Research questions and friction points this paper is trying to address.

Learning reusable semantic atomic skills for multi-task robotic manipulation
Addressing suboptimal demonstrations and behavioral multi-modality in imitation learning
Enhancing cross-task generalization with variable-length skill segmentation and keypose prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variable-length semantic skill library construction
Contrastive learning for skill embedding consistency
Keypose imagination for long-horizon action generation
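The keypose-imagination idea above amounts to a two-branch action head: from a shared feature, one branch "imagines" the skill's long-horizon terminal keypose while the other decodes the immediate action chunk. The sketch below is purely structural with random linear weights; the class and dimension names are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class KeyposeImaginationHead:
    """Structural sketch of a two-branch head (names hypothetical):
    jointly predicts a terminal keypose and a short action sequence
    from one shared feature vector."""

    def __init__(self, feat_dim, pose_dim=7, horizon=8, act_dim=7):
        # Random linear projections stand in for learned decoder branches.
        self.W_pose = rng.normal(0.0, 0.02, (feat_dim, pose_dim))
        self.W_act = rng.normal(0.0, 0.02, (feat_dim, horizon * act_dim))
        self.horizon, self.act_dim = horizon, act_dim

    def forward(self, feat):
        keypose = feat @ self.W_pose                  # long-horizon goal
        actions = (feat @ self.W_act).reshape(        # immediate motions
            self.horizon, self.act_dim)
        return keypose, actions
```

Predicting both outputs from the same feature is what lets the policy reason about the overarching motion goal and fine-grained control at once, supporting robust skill chaining.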