S$^2$-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation

📅 2025-02-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Robot manipulation skills often suffer from limited generalization across objects of the same category due to reliance on instance-specific training data. To address this, we propose an open-vocabulary spatial-semantic diffusion framework—the first to enable skill transfer from instance-level to category-level manipulation. Our method decouples spatial-geometric and functional-semantic representations, operating solely on a single RGB image input. It integrates a diffusion model, a promptable semantic module, a depth estimation network, and a spatial feature encoder—requiring neither multi-view inputs nor camera calibration. Evaluated in both simulation and on real robotic platforms, the approach demonstrates robustness to intra-category variations in appearance, scale, and pose. It successfully executes manipulation tasks—including grasping and placing—on unseen objects within trained categories, achieving significantly higher generalization accuracy than existing baselines.

Technology Category

Application Category

📝 Abstract
Recent advances in skill learning has propelled robot manipulation to new heights by enabling it to learn complex manipulation tasks from a practical number of demonstrations. However, these skills are often limited to the particular action, object, and environment extit{instances} that are shown in the training data, and have trouble transferring to other instances of the same category. In this work we present an open-vocabulary Spatial-Semantic Diffusion policy (S$^2$-Diffusion) which enables generalization from instance-level training data to category-level, enabling skills to be transferable between instances of the same category. We show that functional aspects of skills can be captured via a promptable semantic module combined with a spatial representation. We further propose leveraging depth estimation networks to allow the use of only a single RGB camera. Our approach is evaluated and compared on a diverse number of robot manipulation tasks, both in simulation and in the real world. Our results show that S$^2$-Diffusion is invariant to changes in category-irrelevant factors as well as enables satisfying performance on other instances within the same category, even if it was not trained on that specific instance. Full videos of all real-world experiments are available in the supplementary material.
Problem

Research questions and friction points this paper is trying to address.

Generalizes robot skills from instance to category level.
Uses semantic and spatial modules for skill transferability.
Enables single RGB camera use via depth estimation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-Semantic Diffusion policy
Promptable semantic module
Depth estimation networks