🤖 AI Summary
This work addresses the challenge of learning fine-grained manipulation skills from coarse-grained demonstrations. We propose DiSPo, a granularity-adaptable, memory-efficient action generation framework. Methodologically, we combine a diffusion model with a state-space model (Mamba) and design a step-scaling mechanism, so that action precision can be adjusted in end-to-end imitation learning without fine-grained demonstrations or external interpolation models. Our contributions are threefold: (1) adjustable action-generation granularity; (2) memory-efficient learning, with more efficient inference obtained by generating coarse motions in less critical regions; and (3) success rates up to 81% higher than baselines across three coarse-to-fine benchmark tasks. We further demonstrate the scalability of the generated actions on both simulated and real-world robotic manipulation tasks.
📝 Abstract
We address the problem of learning coarse-to-fine skills from demonstrations (LfD). To scale precision, traditional LfD approaches often rely on extensive fine-grained demonstrations, combined with external interpolation or dynamics models that have limited generalization capabilities. For memory-efficient learning and convenient changes of granularity, we propose a novel diffusion-SSM-based policy (DiSPo) that learns from diverse coarse skills and produces actions at varying control scales by leveraging a state-space model (SSM), Mamba. Our evaluations show that adopting Mamba together with the proposed step-scaling method enables DiSPo to outperform baselines on three coarse-to-fine benchmark tests, with success rates up to 81% higher. In addition, DiSPo improves inference efficiency by generating coarse motions in less critical regions. Finally, we demonstrate the scalability of its actions on simulated and real-world manipulation tasks.
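The abstract does not spell out how step-scaling works, so the following is only an illustrative sketch and not the DiSPo implementation. It assumes that changing the control scale corresponds to rescaling the discretization step of a state-space model, and it uses a hypothetical scalar SSM x'(t) = a·x(t) + b·u(t) with zero-order-hold discretization. All names and values (`zoh_discretize`, `rollout`, `a`, `b`, `delta`, `scale`) are made up for illustration.

```python
import math

def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization of x'(t) = a*x(t) + b*u(t), a != 0."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def rollout(a, b, delta, inputs, x0=0.0):
    """Run the discretized recurrence x_k = a_bar*x_{k-1} + b_bar*u_k."""
    a_bar, b_bar = zoh_discretize(a, b, delta)
    x, states = x0, []
    for u in inputs:
        x = a_bar * x + b_bar * u
        states.append(x)
    return states

# Hypothetical parameters, chosen only for the demonstration.
a, b = -1.5, 1.0
delta = 0.4                      # coarse step size
scale = 4                        # fine steps emitted per coarse step
coarse_inputs = [1.0, 0.5, -0.2]

# Hold each coarse input constant over `scale` fine steps of size delta/scale.
fine_inputs = [u for u in coarse_inputs for _ in range(scale)]

coarse = rollout(a, b, delta, coarse_inputs)
fine = rollout(a, b, delta / scale, fine_inputs)
```

Under these assumptions, every `scale`-th fine state (`fine[scale - 1::scale]`) coincides with the corresponding coarse state up to floating-point error, while the intermediate fine states fill in the trajectory between coarse steps. This is the sense in which a single set of SSM parameters can be evaluated at several time resolutions.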