🤖 AI Summary
This work addresses imitation learning for contact-rich manipulation, where partial observability, discontinuous contact dynamics, and strict safety constraints mean that no single sensing modality suffices, and where existing multimodal fusion schemes place no explicit prior on how attention is distributed. The authors propose Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that uses entropy-regularized optimal transport to align sub-queries derived from force and proprioceptive pose with visual image patches, paired with a diffusion policy that generates action sequences. Replacing conventional softmax attention with optimal-transport attention under explicit marginal constraints adds a structured inductive bias suited to contact-rich tasks, which improves robustness to changes in illumination, distractors, and partial occlusion, and makes the attention easier to interpret. In real-robot experiments, SO-TA reaches 100% success on tight peg-in-hole insertion (versus 93% for a cross-attention baseline at matched capacity) and retains 82.5% success under perturbations (versus 43.5% for a concatenation baseline), while providing interpretable, phase-dependent diagnostics.
📝 Abstract
Contact-rich manipulation tasks such as tight-clearance insertion, connector mating, polishing, and surface-conforming wiping remain difficult for data-driven controllers because they couple discontinuous contact dynamics, partial observability, and strict safety constraints. No single sensing modality suffices: vision supplies global context before contact, force/torque (F/T) feedback governs interaction after contact, and proprioceptive pose provides a consistent kinematic backbone. Most prior imitation-learning policies for contact-rich tasks operate on uni- or bi-modal signals, and the few that fuse three modalities typically adopt off-the-shelf attention modules with no explicit prior on how attention mass should be distributed across task-relevant regions. We present Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that replaces softmax-normalized patch attention by an entropy-regularized Optimal Transport (OT) alignment between force-pose-derived sub-queries and visual patches. Explicit marginal constraints act as a structured inductive bias for contact-rich tasks, encouraging conditioning-aware spatial selection that is stable across illumination, distractors, and partial occlusion. SO-TA is paired with a diffusion-based sequence policy mapping observation windows to pose-action chunks. We evaluate SO-TA on three real-robot tasks: tight peg-in-hole assembly, BCM wiring-connector insertion, and curved-surface mark erasing. With ~200 rollouts per condition, SO-TA reaches 100% success on tight peg-in-hole versus 93% for cross-attention at matched capacity, and retains 82.5% success under illumination, distractor, and partial-occlusion perturbations where a concatenation baseline drops to 43.5%. OT-derived patch heatmaps and leave-one-out modality-influence ratios provide interpretable, phase-dependent diagnostics.
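To make the contrast with softmax attention concrete, here is a minimal NumPy sketch of entropy-regularized OT attention computed with log-domain Sinkhorn iterations. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the cost function, the marginals, the regularization strength, or the iteration count, so the negative scaled dot-product cost, the uniform default marginals, and the names `sinkhorn_attention`, `eps`, and `n_iters` are all hypothetical.

```python
import numpy as np


def _logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def sinkhorn_attention(queries, keys, row_marginal=None, col_marginal=None,
                       eps=1.0, n_iters=100):
    """Return an OT plan of shape (n_queries, n_patches).

    queries: (n_queries, d) sub-query features (assumed: force/pose-derived).
    keys:    (n_patches, d) visual patch features.
    Marginals must be strictly positive and each sum to 1; uniform by default
    (an assumption; the paper may use task-specific marginals).
    """
    n_q, d = queries.shape
    n_k = keys.shape[0]
    a = np.full(n_q, 1.0 / n_q) if row_marginal is None else np.asarray(row_marginal)
    b = np.full(n_k, 1.0 / n_k) if col_marginal is None else np.asarray(col_marginal)

    # Assumed cost: negative scaled dot-product similarity.
    cost = -(queries @ keys.T) / np.sqrt(d)
    log_kernel = -cost / eps  # log of the Gibbs kernel exp(-cost / eps)

    # Dual potentials, updated alternately to match each marginal.
    f = np.zeros(n_q)
    g = np.zeros(n_k)
    for _ in range(n_iters):
        f = np.log(a) - _logsumexp(log_kernel + g[None, :], axis=1)
        g = np.log(b) - _logsumexp(log_kernel + f[:, None], axis=0)

    # Transport plan: entry (i, j) is the mass sub-query i sends to patch j.
    return np.exp(log_kernel + f[:, None] + g[None, :])
```

Softmax attention normalizes each query's weights independently, so nothing stops every query from concentrating on the same patch. The transport plan instead satisfies both a per-query (row) and a per-patch (column) mass constraint, which is the kind of explicit marginal constraint the abstract describes. Dividing each row of the plan by its row marginal gives weights that sum to one per query and can be used in place of softmax weights.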