🤖 AI Summary
This work addresses “last-millimeter” feedback failure in precision assembly, where visual occlusion from the end-effector and workpiece breaks the feedback loop, by proposing a vision-tactile imitation learning approach. The method combines a Transformer-based bidirectional visual-tactile cross-attention mechanism, a proprioceptive gating network, and a tactile reconstruction loss, dynamically increasing reliance on tactile feedback when vision is compromised while guiding the model to learn task-relevant contact features. Evaluated on the NIST Assembly Task Board M1 benchmark, the approach achieves a 90% success rate on peg-in-hole insertion and maintains 80% success even at an industrial-grade 0.1 mm clearance, substantially outperforming vision-only and generalist multimodal baselines.
📝 Abstract
Precision assembly requires sub-millimeter corrections in contact-rich "last-millimeter" regions where visual feedback fails due to occlusion from the end-effector and workpiece. We present ReTac-ACT (Reconstruction-enhanced Tactile ACT), a vision-tactile imitation learning policy that addresses this challenge through three synergistic mechanisms: (i) bidirectional cross-attention enabling reciprocal visuo-tactile feature enhancement before fusion, (ii) a proprioception-conditioned gating network that dynamically elevates tactile reliance when visual occlusion occurs, and (iii) a tactile reconstruction objective that enforces learning of manipulation-relevant contact information rather than generic visual textures. Evaluated on the standardized NIST Assembly Task Board M1 benchmark, ReTac-ACT achieves 90% peg-in-hole success, substantially outperforming vision-only and generalist baseline methods, and maintains 80% success at an industrial-grade 0.1 mm clearance. Ablation studies confirm that each architectural component is indispensable. The ReTac-ACT codebase and a vision-tactile demonstration dataset spanning multiple clearance levels will be released to support reproducible research.
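
The abstract names three architectural mechanisms. As a reading aid, below is a minimal PyTorch-style sketch of how such components could be wired together; every module name, dimension, the softmax gating design, and the mean-squared reconstruction target are illustrative assumptions, not the authors' released ReTac-ACT implementation.

```python
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    """Visual and tactile token streams attend to each other before fusion."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.vis_from_tac = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tac_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, tac):
        # Each modality is enhanced with context queried from the other.
        vis_enh, _ = self.vis_from_tac(vis, tac, tac)
        tac_enh, _ = self.tac_from_vis(tac, vis, vis)
        return vis + vis_enh, tac + tac_enh


class ProprioGatedFusion(nn.Module):
    """Proprioception-conditioned gate that re-weights the two modalities,
    e.g. raising the tactile weight when the arm pose implies occlusion."""
    def __init__(self, proprio_dim=14):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(proprio_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 2), nn.Softmax(dim=-1))

    def forward(self, vis, tac, proprio):
        w = self.gate(proprio)  # (B, 2) modality weights summing to 1
        # Scale each modality's tokens, then hand the joint sequence to the policy decoder.
        return torch.cat([w[:, :1, None] * vis, w[:, 1:, None] * tac], dim=1)


class TactileReconstructionHead(nn.Module):
    """Auxiliary decoder whose loss pushes the tactile encoding to retain
    contact-relevant signal rather than generic visual texture."""
    def __init__(self, dim=256, tactile_dim=512):
        super().__init__()
        self.decoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                     nn.Linear(256, tactile_dim))

    def forward(self, tac_tokens, tac_target):
        recon = self.decoder(tac_tokens.mean(dim=1))  # pool tokens, decode raw signal
        return nn.functional.mse_loss(recon, tac_target)


# Shape check: batch of 2, 16 visual tokens, 4 tactile tokens.
vis, tac = torch.randn(2, 16, 256), torch.randn(2, 4, 256)
proprio, tac_raw = torch.randn(2, 14), torch.randn(2, 512)
vis_e, tac_e = BidirectionalCrossAttention()(vis, tac)
fused = ProprioGatedFusion()(vis_e, tac_e, proprio)     # (2, 20, 256) tokens for the policy decoder
aux_loss = TactileReconstructionHead()(tac_e, tac_raw)  # added to the imitation (action) loss
```

In an ACT-style policy, the fused token sequence would condition a Transformer decoder that predicts action chunks, with the reconstruction term added to the imitation loss as an auxiliary objective; the exact fusion point and loss weighting here are assumptions.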