ReTac-ACT: A State-Gated Vision-Tactile Fusion Transformer for Precision Assembly

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of “last-millimeter” feedback failure in precision assembly caused by visual occlusion by proposing a vision–tactile imitation learning approach. The method integrates a Transformer-based bidirectional visual–tactile cross-attention mechanism, a proprioceptive gating network, and a tactile reconstruction loss to dynamically enhance the dominance of tactile feedback when vision is compromised, while guiding the model to learn task-relevant contact features. Evaluated on the NIST M1 benchmark, the approach achieves a 90% success rate in peg-in-hole insertion tasks and maintains an 80% success rate even under industrial-grade tolerances with a 0.1 mm clearance, significantly outperforming purely visual and general multimodal baselines.

📝 Abstract
Precision assembly requires sub-millimeter corrections in contact-rich "last-millimeter" regions where visual feedback fails due to occlusion from the end-effector and workpiece. We present ReTac-ACT (Reconstruction-enhanced Tactile ACT), a vision-tactile imitation learning policy that addresses this challenge through three synergistic mechanisms: (i) bidirectional cross-attention enabling reciprocal visuo-tactile feature enhancement before fusion, (ii) a proprioception-conditioned gating network that dynamically elevates tactile reliance when visual occlusion occurs, and (iii) a tactile reconstruction objective enforcing learning of manipulation-relevant contact information rather than generic visual textures. Evaluated on the standardized NIST Assembly Task Board M1 benchmark, ReTac-ACT achieves 90% peg-in-hole success, substantially outperforming vision-only and generalist baseline methods, and maintains 80% success at industrial-grade 0.1 mm clearance. Ablation studies validate that each architectural component is indispensable. The ReTac-ACT codebase and a vision-tactile demonstration dataset covering various clearance levels with both visual and tactile features will be released to support reproducible research.
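To make mechanism (ii) concrete, the sketch below shows one plausible form of a proprioception-conditioned gating network: a small MLP maps the robot's proprioceptive state to a scalar gate that reweights tactile versus visual features before fusion, so tactile reliance can rise when the state suggests the camera view is occluded. This is a minimal illustration, not the paper's implementation — all layer sizes, the 14-dimensional proprioceptive input, and the convex-combination fusion rule are assumptions for the example.

```python
import torch
import torch.nn as nn


class StateGatedFusion(nn.Module):
    """Illustrative state-gated fusion: a proprioception-conditioned MLP
    produces a gate g in (0, 1) that trades off visual vs. tactile features.
    Layer sizes and the fusion rule are assumptions, not from the paper."""

    def __init__(self, proprio_dim: int, hidden: int = 64):
        super().__init__()
        self.gate_mlp = nn.Sequential(
            nn.Linear(proprio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # keeps the gate strictly inside (0, 1)
        )

    def forward(self, vis_feat: torch.Tensor, tac_feat: torch.Tensor,
                proprio: torch.Tensor) -> torch.Tensor:
        g = self.gate_mlp(proprio)                 # (B, 1) tactile weight
        # Convex combination: g -> 1 shifts reliance toward touch.
        return (1.0 - g) * vis_feat + g * tac_feat


# Hypothetical dimensions: 128-d modality features, 14-d proprioceptive state.
fusion = StateGatedFusion(proprio_dim=14)
vis = torch.randn(2, 128)
tac = torch.randn(2, 128)
state = torch.randn(2, 14)
fused = fusion(vis, tac, state)
print(fused.shape)  # torch.Size([2, 128])
```

Because the gate is a learned function of proprioception rather than a fixed weight, the same policy can behave vision-dominant in free space and touch-dominant once the end-effector enters the occluded contact region.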
Problem

Research questions and friction points this paper is trying to address.

precision assembly
visual occlusion
tactile feedback
last-millimeter
sub-millimeter correction
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-tactile fusion
cross-attention
tactile reconstruction
state-gated network
precision assembly
Minchi Ruan
Beijing University of Posts and Telecommunications, Beijing, China; SunHDex Intelligent Technology (Beijing) Co., Ltd., Beijing, China
LiangQing Zhou
Beijing University of Posts and Telecommunications, Beijing, China; SunHDex Intelligent Technology (Beijing) Co., Ltd., Beijing, China
Hongtong Li
Beijing University of Posts and Telecommunications, Beijing, China
Zongtao Wang
SunHDex Intelligent Technology (Beijing) Co., Ltd., Beijing, China
ZhaoMing Lu
Beijing University of Posts and Telecommunications, Beijing, China
Jianwei Zhang
University of Hamburg, Hamburg, Germany
Bin Fang
Beijing University of Posts and Telecommunications / Tsinghua University
Robotics and AI