🤖 AI Summary
To address the poor robustness of robotic manipulation under visually degraded conditions (e.g., occlusion, blur) and the performance bottlenecks of imitation learning, this paper proposes a vision–tactile fusion policy learning framework. Methodologically, it introduces (1) a dual-channel GelSight tactile representation encoding both texture-geometry and dynamic interaction features, and (2) a vision-dominant cross-modal cross-attention fusion mechanism that enables tactile signals to compensate for visual deficiencies. Evaluated on surface wiping, peg insertion, and fragile object grasping-and-placing tasks, the framework achieves success rates of 92.3%–96.7%, outperforming the best baseline by an average of 11.5%. These results demonstrate substantial mitigation of visual uncertainty and validate the framework’s tactile-guided cross-modal generalization capability.
📝 Abstract
Visuotactile sensing offers rich contact information that can help mitigate performance bottlenecks in imitation learning, particularly under vision-limited conditions, such as ambiguous visual cues or occlusions. Effectively fusing visual and visuotactile modalities, however, presents ongoing challenges. We introduce GelFusion, a framework designed to enhance policies by integrating visuotactile feedback, specifically from high-resolution GelSight sensors. GelFusion using a vision-dominated cross-attention fusion mechanism incorporates visuotactile information into policy learning. To better provide rich contact information, the framework's core component is our dual-channel visuotactile feature representation, simultaneously leveraging both texture-geometric and dynamic interaction features. We evaluated GelFusion on three contact-rich tasks: surface wiping, peg insertion, and fragile object pick-and-place. Outperforming baselines, GelFusion shows the value of its structure in improving the success rate of policy learning.