🤖 AI Summary
This work addresses the challenge of accurately estimating the 6D pose of grasped objects under severe occlusion, where vision-only approaches often fail. To overcome this limitation, we propose a multimodal method that fuses visual and fingertip tactile sensing. Tactile signals are uniformly represented as contact point clouds, and a pixel-wise dense visual-tactile feature fusion network is designed to enable high-precision pose estimation. To facilitate training, we extend a synthetic data generation pipeline based on NVIDIA's Deep Learning Dataset Synthesizer to jointly produce photo-realistic RGB images and corresponding tactile point clouds. Experimental results on a real robotic platform demonstrate that our approach significantly outperforms vision-only baselines, and that the model trained on synthetic data generalizes effectively to real-world scenarios.
📝 Abstract
Knowledge of the 6D pose of an object can benefit in-hand object manipulation. Existing 6D pose estimation methods use vision data. In-hand 6D object pose estimation is challenging because of the heavy occlusion produced by the robot's grippers, which can have an adverse effect on methods that rely on vision data only. Many robots are equipped with tactile sensors at their fingertips that could be used to complement vision data. In this paper, we present a method that uses both tactile and vision data to estimate the pose of an object grasped in a robot's hand. The main challenges of this research include 1) the lack of a standard representation for tactile sensor data, 2) the fusion of sensor data from heterogeneous sources (vision and tactile), and 3) the need for large training datasets. To address these challenges, first, we propose the use of point clouds to represent object surfaces that are in contact with the tactile sensor. Second, we present a network architecture based on pixel-wise dense fusion to fuse vision and tactile data to estimate the 6D pose of an object. Third, we extend NVIDIA's Deep Learning Dataset Synthesizer to produce synthetic photo-realistic vision data and the corresponding tactile point clouds for 11 objects from the YCB Object and Model Set in Unreal Engine 4. We present results of simulated experiments suggesting that using tactile data in addition to vision data improves the 6D pose estimate of an in-hand object. We also present qualitative results of experiments in which we deploy our network on real physical robots, showing successful transfer of a network trained on synthetic data to a real system.
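To make the pixel-wise dense fusion idea concrete, the sketch below shows one common way such fusion is done (in the style of DenseFusion-type architectures): each tactile contact point is paired with the visual feature at its projected pixel, and a pooled global feature is appended to every point. This is a minimal NumPy illustration under assumed shapes and names (`fuse_features`, `rgb_feat`, `pixel_uv` are hypothetical), not the paper's actual network.

```python
import numpy as np

def fuse_features(rgb_feat, contact_pts, pixel_uv):
    """Pixel-wise dense fusion sketch (assumed interface, not the paper's code).

    rgb_feat:    (H, W, C) per-pixel visual feature map
    contact_pts: (N, 3) tactile contact point cloud in the camera frame
    pixel_uv:    (N, 2) integer (u, v) pixel coordinates where each
                 contact point projects into the image
    returns:     (N, 2*(C+3)) fused per-point features
    """
    # Sample the visual feature at each contact point's projected pixel.
    vis = rgb_feat[pixel_uv[:, 1], pixel_uv[:, 0]]          # (N, C)
    # Concatenate visual and geometric (point) features per point.
    per_point = np.concatenate([vis, contact_pts], axis=1)  # (N, C+3)
    # Append a pooled global feature to every point, as in dense fusion.
    glob = per_point.mean(axis=0, keepdims=True)
    glob = np.repeat(glob, len(contact_pts), axis=0)        # (N, C+3)
    return np.concatenate([per_point, glob], axis=1)        # (N, 2*(C+3))

# Toy usage with random data
H, W, C, N = 8, 8, 4, 5
fused = fuse_features(np.random.rand(H, W, C),
                      np.random.rand(N, 3),
                      np.random.randint(0, 8, size=(N, 2)))
print(fused.shape)  # (5, 14)
```

In a full system, a pose-regression head would consume these per-point fused features to predict a 6D pose per point, with the final estimate chosen by a confidence score.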