🤖 AI Summary
High-fidelity 6D pose estimation of in-hand objects is hindered by the scarcity of large-scale, high-quality real-world multimodal datasets. Method: We introduce VinT-6D—the first large-scale multimodal dataset tailored for humanoid dexterous manipulation—comprising 2 million synthetic and 100,000 real-world samples. It features the first fully synchronized acquisition of whole-hand tactile (GelSight), visual, and proprioceptive data with high spatiotemporal alignment, enabled by a custom hardware platform and co-simulation in MuJoCo/Blender. Contribution/Results: We propose a cross-modal calibration and alignment framework; VinT-Real is currently the largest real-world multimodal hand dataset, substantially narrowing the Sim2Real gap. We establish a multimodal fusion benchmark that achieves significant accuracy improvements over unimodal baselines on in-hand object 6D pose estimation. The dataset is publicly released.
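To give a rough sense of what cross-modal alignment between touch, vision, and proprioception involves, the sketch below chains calibrated extrinsics with proprioceptive forward kinematics to express a tactile contact point in the camera frame. This is a minimal illustration under assumed conventions: the frame names, the `fk_fingertip()` stub, and all numeric transforms are hypothetical placeholders, not VinT-6D's actual calibration procedure.

```python
# Hypothetical transform chain for cross-modal alignment: a contact point
# measured in a fingertip tactile-sensor frame is mapped into the camera
# frame via proprioceptive forward kinematics. All frames and numbers are
# illustrative stand-ins, not VinT-6D's calibration pipeline.
import numpy as np

def homogeneous(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def fk_fingertip(joint_angles: np.ndarray) -> np.ndarray:
    """Stand-in forward kinematics: hand base -> fingertip transform.
    A real system would evaluate the hand's full kinematic chain here."""
    return homogeneous(np.eye(3),
                       np.array([0.0, 0.0, 0.1 + 0.01 * joint_angles.sum()]))

# Fixed transforms assumed known from extrinsic calibration (hypothetical values).
T_cam_base = homogeneous(np.eye(3), np.array([0.0, -0.2, 0.5]))     # camera -> hand base
T_tip_sensor = homogeneous(np.eye(3), np.array([0.0, 0.0, 0.005]))  # fingertip -> sensor

def contact_in_camera_frame(joint_angles: np.ndarray, p_sensor: np.ndarray) -> np.ndarray:
    """Map a tactile contact point (sensor frame) into the camera frame."""
    T_cam_sensor = T_cam_base @ fk_fingertip(joint_angles) @ T_tip_sensor
    return (T_cam_sensor @ np.append(p_sensor, 1.0))[:3]

print(contact_in_camera_frame(np.zeros(4), np.array([0.0, 0.0, 0.002])))
```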
📝 Abstract
This paper addresses the scarcity of large-scale datasets for accurate object-in-hand pose estimation, which is crucial for robotic in-hand manipulation within the "Perception-Planning-Control" paradigm. Specifically, we introduce VinT-6D, the first extensive multi-modal dataset integrating vision, touch, and proprioception, to enhance robotic manipulation. VinT-6D comprises two splits: VinT-Sim, with 2 million samples collected via simulation in MuJoCo and Blender, and VinT-Real, with 0.1 million samples collected on a custom-designed real-world platform. The dataset is tailored for robotic hands, providing whole-hand tactile perception and high-quality, well-aligned data. To the best of our knowledge, VinT-Real is the largest dataset of its kind, given the difficulty of real-world collection, and it thereby narrows the simulation-to-real gap left by previous works. Built upon VinT-6D, we present a benchmark method that achieves significant performance improvements by fusing multi-modal information. The project is available at https://VinT-6D.github.io/.
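As a minimal sketch of what "fusing multi-modal information" for 6D pose estimation can look like, the example below concatenates per-modality embeddings and regresses a pose. This is not the paper's benchmark architecture; the input feature dimensions, embedding size, and the choice of a 6D rotation parameterization are all assumptions made for illustration.

```python
# Minimal late-fusion sketch (hypothetical, not VinT-6D's benchmark model):
# per-modality encoders produce embeddings that are concatenated and
# regressed to translation + rotation. Shapes/dims are assumed values.
import torch
import torch.nn as nn

class LateFusionPose(nn.Module):
    def __init__(self, vis_dim=2048, tac_dim=256, prop_dim=22, emb=128):
        super().__init__()
        # Each encoder maps one modality's features to a shared embedding size.
        self.vis = nn.Sequential(nn.Linear(vis_dim, emb), nn.ReLU())
        self.tac = nn.Sequential(nn.Linear(tac_dim, emb), nn.ReLU())
        self.prop = nn.Sequential(nn.Linear(prop_dim, emb), nn.ReLU())
        # Head regresses translation (3) plus a 6D rotation parameterization (6).
        self.head = nn.Sequential(nn.Linear(3 * emb, emb), nn.ReLU(),
                                  nn.Linear(emb, 9))

    def forward(self, vis_feat, tac_feat, prop_feat):
        z = torch.cat([self.vis(vis_feat),
                       self.tac(tac_feat),
                       self.prop(prop_feat)], dim=-1)
        out = self.head(z)
        return out[..., :3], out[..., 3:]  # translation, 6D rotation params

# Usage with random stand-in features for a batch of 4 samples.
model = LateFusionPose()
t, r6 = model(torch.randn(4, 2048), torch.randn(4, 256), torch.randn(4, 22))
print(t.shape, r6.shape)  # torch.Size([4, 3]) torch.Size([4, 6])
```

Late fusion is only one design point; the gains reported above could equally come from earlier fusion or attention-based mixing, which the released benchmark should be consulted for.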