VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception

📅 2024-12-31
🏛️ International Conference on Machine Learning
📈 Citations: 1
Influential: 0
🤖 AI Summary
Problem: High-fidelity 6D pose estimation of in-hand objects is hindered by the scarcity of large-scale, high-quality real-world multimodal datasets. Method: We introduce VinT-6D, presented as the first large-scale multimodal dataset tailored for dexterous robotic-hand manipulation, comprising 2 million synthetic and 100,000 real-world samples. It features synchronized acquisition of whole-hand tactile (GelSight), visual, and proprioceptive data with high spatiotemporal alignment, enabled by a custom hardware platform and simulation in MuJoCo and Blender. Contribution/Results: We propose a cross-modal calibration and alignment framework. To the authors' knowledge, VinT-Real is the largest real-world dataset of its kind, which helps narrow the Sim2Real gap. We establish a multimodal fusion benchmark that achieves significant accuracy improvements over unimodal baselines on in-hand object 6D pose estimation. The dataset is publicly released.

📝 Abstract
This paper addresses the scarcity of large-scale datasets for accurate object-in-hand pose estimation, which is crucial for robotic in-hand manipulation within the "Perception-Planning-Control" paradigm. Specifically, we introduce VinT-6D, the first extensive multi-modal dataset integrating vision, touch, and proprioception, to enhance robotic manipulation. VinT-6D comprises a VinT-Sim split of 2 million samples and a VinT-Real split of 0.1 million samples, collected via simulation in MuJoCo and Blender and on a custom-designed real-world platform, respectively. The dataset is tailored for robotic hands, offering models with whole-hand tactile perception and high-quality, well-aligned data. To the best of our knowledge, VinT-Real is the largest dataset of its kind given the difficulty of real-world collection, and it helps bridge the simulation-to-real gap left by previous works. Built upon VinT-6D, we present a benchmark method that shows significant performance improvements by fusing multi-modal information. The project is available at https://VinT-6D.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Hand Pose Estimation
Robot Manipulation
Large-scale Dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal Dataset
Robotic In-hand Manipulation
Vision, Touch, Proprioception
Zhaoliang Wan
Insta360
Generalist Robot Autonomy
Yonggen Ling
Tencent Robotics X
SLAM, VIO, Sensor Fusion, Computer Vision, 3D Reconstruction
Senlin Yi
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Lu Qi
Insta360 | Wuhan University
Computer Vision, Deep Learning
Wangwei Lee
Robotics X, Tencent, Shenzhen, China
Minglei Lu
Robotics X, Tencent, Shenzhen, China
Sicheng Yang
Tencent Robotics X
Robot
Xiao Teng
Robotics X, Tencent, Shenzhen, China
Peng Lu
Robotics X, Tencent, Shenzhen, China
Xu Yang
Chinese Academy of Sciences, Automation Institute, Beijing, China
Ming-Hsuan Yang
University of California at Merced; Google DeepMind
Computer Vision, Machine Learning, Artificial Intelligence
Hui Cheng
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China