🤖 AI Summary
High-fidelity 6D pose estimation of in-hand objects is hindered by the scarcity of large-scale, high-quality real-world multimodal datasets. Method: We introduce VinT-6D—the first large-scale multimodal dataset tailored for humanoid dexterous manipulation—comprising 2 million synthetic and 100,000 real-world samples. It features the first fully synchronized acquisition of whole-hand tactile (GelSight), visual, and proprioceptive data with high spatiotemporal alignment, enabled by a custom hardware platform and co-simulation in MuJoCo/Blender. Contribution/Results: We propose a cross-modal calibration and alignment framework; VinT-Real is currently the largest real-world multimodal hand dataset, substantially narrowing the Sim2Real gap. We establish a multimodal fusion benchmark that achieves significant accuracy improvements over unimodal baselines on in-hand object 6D pose estimation. The dataset is publicly released.
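To give a rough sense of what cross-modal alignment between touch, vision, and proprioception involves, the sketch below chains calibrated extrinsics with proprioceptive forward kinematics to express a tactile contact point in the camera frame. This is a minimal illustration under assumed conventions: the frame names, the `fk_fingertip()` stub, and all numeric transforms are hypothetical placeholders, not VinT-6D's actual calibration procedure.

```python
# Hypothetical transform chain for cross-modal alignment: a contact point
# measured in a fingertip tactile-sensor frame is mapped into the camera
# frame via proprioceptive forward kinematics. All frames and numbers are
# illustrative stand-ins, not VinT-6D's calibration pipeline.
import numpy as np

def homogeneous(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def fk_fingertip(joint_angles: np.ndarray) -> np.ndarray:
    """Stand-in forward kinematics: hand base -> fingertip transform.
    A real system would evaluate the hand's full kinematic chain here."""
    return homogeneous(np.eye(3),
                       np.array([0.0, 0.0, 0.1 + 0.01 * joint_angles.sum()]))

# Fixed transforms assumed known from extrinsic calibration (hypothetical values).
T_cam_base = homogeneous(np.eye(3), np.array([0.0, -0.2, 0.5]))     # camera -> hand base
T_tip_sensor = homogeneous(np.eye(3), np.array([0.0, 0.0, 0.005]))  # fingertip -> sensor

def contact_in_camera_frame(joint_angles: np.ndarray, p_sensor: np.ndarray) -> np.ndarray:
    """Map a tactile contact point (sensor frame) into the camera frame."""
    T_cam_sensor = T_cam_base @ fk_fingertip(joint_angles) @ T_tip_sensor
    return (T_cam_sensor @ np.append(p_sensor, 1.0))[:3]

print(contact_in_camera_frame(np.zeros(4), np.array([0.0, 0.0, 0.002])))
```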
📝 Abstract
This paper addresses the scarcity of large-scale datasets for accurate object-in-hand pose estimation, which is crucial for robotic in-hand manipulation within the "Perception-Planning-Control" paradigm. Specifically, we introduce VinT-6D, the first extensive multi-modal dataset integrating vision, touch, and proprioception, to enhance robotic manipulation. VinT-6D comprises two splits: VinT-Sim, with 2 million samples collected via simulation in MuJoCo and Blender, and VinT-Real, with 0.1 million samples collected on a custom-designed real-world platform. The dataset is tailored for robotic hands, providing whole-hand tactile perception and high-quality, well-aligned data. To the best of our knowledge, VinT-Real is the largest dataset of its kind, given the difficulty of real-world collection, and it thereby narrows the simulation-to-real gap left by previous works. Built upon VinT-6D, we present a benchmark method that achieves significant performance improvements by fusing multi-modal information. The project is available at https://VinT-6D.github.io/.
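As a minimal sketch of what "fusing multi-modal information" for 6D pose estimation can look like, the example below concatenates per-modality embeddings and regresses a pose. This is not the paper's benchmark architecture; the input feature dimensions, embedding size, and the choice of a 6D rotation parameterization are all assumptions made for illustration.

```python
# Minimal late-fusion sketch (hypothetical, not VinT-6D's benchmark model):
# per-modality encoders produce embeddings that are concatenated and
# regressed to translation + rotation. Shapes/dims are assumed values.
import torch
import torch.nn as nn

class LateFusionPose(nn.Module):
    def __init__(self, vis_dim=2048, tac_dim=256, prop_dim=22, emb=128):
        super().__init__()
        # Each encoder maps one modality's features to a shared embedding size.
        self.vis = nn.Sequential(nn.Linear(vis_dim, emb), nn.ReLU())
        self.tac = nn.Sequential(nn.Linear(tac_dim, emb), nn.ReLU())
        self.prop = nn.Sequential(nn.Linear(prop_dim, emb), nn.ReLU())
        # Head regresses translation (3) plus a 6D rotation parameterization (6).
        self.head = nn.Sequential(nn.Linear(3 * emb, emb), nn.ReLU(),
                                  nn.Linear(emb, 9))

    def forward(self, vis_feat, tac_feat, prop_feat):
        z = torch.cat([self.vis(vis_feat),
                       self.tac(tac_feat),
                       self.prop(prop_feat)], dim=-1)
        out = self.head(z)
        return out[..., :3], out[..., 3:]  # translation, 6D rotation params

# Usage with random stand-in features for a batch of 4 samples.
model = LateFusionPose()
t, r6 = model(torch.randn(4, 2048), torch.randn(4, 256), torch.randn(4, 22))
print(t.shape, r6.shape)  # torch.Size([4, 3]) torch.Size([4, 6])
```

Late fusion is only one design point; the gains reported above could equally come from earlier fusion or attention-based mixing, which the released benchmark should be consulted for.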