AI Summary
Existing hand-object interaction reconstruction methods rely solely on vision, limiting their ability to model occlusions and deformable object dynamics. This paper proposes ViTaM-D, the first framework integrating vision with distributed tactile sensing for dynamic, high-fidelity interaction reconstruction. Its key contributions are: (1) DF-Field, a novel implicit field that unifies distributed force perception by jointly encoding contact kinetic and potential energy; (2) HOT, a first-of-its-kind high-fidelity simulation benchmark specifically designed for deformable object interaction; and (3) a synergistic pipeline comprising VDT-Net for initial reconstruction and a Force-aware Optimization (FO) algorithm for refinement. Evaluated on DexYCB and HOT, ViTaM-D reduces hand pose error by 23.6% and improves object deformation reconstruction PSNR by 5.8 dB over state-of-the-art methods including HOTrack and gSDF.
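The DF-Field idea described above can be illustrated with a toy energy computation. This is only a minimal sketch under assumed semantics (per-contact kinetic and elastic potential terms, weighted by sensed force magnitude); the function name, mass, and stiffness constants are hypothetical and not taken from the paper's implementation:

```python
import numpy as np

def df_field_energy(forces, velocities, displacements,
                    mass=1e-3, stiffness=50.0):
    """Toy force-aware contact energy in the spirit of DF-Field.

    forces:        (N, 3) tactile force vectors at N contact points
    velocities:    (N, 3) contact-point velocities
    displacements: (N, 3) local surface deformation vectors
    """
    # Per-contact kinetic energy: 0.5 * m * |v|^2
    kinetic = 0.5 * mass * np.sum(velocities ** 2, axis=1)
    # Per-contact elastic potential energy: 0.5 * k * |d|^2
    potential = 0.5 * stiffness * np.sum(displacements ** 2, axis=1)
    # Weight each contact by its sensed force magnitude so that
    # firmer contacts dominate the refinement objective
    weights = np.linalg.norm(forces, axis=1)
    return float(np.sum(weights * (kinetic + potential)))
```

A refinement stage such as the paper's FO step could, in principle, minimize a term of this form over hand pose and object deformation parameters; the actual optimization objective used by ViTaM-D is not reproduced here.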
Abstract
We present ViTaM-D, a novel visual-tactile framework for dynamic hand-object interaction reconstruction that integrates distributed tactile sensing for more accurate contact modeling. Existing methods focus primarily on visual inputs and struggle to capture detailed contact interactions such as object deformation. Our approach leverages distributed tactile sensors to address this limitation by introducing DF-Field, a distributed force-aware contact representation that models both kinetic and potential energy in hand-object interaction. ViTaM-D first reconstructs hand-object interactions using a visual-only network, VDT-Net, and then refines contact details through a force-aware optimization (FO) process, enhancing object deformation modeling. To benchmark our approach, we introduce the HOT dataset, which features 600 sequences of hand-object interactions, including deformable objects, built in a high-precision simulation environment. Extensive experiments on both the DexYCB and HOT datasets demonstrate significant improvements in accuracy over previous state-of-the-art methods such as gSDF and HOTrack. Our results highlight the superior performance of ViTaM-D in both rigid and deformable object reconstruction, as well as the effectiveness of DF-Field in refining hand poses. This work offers a comprehensive solution to dynamic hand-object interaction reconstruction by seamlessly integrating visual and tactile data. Code, models, and datasets will be made available.