🤖 AI Summary
This work addresses the limitations of existing hand video datasets, which lack physical information such as contact forces and motion dynamics and are highly susceptible to occlusion, hindering accurate modeling of hand–object interactions. To overcome these challenges, the authors introduce a multimodal sensing glove that fuses tactile and IMU signals, and propose Glove2Hand, which combines a 3D Gaussian hand representation, a diffusion-based hand restorer, and neural rendering to translate gloved-hand interaction videos into high-fidelity bare-hand videos while preserving the underlying interaction physics. With this pipeline they build HandSense, the first hand–object interaction dataset pairing glove-to-hand videos with synchronized tactile and IMU signals. The approach significantly improves hand tracking accuracy under severe occlusion and enhances video-based contact force estimation, demonstrating the effectiveness and practical utility of HandSense for downstream tasks.
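As a concrete illustration of what a synchronized tactile–visual–inertial sample could look like, the minimal Python sketch below pairs one video frame with the nearest tactile and IMU readings on a shared clock. All names, shapes, and sampling rates here (`HandSenseSample`, 32 taxels, 200 Hz sensors, 30 Hz video) are illustrative assumptions, not the released HandSense format.

```python
import numpy as np
from dataclasses import dataclass

# Hypothetical layout of one synchronized glove sample; field names, shapes,
# and rates are assumptions for illustration only.
@dataclass
class HandSenseSample:
    rgb: np.ndarray        # (H, W, 3) glove-view video frame
    tactile: np.ndarray    # (n_taxels,) pressure readings from the glove
    imu: np.ndarray        # (6,) accelerometer + gyroscope readings
    timestamp: float       # shared clock, seconds

def align_to_frame(frame_ts: float, sensor_ts: np.ndarray, sensor_vals: np.ndarray) -> np.ndarray:
    """Nearest-timestamp association of a high-rate sensor stream to one video frame."""
    idx = int(np.argmin(np.abs(sensor_ts - frame_ts)))
    return sensor_vals[idx]

# Toy streams: 30 Hz video, 200 Hz tactile/IMU on the same clock.
frame_ts = 1.000
sensor_ts = np.arange(0.0, 2.0, 1.0 / 200.0)
tactile_vals = np.random.rand(len(sensor_ts), 32)   # 32 taxels (assumed)
imu_vals = np.random.randn(len(sensor_ts), 6)

sample = HandSenseSample(
    rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    tactile=align_to_frame(frame_ts, sensor_ts, tactile_vals),
    imu=align_to_frame(frame_ts, sensor_ts, imu_vals),
    timestamp=frame_ts,
)
print(sample.tactile.shape, sample.imu.shape)  # (32,) (6,)
```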
📝 Abstract
Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address these challenges, we present Glove2Hand, a framework that translates HOI videos captured with a multi-modal sensing glove into photorealistic bare-hand videos, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporally consistent rendering. The rendered hand is seamlessly integrated into the scene by a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.
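To make the temporal-consistency idea behind a 3D Gaussian hand model more concrete, the sketch below attaches a fixed set of Gaussian primitives to hand joints in a canonical pose and re-poses them each frame with rigid joint transforms, so the same primitives persist across frames. The joint count, hard per-joint assignment, and transform convention are assumptions chosen for illustration and are not the paper's actual parameterization.

```python
import numpy as np

N_JOINTS = 21   # MANO-style joint count (assumption)
N_GAUSS = 500   # Gaussian primitives distributed over the hand

rng = np.random.default_rng(0)
canon_means = rng.normal(scale=0.01, size=(N_GAUSS, 3))   # canonical-space centers
joint_of = rng.integers(0, N_JOINTS, size=N_GAUSS)        # hard assignment to a joint
colors = rng.uniform(size=(N_GAUSS, 3))                   # per-Gaussian appearance
scales = np.full((N_GAUSS, 3), 0.002)                      # per-Gaussian anisotropic scale
# colors/scales would be consumed by a Gaussian-splatting renderer (omitted here).

def pose_gaussians(joint_R: np.ndarray, joint_t: np.ndarray) -> np.ndarray:
    """Re-pose canonical Gaussian centers with per-joint rigid transforms (R, t)."""
    R = joint_R[joint_of]                  # (N_GAUSS, 3, 3)
    t = joint_t[joint_of]                  # (N_GAUSS, 3)
    return np.einsum('nij,nj->ni', R, canon_means) + t

# Toy per-frame pose: identity rotations, small per-joint translations.
frame_R = np.tile(np.eye(3), (N_JOINTS, 1, 1))
frame_t = rng.normal(scale=0.05, size=(N_JOINTS, 3))
posed_means = pose_gaussians(frame_R, frame_t)
print(posed_means.shape)  # (500, 3): the same primitives, re-posed for this frame
```

Because the primitive set is fixed and only its pose changes per frame, appearance is shared across the sequence, which is one simple way to obtain the kind of temporal rendering consistency the abstract refers to.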