WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild

๐Ÿ“… 2024-09-18
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 8
โœจ Influential: 1
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Addressing the challenge of real-time, high-fidelity 3D hand reconstruction from in-the-wild monocular video, this paper introduces the first end-to-end fully convolutional detection + Transformer-based reconstruction framework that directly predicts 3D hand meshes and poses from single framesโ€”without explicit temporal modeling. To support robust learning under occlusion and complex illumination, we construct a large-scale in-the-wild dataset comprising over two million real-scene hand images and propose a self-supervised geometric consistency constraint. Our method achieves state-of-the-art performance across mainstream 2D and 3D benchmarks, enabling stable, low-latency multi-hand 3D tracking. All code, pretrained models, and the dataset are publicly released.

Technology Category

Application Category

๐Ÿ“ Abstract
In recent years, 3D hand pose estimation methods have garnered significant attention due to their extensive applications in human-computer interaction, virtual reality, and robotics. In contrast, there has been a notable gap in hand detection pipelines, posing significant challenges in constructing effective real-world multi-hand reconstruction systems. In this work, we present a data-driven pipeline for efficient multi-hand reconstruction in the wild. The proposed pipeline is composed of two components: a real-time fully convolutional hand localization and a high-fidelity transformer-based 3D hand reconstruction model. To tackle the limitations of previous methods and build a robust and stable detection network, we introduce a large-scale dataset with over than 2M in-the-wild hand images with diverse lighting, illumination, and occlusion conditions. Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks. Finally, we showcase the effectiveness of our pipeline to achieve smooth 3D hand tracking from monocular videos, without utilizing any temporal components. Code, models, and dataset are available https://rolpotamias.github.io/WiLoR.
Problem

Research questions and friction points this paper is trying to address.

Develops real-time multi-hand 3D reconstruction in-the-wild
Addresses lack of robust hand detection pipelines
Introduces large-scale dataset for diverse hand conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time fully convolutional hand localization
High-fidelity transformer-based 3D reconstruction
Large-scale dataset with diverse conditions
๐Ÿ”Ž Similar Papers
No similar papers found.