🤖 AI Summary
This work systematically investigates the sources of domain gap between synthetic and real data in 3D hand pose estimation—a problem lacking principled attribution analysis. We propose the first interpretable domain gap decomposition framework, quantifying four key factors: forearm modeling mismatch, image-frequency statistical discrepancies, hand pose distribution shift, and object occlusion divergence. Methodologically, we integrate a controllable rendering pipeline for synthetic data generation, frequency-domain feature analysis, pose-occlusion decoupled modeling, and cross-domain error attribution evaluation. On standard benchmarks (e.g., FreiHAND, HO3D), models trained solely on synthetic data achieve accuracy comparable to those trained on real data—exhibiting less than 0.5 mm mean joint error difference—thereby substantially narrowing the domain gap. To foster reproducibility and further research, our code and dataset are publicly released.
📝 Abstract
Recent synthetic 3D human datasets for the face, body, and hands have pushed the limits on photorealism. Face recognition and body pose estimation have achieved state-of-the-art performance using synthetic training data alone, but for the hand, there is still a large synthetic-to-real gap. This paper presents the first systematic study of the synthetic-to-real gap of 3D hand pose estimation. We analyze the gap and identify key components such as the forearm, image frequency statistics, hand pose, and object occlusions. To facilitate our analysis, we propose a data synthesis pipeline to synthesize high-quality data. We demonstrate that synthetic hand data can achieve the same level of accuracy as real data when integrating our identified components, paving the path to use synthetic data alone for hand pose estimation. Code and data are available at: https://github.com/delaprada/HandSynthesis.git.