🤖 AI Summary
Problem: Real-world 3D motion annotations are scarce for forecasting bimanual 3D hand motion from a single everyday image. Method: We propose a diffusion-based 4D (3D + temporal) hand motion generation and prediction framework comprising: (1) a diffusion-driven annotation pipeline that lifts 2D keypoint sequences to high-fidelity 4D pseudo-labels; (2) a diffusion loss that explicitly models the multimodal distribution of hand motion; and (3) temporal modeling networks trained on multi-source data to improve zero-shot generalization. Results: Experiments across six benchmarks show that the generated pseudo-labels boost performance by 14% on average, the motion-lifting module outperforms the best baseline by 42%, and end-to-end prediction accuracy improves by 16.4%, with particularly strong generalization to unseen in-the-wild images.
📝 Abstract
We tackle the problem of forecasting bimanual 3D hand motion and articulation from a single image in everyday settings. To address the lack of 3D hand annotations in diverse settings, we design an annotation pipeline consisting of a diffusion model that lifts 2D hand keypoint sequences to 4D hand motion. For the forecasting model, we adopt a diffusion loss to account for the multimodality of the hand motion distribution. Extensive experiments across 6 datasets show the benefits of training on diverse data with imputed labels (14% improvement) and the effectiveness of our lifting (42% better) and forecasting (16.4% gain) models over the best baselines, especially in zero-shot generalization to everyday images.
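The diffusion loss mentioned above can be sketched as a standard DDPM-style noise-prediction objective applied to motion clips. This is a minimal illustration, not the paper's implementation: the shapes, the linear noise schedule, and the toy denoiser are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a batch of hand-motion clips of T frames,
# each frame flattened to D values (assumed 21 joints x 2 hands x 3 coords).
B, T, D = 4, 16, 42 * 3

def linear_alpha_bar(n_steps=1000, beta_1=1e-4, beta_T=0.02):
    """Cumulative signal-retention product for a linear DDPM beta schedule."""
    betas = np.linspace(beta_1, beta_T, n_steps)
    return np.cumprod(1.0 - betas)

def diffusion_loss(denoiser, x0, alpha_bar, rng):
    """Noise-prediction (epsilon) loss on a batch of clean motions x0.

    A random timestep is drawn per clip, the clean motion is corrupted
    with Gaussian noise at that timestep, and the model regresses the
    noise. Unlike a plain L2 loss on future poses, which collapses to a
    single mean trajectory, this objective lets the model represent the
    full multimodal distribution over plausible future motions.
    """
    t = rng.integers(0, len(alpha_bar), size=x0.shape[0])
    a_bar = alpha_bar[t][:, None, None]            # broadcast over (T, D)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    eps_pred = denoiser(x_t, t)
    return np.mean((eps_pred - eps) ** 2)

# Toy "denoiser" that predicts zero noise, just to exercise the loss.
zero_denoiser = lambda x_t, t: np.zeros_like(x_t)

x0 = rng.standard_normal((B, T, D))
loss = diffusion_loss(zero_denoiser, x0, linear_alpha_bar(), rng)
```

At sampling time, the same denoiser would be run iteratively from pure noise, conditioned on the input image, so that repeated runs yield different plausible futures rather than one averaged motion.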