🤖 AI Summary
This work addresses the inherent ambiguities of 3D human pose estimation from 2D images, in particular depth uncertainty and occlusion, which leave the problem underdetermined, and tackles the limited generalization of existing approaches that rely heavily on paired 2D-3D training data. The authors propose a novel paradigm built on an unconditional diffusion model, introducing a geometry-guided mechanism that uses gradients from 2D keypoint heatmaps to steer the reverse diffusion process. This enables the generation of multiple plausible, input-consistent 3D poses without requiring any paired 2D-3D training data, and naturally supports multi-hypothesis prediction and pose completion without retraining conditional models. The method achieves state-of-the-art performance among unpaired-data methods on Human3.6M and demonstrates competitive generalization on the MPI-INF-3DHP and 3DPW benchmarks.
📝 Abstract
3D human pose estimation from 2D images is a challenging problem due to depth ambiguity and occlusion. These challenges make the task underdetermined: multiple, possibly infinitely many, poses can be plausible given the same image. Despite this, many prior works assume a deterministic mapping and estimate a single pose per image. Furthermore, learning-based methods require large amounts of paired 2D-3D data to train and generalize poorly to unseen scenarios. To address both issues, we propose a framework for pose estimation using diffusion models, which enables sampling from a probability distribution over plausible poses consistent with a 2D image. Our approach falls under the guidance framework for conditional generation: it steers samples from an unconditional diffusion model, trained only on 3D data, using the gradients of the heatmaps produced by a 2D keypoint detector. We evaluate our method on the Human3.6M dataset under best-of-m multiple-hypothesis evaluation, showing state-of-the-art performance among methods that do not require paired 2D-3D data for training. We further evaluate generalization on the MPI-INF-3DHP and 3DPW datasets and demonstrate competitive performance. Finally, we demonstrate the flexibility of our framework on novel tasks, including pose generation and pose completion, without the need to train bespoke conditional models. We make code available at https://github.com/fsnelgar/diffusion_pose.
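The core idea, steering an unconditional diffusion model's sampling loop with the gradient of a 2D keypoint likelihood, can be sketched as below. This is a minimal illustrative toy, not the paper's implementation: it assumes an orthographic camera (drop the depth coordinate), replaces detector heatmaps with a Gaussian log-likelihood around detected 2D keypoints, and uses a simple Langevin-style update in place of the paper's full sampler. The names `project`, `heatmap_grad`, and `guided_reverse_step` are hypothetical.

```python
import numpy as np

def project(pose_3d):
    """Illustrative orthographic camera: keep (x, y), drop depth z."""
    return pose_3d[:, :2]

def heatmap_grad(pose_3d, keypoints_2d, sigma=1.0):
    """Gradient of a Gaussian 2D-keypoint log-likelihood w.r.t. the 3D pose.
    Stands in for the gradient of real detector heatmaps. Under orthographic
    projection, depth receives no guidance signal and is shaped only by the
    diffusion prior, which is why multiple depth hypotheses remain plausible."""
    grad_2d = (keypoints_2d - project(pose_3d)) / sigma ** 2
    grad = np.zeros_like(pose_3d)
    grad[:, :2] = grad_2d  # chain rule through the (linear) projection
    return grad

def guided_reverse_step(x_t, t, score_fn, keypoints_2d,
                        guidance_scale=1.0, step_size=0.1, rng=None):
    """One guided sampling step: unconditional score plus scaled heatmap
    gradient, with Langevin-style noise (omitted at the final step t == 0)."""
    rng = np.random.default_rng() if rng is None else rng
    total_score = score_fn(x_t, t) + guidance_scale * heatmap_grad(x_t, keypoints_2d)
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0
    return x_t + step_size * total_score + np.sqrt(2 * step_size) * noise
```

Running the loop repeatedly from different noise initializations yields distinct 3D poses that all reproject near the same 2D keypoints, which is how a single unconditional model supports multi-hypothesis prediction; masking the guidance gradient for unobserved joints gives pose completion for free.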