🤖 AI Summary
This work addresses the inherent ambiguity of monocular 2D-to-3D human pose lifting, where depth ambiguity and joint uncertainty mean that a single 2D input can map to multiple plausible 3D poses. To tackle this, the authors propose SnapPose3D, a diffusion-based pose-lifting framework. In training, the model learns deterministically to denoise 3D poses conditioned on visual context and 2D pose features. At inference, it draws random samples from a unit Gaussian distribution to generate multiple 3D pose hypotheses and aggregates them into a single final pose. Unlike most previous methods, SnapPose3D resolves pose ambiguity from single frames rather than temporal sequences, which avoids tracking and limits computational cost. The authors report state-of-the-art results on well-known 3D human pose estimation benchmarks.
📝 Abstract
Depth ambiguity and joint uncertainty are the two main obstacles to accurate human pose prediction with the 2D-to-3D lifting methods proposed in the literature. Both issues arise because a 2D joint location can be mapped to multiple 3D positions, which in turn induces multiple possible final poses. Following these considerations, we propose leveraging the generative capability of diffusion-based models to predict multiple hypotheses and aggregate them into a final accurate pose. To this end, we introduce SnapPose3D, a pose-lifting framework trained deterministically to denoise 3D poses conditioned on both visual context and 2D pose features. During inference, SnapPose3D adopts a probabilistic approach, generating multiple hypotheses through random sampling from a unit Gaussian distribution. Unlike most previous methods, which address pose ambiguity by processing temporal sequences, SnapPose3D takes single frames as input. This avoids tracking, limits computational cost and data acquisition complexity, and suits online, real-time applications. We extensively evaluate SnapPose3D on well-known benchmarks for the 3D human pose estimation task, showing its ability to generate and aggregate accurate hypotheses that lead to state-of-the-art results.
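The inference procedure described in the abstract (sample from a unit Gaussian, denoise under a condition, aggregate the hypotheses) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `toy_denoiser` is a hypothetical stand-in for the trained network, conditioning is reduced to the 2D pose alone, and averaging is only one possible aggregation rule, since the abstract does not say which one SnapPose3D uses.

```python
import random
from typing import Callable, List, Sequence, Tuple

Pose2D = Sequence[Tuple[float, float]]   # J joints, (x, y) each
Pose3D = List[List[float]]               # J joints, [x, y, z] each


def sample_unit_gaussian_pose(num_joints: int, rng: random.Random) -> Pose3D:
    """Draw a random 3D 'pose' whose coordinates are i.i.d. N(0, 1)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(num_joints)]


def generate_hypotheses(
    denoiser: Callable[[Pose3D, Pose2D], Pose3D],
    pose_2d: Pose2D,
    num_hypotheses: int,
    seed: int = 0,
) -> List[Pose3D]:
    """Run the denoiser once per Gaussian sample to get several 3D hypotheses."""
    rng = random.Random(seed)
    num_joints = len(pose_2d)
    return [
        denoiser(sample_unit_gaussian_pose(num_joints, rng), pose_2d)
        for _ in range(num_hypotheses)
    ]


def aggregate_mean(hypotheses: List[Pose3D]) -> Pose3D:
    """Assumed aggregation rule: per-joint, per-coordinate average."""
    count = len(hypotheses)
    num_joints = len(hypotheses[0])
    return [
        [sum(h[j][c] for h in hypotheses) / count for c in range(3)]
        for j in range(num_joints)
    ]


def toy_denoiser(noisy_pose: Pose3D, pose_2d: Pose2D) -> Pose3D:
    """Hypothetical stand-in for the trained model: keeps (x, y) from the
    2D condition and lets the noise sample decide the ambiguous depth."""
    return [[x, y, 0.1 * noisy[2]] for (x, y), noisy in zip(pose_2d, noisy_pose)]


if __name__ == "__main__":
    pose_2d = [(0.0, 1.0), (0.5, -0.5)]
    hypotheses = generate_hypotheses(toy_denoiser, pose_2d, num_hypotheses=16)
    print(aggregate_mean(hypotheses))
```

In this toy setup the hypotheses agree on the observed image-plane coordinates and differ only in depth, which mirrors the depth ambiguity the abstract describes; a real model would also condition on visual context features.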