🤖 AI Summary
Monocular 3D human pose estimation loses accuracy under depth ambiguity and occlusion, and existing methods struggle to model multi-scale, high-order structural dependencies among joints. This paper proposes HyperDiff, a framework that combines hypergraph convolution with diffusion modeling. HyperGCN serves as the diffusion denoiser and uses hypergraphs to explicitly capture multi-granularity, high-order joint correlations, which improves robustness to occlusion and depth uncertainty. The diffusion process, in turn, probabilistically models the inherent ambiguity of the 2D-to-3D mapping. Evaluated on Human3.6M and MPI-INF-3DHP, the method achieves state-of-the-art accuracy and scales with the available computation, so it can be deployed under different resource constraints.
📝 Abstract
Monocular 3D human pose estimation (HPE) often faces depth ambiguity and occlusion during the 2D-to-3D lifting process. In addition, traditional methods may overlook multi-scale skeleton features when exploiting skeleton structure information, which can reduce the accuracy of pose estimation. To address these challenges, this paper introduces HyperDiff, a novel 3D pose estimation method that integrates diffusion models with HyperGCN. The diffusion model captures data uncertainty, mitigating the effects of depth ambiguity and occlusion. Meanwhile, HyperGCN serves as the denoiser and employs multi-granularity structures to model high-order correlations between joints, which improves the model's denoising capability, especially for complex poses. Experimental results demonstrate that HyperDiff achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets and can flexibly adapt to varying computational resources to balance performance and efficiency.