Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser

📅 2024-03-07

🏛️ AAAI Conference on Artificial Intelligence

📈 Citations: 14

✨ Influential: 0

career value

191K/year

🤖 AI Summary

Existing diffusion-based monocular 3D human pose estimation methods suffer significant performance degradation when decoupling bone-length and bone-direction prediction, primarily due to error propagation across the hierarchical tree-like skeletal structure and insufficient modeling of hierarchical spatio-temporal dependencies. This work introduces the first decoupled diffusion framework for monocular video sequences: (1) a novel joint decoupled diffusion model for bone length and bone direction; (2) a Hierarchical Spatio-Temporal Denoiser (HSTDenoiser) that explicitly encodes parent-child joint spatial constraints and temporally adjacent joint correlations; and (3) integration of skeletal priors with hierarchical Transformers (HRST for spatial and HRTT for temporal modeling) to enhance structural consistency. Evaluated on Human3.6M and MPI-INF-3DHP, our method achieves MPJPE improvements of 10.0%, 2.0%, and 1.3% over state-of-the-art decoupled, non-decoupled, and probabilistic methods, respectively.

Technology Category

Application Category

📝 Abstract

Recently, diffusion-based methods for monocular 3D human pose estimation have achieved state-of-the-art (SOTA) performance by directly regressing the 3D joint coordinates from the 2D pose sequence. Although some methods decompose the task into bone length and bone direction prediction based on the human anatomical skeleton to explicitly incorporate more human body prior constraints, the performance of these methods is significantly lower than that of the SOTA diffusion-based methods. This can be attributed to the tree structure of the human skeleton. Direct application of the disentangled method could amplify the accumulation of hierarchical errors, propagating through each hierarchy. Meanwhile, the hierarchical information has not been fully explored by the previous methods. To address these problems, a Disentangled Diffusion-based 3D human Pose Estimation method with Hierarchical Spatial and Temporal Denoiser is proposed, termed DDHPose. In our approach: (1) We disentangle the 3d pose and diffuse the bone length and bone direction during the forward process of the diffusion model to effectively model the human pose prior. A disentanglement loss is proposed to supervise diffusion model learning. (2) For the reverse process, we propose Hierarchical Spatial and Temporal Denoiser (HSTDenoiser) to improve the hierarchical modelling of each joint. Our HSTDenoiser comprises two components: the Hierarchical-Related Spatial Transformer (HRST) and the Hierarchical-Related Temporal Transformer (HRTT). HRST exploits joint spatial information and the influence of the parent joint on each joint for spatial modeling, while HRTT utilizes information from both the joint and its hierarchical adjacent joints to explore the hierarchical temporal correlations among joints. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets show that our method outperforms the SOTA disentangled-based, non-disentangled based, and probabilistic approaches by 10.0%, 2.0%, and 1.3%, respectively.

Problem

Research questions and friction points this paper is trying to address.

Improving 3D human pose estimation via disentangled diffusion

Reducing hierarchical error accumulation in skeleton-based models

Enhancing spatial and temporal modeling of joint hierarchies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled diffusion for bone length and direction

Hierarchical Spatial and Temporal Denoiser (HSTDenoiser)

Joint spatial and temporal hierarchical transformers

🔎 Similar Papers

3D Human Pose Analysis via Diffusion Synthesis