Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser

📅 2024-03-07
🏛️ AAAI Conference on Artificial Intelligence
📈 Citations: 14
Influential: 0
📄 PDF
🤖 AI Summary
Existing diffusion-based monocular 3D human pose estimation methods suffer significant performance degradation when decoupling bone-length and bone-direction prediction, primarily due to error propagation across the hierarchical tree-like skeletal structure and insufficient modeling of hierarchical spatio-temporal dependencies. This work introduces the first decoupled diffusion framework for monocular video sequences: (1) a novel joint decoupled diffusion model for bone length and bone direction; (2) a Hierarchical Spatio-Temporal Denoiser (HSTDenoiser) that explicitly encodes parent-child joint spatial constraints and temporally adjacent joint correlations; and (3) integration of skeletal priors with hierarchical Transformers (HRST for spatial and HRTT for temporal modeling) to enhance structural consistency. Evaluated on Human3.6M and MPI-INF-3DHP, our method achieves MPJPE improvements of 10.0%, 2.0%, and 1.3% over state-of-the-art decoupled, non-decoupled, and probabilistic methods, respectively.

Technology Category

Application Category

📝 Abstract
Recently, diffusion-based methods for monocular 3D human pose estimation have achieved state-of-the-art (SOTA) performance by directly regressing the 3D joint coordinates from the 2D pose sequence. Although some methods decompose the task into bone length and bone direction prediction based on the human anatomical skeleton to explicitly incorporate more human body prior constraints, the performance of these methods is significantly lower than that of the SOTA diffusion-based methods. This can be attributed to the tree structure of the human skeleton. Direct application of the disentangled method could amplify the accumulation of hierarchical errors, propagating through each hierarchy. Meanwhile, the hierarchical information has not been fully explored by the previous methods. To address these problems, a Disentangled Diffusion-based 3D human Pose Estimation method with Hierarchical Spatial and Temporal Denoiser is proposed, termed DDHPose. In our approach: (1) We disentangle the 3d pose and diffuse the bone length and bone direction during the forward process of the diffusion model to effectively model the human pose prior. A disentanglement loss is proposed to supervise diffusion model learning. (2) For the reverse process, we propose Hierarchical Spatial and Temporal Denoiser (HSTDenoiser) to improve the hierarchical modelling of each joint. Our HSTDenoiser comprises two components: the Hierarchical-Related Spatial Transformer (HRST) and the Hierarchical-Related Temporal Transformer (HRTT). HRST exploits joint spatial information and the influence of the parent joint on each joint for spatial modeling, while HRTT utilizes information from both the joint and its hierarchical adjacent joints to explore the hierarchical temporal correlations among joints. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets show that our method outperforms the SOTA disentangled-based, non-disentangled based, and probabilistic approaches by 10.0%, 2.0%, and 1.3%, respectively.
Problem

Research questions and friction points this paper is trying to address.

Improving 3D human pose estimation via disentangled diffusion
Reducing hierarchical error accumulation in skeleton-based models
Enhancing spatial and temporal modeling of joint hierarchies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled diffusion for bone length and direction
Hierarchical Spatial and Temporal Denoiser (HSTDenoiser)
Joint spatial and temporal hierarchical transformers
🔎 Similar Papers
No similar papers found.
Q
Qingyuan Cai
School of Artificial Intelligence, Beijing Normal University
X
Xuecai Hu
School of Artificial Intelligence, Beijing Normal University
Saihui Hou
Saihui Hou
Beijing Normal University
Deep LearningComputer VisionMultimodal Large Language Models
L
Li Yao
School of Artificial Intelligence, Beijing Normal University
Yongzhen Huang
Yongzhen Huang
School of Artificial Intelligence, Beijing Normal University
Computer VisionPattern RecognitionDeep Learning