🤖 AI Summary
Diffusion models for 3D human pose estimation suffer from high computational overhead caused by their many iterative denoising steps and multi-hypothesis sampling. This paper proposes a hierarchical temporal pruning strategy, the first to jointly integrate frame-level dynamic pruning with semantic-level pose token pruning, supported by temporal correlation modeling, sparsity-focused multi-head self-attention, and mask-guided adaptive graph clustering. The method sparsifies the pose sequence along two axes at once: the inter-frame motion structure and the semantic importance of individual pose tokens. Evaluated on Human3.6M and MPI-INF-3DHP, it reduces training multiply-accumulate operations (MACs) by 38.5% and inference MACs by 56.8%, while speeding up inference by an average of 81.1%, all without compromising state-of-the-art accuracy. The core contribution is the first hierarchical pruning paradigm tailored to diffusion-based pose estimation, balancing motion fidelity against inference efficiency.
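The frame-level half of this idea can be pictured with a small sketch. The Python/PyTorch snippet below is an illustration under assumed details, not the paper's TCEP module: it scores each frame by how weakly its motion correlates with its neighbours and keeps the top-scoring fraction, which is the kind of inter-frame correlation-based selection the summary describes. The function name `select_key_frames`, the cosine-similarity scoring rule, and the `keep_ratio` parameter are all hypothetical.

```python
# Illustrative sketch only: scores frames by inter-frame motion correlation and keeps
# the least redundant ones. All names and the scoring rule are assumptions, not the
# paper's implementation.
import torch

def select_key_frames(pose_seq: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """pose_seq: (B, T, J, C) per-frame joint features (e.g. lifted 2D poses).
    Returns indices of the kept frames, shape (B, T_keep), sorted in time order."""
    B, T, J, C = pose_seq.shape
    flat = pose_seq.reshape(B, T, J * C)                      # one vector per frame
    motion = flat[:, 1:] - flat[:, :-1]                       # frame-to-frame motion
    motion = torch.cat([motion[:, :1], motion], dim=1)        # pad so every frame has a score
    # Frames whose motion correlates weakly with the previous frame are scored higher,
    # since they are harder to predict from their neighbours.
    sim = torch.nn.functional.cosine_similarity(motion[:, 1:], motion[:, :-1], dim=-1)
    sim = torch.cat([sim[:, :1], sim], dim=1)                 # (B, T)
    score = 1.0 - sim                                         # low correlation -> high importance
    k = max(1, int(T * keep_ratio))
    keep_idx = score.topk(k, dim=1).indices.sort(dim=1).values
    return keep_idx
```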
📝 Abstract
Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5% and inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance.
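To make the semantic-level stage (3) concrete, here is a minimal sketch of clustering-based pose token pruning in the spirit of MGPTP. It is an assumption-laden illustration rather than the authors' implementation: the informativeness proxy (token norm), the nearest-center assignment, and the function name `prune_pose_tokens` are invented for exposition.

```python
# A minimal sketch, assuming semantic token pruning can be approximated by picking
# cluster centers among the most "informative" tokens and merging every token into
# its nearest center. Not the paper's MGPTP implementation.
import torch

def prune_pose_tokens(tokens: torch.Tensor, num_keep: int) -> torch.Tensor:
    """tokens: (B, N, D) pose tokens after the sparse temporal attention stage.
    Returns (B, num_keep, D): one merged token per cluster."""
    B, N, D = tokens.shape
    # Cheap informativeness proxy: token norm picks the cluster centers.
    centers_idx = tokens.norm(dim=-1).topk(num_keep, dim=1).indices               # (B, num_keep)
    centers = torch.gather(tokens, 1, centers_idx.unsqueeze(-1).expand(-1, -1, D))
    # Assign every token to its nearest center and average each cluster, so pruned
    # tokens still contribute to the representatives that survive.
    dist = torch.cdist(tokens, centers)                                            # (B, N, num_keep)
    assign = dist.argmin(dim=-1)                                                    # (B, N)
    onehot = torch.nn.functional.one_hot(assign, num_keep).to(tokens.dtype)        # (B, N, num_keep)
    pooled = torch.einsum("bnk,bnd->bkd", onehot, tokens)
    counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)                           # (B, num_keep, 1)
    return pooled / counts
```

Note the design choice in this sketch: tokens are merged into their nearest surviving representative rather than simply dropped, so information from pruned tokens is aggregated instead of discarded; whether HTP merges or hard-drops tokens is a detail the abstract does not specify.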