Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models for 3D human pose estimation are computationally expensive because they require many iterative denoising steps and sample multiple hypotheses. This paper proposes a Hierarchical Temporal Pruning (HTP) strategy that combines frame-level dynamic pruning with semantic-level pose token pruning, built from three components: temporal correlation modeling, sparsity-focused multi-head self-attention, and mask-guided pose token pruning via clustering. The method thereby sparsifies both the inter-frame motion structure and the less informative pose tokens. Evaluated on Human3.6M and MPI-INF-3DHP, it reduces training MACs by 38.5% and inference MACs by 56.8%, and improves inference speed by an average of 81.1% relative to prior diffusion-based methods, while achieving state-of-the-art accuracy. The core contribution is a hierarchical pruning paradigm tailored to diffusion-based pose estimation that preserves critical motion dynamics while lowering inference cost.

📝 Abstract
Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5%, inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance.
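The frame-level stage described above can be illustrated with a small sketch. This is not the paper's TCEP module: the abstract does not specify the scoring rule, so the example assumes a simple stand-in in which a frame counts as redundant when its token is highly cosine-similar to its immediate temporal neighbours. The names `cosine` and `prune_frames` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_frames(tokens, keep):
    """Return sorted indices of the `keep` frames least similar to their neighbours.

    Assumption (not from the paper): redundancy of frame t is the mean cosine
    similarity between its token and the tokens of frames t-1 and t+1.
    """
    n = len(tokens)
    if keep <= 0:
        return []
    if keep >= n:
        return list(range(n))
    redundancy = []
    for t in range(n):
        neighbours = [tokens[i] for i in (t - 1, t + 1) if 0 <= i < n]
        score = sum(cosine(tokens[t], nb) for nb in neighbours) / len(neighbours)
        redundancy.append(score)
    # Keep the frames with the lowest redundancy, restored to temporal order.
    order = sorted(range(n), key=lambda t: redundancy[t])
    return sorted(order[:keep])


# Toy sequence: frame 3 is the only one whose token points in a new direction.
frames = [[1, 0], [1, 0], [1, 0], [0, 1], [1, 0], [1, 0]]
kept = prune_frames(frames, keep=2)
```

In this toy sequence the frame where the motion changes, and one frame adjacent to it, survive pruning; later stages would then attend only over the kept indices.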
Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost of diffusion-based 3D pose estimation
Pruning redundant pose tokens while preserving motion dynamics
Improving efficiency while maintaining state-of-the-art performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Temporal Pruning strategy
Temporal Correlation-Enhanced Pruning (TCEP) of frames
Mask-Guided Pose Token Pruner
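To make the idea of clustering-based token pruning concrete, here is a minimal sketch. It is an assumption-laden illustration, not the paper's MGPTP: the summary does not say which clustering algorithm or mask definition is used, so the example swaps in plain k-means with a deterministic initialisation and a boolean mask marking the tokens that survived frame-level pruning. The names `sq_dist` and `cluster_prune` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_prune(tokens, mask, k, iters=10):
    """Cluster the masked-in tokens and keep one representative index per cluster.

    Assumptions (not from the paper): k-means with evenly spaced initial
    centroids; the representative is the token nearest its cluster centroid.
    Empty clusters contribute no representative, so fewer than k indices
    may be returned.
    """
    active = [i for i, keep in enumerate(mask) if keep]
    if k <= 0:
        return []
    if k >= len(active):
        return active
    centroids = [list(tokens[active[(j * len(active)) // k]]) for j in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each active token joins its nearest centroid.
        groups = [[] for _ in range(k)]
        for i in active:
            nearest = min(range(k), key=lambda j: sq_dist(tokens[i], centroids[j]))
            groups[nearest].append(i)
        # Update step: move each non-empty centroid to the mean of its members.
        for j, members in enumerate(groups):
            if members:
                dim = len(centroids[j])
                centroids[j] = [
                    sum(tokens[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    representatives = set()
    for j, members in enumerate(groups):
        if members:
            representatives.add(
                min(members, key=lambda i: sq_dist(tokens[i], centroids[j]))
            )
    return sorted(representatives)


# Toy 1-D tokens: two obvious clusters, plus one token already masked out.
pose_tokens = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0], [50.0]]
frame_mask = [True, True, True, True, True, True, False]
retained = cluster_prune(pose_tokens, frame_mask, k=2)
```

The mask keeps tokens from already-pruned frames out of the clustering, which mirrors the top-down ordering described in the abstract: frame-level pruning first, finer token-level pruning afterwards.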
Yuquan Bi
School of Cyber Science and Engineering, Southeast University, Nanjing, China
Hongsong Wang
School of Computer Science and Engineering, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Southeast University, Nanjing 210096, China
Xinli Shi
ARC DECRA Fellow
Distributed Learning, Multi-Agent Reinforcement Learning, MPC
Zhipeng Gui
Professor of GIScience, Wuhan University
GeoAI, Spatiotemporal Data Analysis, Web Service & QoS, High Performance Computing
Jie Gui
Southeast University, China
Pattern Recognition and Machine Learning, Artificial Intelligence, Data Mining, Deep Learning, Image Processing and Computer Vis
Yuan Yan Tang
University of Macau
Pattern recognition, Image processing