🤖 AI Summary
Diffusion models for 3D human pose estimation suffer from high computational overhead caused by their many iterative denoising steps and multi-hypothesis sampling. This paper proposes a hierarchical temporal pruning strategy, the first to jointly integrate frame-level dynamic pruning with semantic-level pose token pruning, supported by temporal correlation modeling, sparsity-focused multi-head self-attention, and mask-guided adaptive graph clustering. The method sparsifies the pose sequence along two axes at once: the inter-frame motion structure and the semantic importance of individual pose tokens. Evaluated on Human3.6M and MPI-INF-3DHP, it reduces training multiply-accumulate operations (MACs) by 38.5% and inference MACs by 56.8%, while speeding up inference by an average of 81.1%, all without compromising state-of-the-art accuracy. The core contribution is the first hierarchical pruning paradigm tailored to diffusion-based pose estimation, balancing motion fidelity against inference efficiency.
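The frame-level half of this idea can be pictured with a small sketch. The Python/PyTorch snippet below is an illustration under assumed details, not the paper's TCEP module: it scores each frame by how weakly its motion correlates with its neighbours and keeps the top-scoring fraction, which is the kind of inter-frame correlation-based selection the summary describes. The function name `select_key_frames`, the cosine-similarity scoring rule, and the `keep_ratio` parameter are all hypothetical.

```python
# Illustrative sketch only: scores frames by inter-frame motion correlation and keeps
# the least redundant ones. All names and the scoring rule are assumptions, not the
# paper's implementation.
import torch

def select_key_frames(pose_seq: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """pose_seq: (B, T, J, C) per-frame joint features (e.g. lifted 2D poses).
    Returns indices of the kept frames, shape (B, T_keep), sorted in time order."""
    B, T, J, C = pose_seq.shape
    flat = pose_seq.reshape(B, T, J * C)                      # one vector per frame
    motion = flat[:, 1:] - flat[:, :-1]                       # frame-to-frame motion
    motion = torch.cat([motion[:, :1], motion], dim=1)        # pad so every frame has a score
    # Frames whose motion correlates weakly with the previous frame are scored higher,
    # since they are harder to predict from their neighbours.
    sim = torch.nn.functional.cosine_similarity(motion[:, 1:], motion[:, :-1], dim=-1)
    sim = torch.cat([sim[:, :1], sim], dim=1)                 # (B, T)
    score = 1.0 - sim                                         # low correlation -> high importance
    k = max(1, int(T * keep_ratio))
    keep_idx = score.topk(k, dim=1).indices.sort(dim=1).values
    return keep_idx
```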
📝 Abstract
Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5% and inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance.
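To make the semantic-level stage (3) concrete, here is a minimal sketch of clustering-based pose token pruning in the spirit of MGPTP. It is an assumption-laden illustration rather than the authors' implementation: the informativeness proxy (token norm), the nearest-center assignment, and the function name `prune_pose_tokens` are invented for exposition.

```python
# A minimal sketch, assuming semantic token pruning can be approximated by picking
# cluster centers among the most "informative" tokens and merging every token into
# its nearest center. Not the paper's MGPTP implementation.
import torch

def prune_pose_tokens(tokens: torch.Tensor, num_keep: int) -> torch.Tensor:
    """tokens: (B, N, D) pose tokens after the sparse temporal attention stage.
    Returns (B, num_keep, D): one merged token per cluster."""
    B, N, D = tokens.shape
    # Cheap informativeness proxy: token norm picks the cluster centers.
    centers_idx = tokens.norm(dim=-1).topk(num_keep, dim=1).indices               # (B, num_keep)
    centers = torch.gather(tokens, 1, centers_idx.unsqueeze(-1).expand(-1, -1, D))
    # Assign every token to its nearest center and average each cluster, so pruned
    # tokens still contribute to the representatives that survive.
    dist = torch.cdist(tokens, centers)                                            # (B, N, num_keep)
    assign = dist.argmin(dim=-1)                                                    # (B, N)
    onehot = torch.nn.functional.one_hot(assign, num_keep).to(tokens.dtype)        # (B, N, num_keep)
    pooled = torch.einsum("bnk,bnd->bkd", onehot, tokens)
    counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)                           # (B, num_keep, 1)
    return pooled / counts
```

Note the design choice in this sketch: tokens are merged into their nearest surviving representative rather than simply dropped, so information from pruned tokens is aggregated instead of discarded; whether HTP merges or hard-drops tokens is a detail the abstract does not specify.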