HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 2D-to-3D human pose estimation methods rely on temporal or visual cues to mitigate occlusion but do not address the underlying bottleneck of sparse skeleton inputs, which limits the robustness of 3D reconstruction. This paper proposes a two-stage generative densification framework. First, hierarchical skeleton tokenization and skeleton-aware alignment, combined with a hierarchical autoregressive Transformer, generate dense, structurally consistent 2D poses from sparse 2D keypoints. Second, single-frame 3D pose lifting is performed on the dense 2D input. To our knowledge, this is the first method to jointly optimize skeleton-aware generative 2D densification and 3D estimation. It achieves state-of-the-art performance on single-frame 3D human pose estimation (HPE) and outperforms numerous multi-frame approaches while requiring fewer parameters and lower computational cost. It is also compatible with multi-frame methods and can be combined with them for further gains.

📝 Abstract
Existing 2D-to-3D human pose estimation (HPE) methods tackle occlusion by enriching the lifting stage with additional information such as temporal and visual cues. In this paper, we argue that these methods ignore the limitation of the sparse 2D skeleton input representation, which fundamentally restricts 2D-to-3D lifting and worsens the occlusion problem. To address this, we propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), which generates hierarchical dense 2D poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module that quantizes the highly dense 2D pose into hierarchical tokens, and we propose a Skeleton-aware Alignment to strengthen token connections. We then develop a Hierarchical AutoRegressive Modeling scheme for hierarchical 2D pose generation. With the generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on single-frame 3D HPE. Moreover, it outperforms numerous multi-frame methods while reducing parameter count and computational complexity, and it can also complement them to further enhance performance and robustness.
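To make the idea of quantizing a dense 2D pose into hierarchical tokens more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a coarse-to-fine residual quantization scheme with average pooling and a fixed grid codebook. The abstract specifies none of these choices, and the names `pool`, `unpool`, `tokenize`, and the toy codebook are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def nearest_code(p: Point, codebook: List[Point]) -> int:
    """Return the index of the codebook entry closest to p (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: (p[0] - codebook[i][0]) ** 2 + (p[1] - codebook[i][1]) ** 2,
    )


def pool(points: List[Point], n: int) -> List[Point]:
    """Average-pool a point sequence to n points (len(points) must be divisible by n)."""
    size = len(points) // n
    out = []
    for i in range(n):
        chunk = points[i * size:(i + 1) * size]
        out.append((sum(x for x, _ in chunk) / size, sum(y for _, y in chunk) / size))
    return out


def unpool(points: List[Point], length: int) -> List[Point]:
    """Repeat each point so the sequence has `length` entries again."""
    rep = length // len(points)
    return [p for p in points for _ in range(rep)]


def tokenize(pose: List[Point], codebook: List[Point], scales: List[int]):
    """Assumed scheme: quantize the remaining residual at each scale, coarse to fine."""
    residual = list(pose)
    recon = [(0.0, 0.0)] * len(pose)
    tokens = []
    for n in scales:
        ids = [nearest_code(p, codebook) for p in pool(residual, n)]
        tokens.append(ids)
        quant = unpool([codebook[i] for i in ids], len(pose))
        recon = [(a[0] + q[0], a[1] + q[1]) for a, q in zip(recon, quant)]
        residual = [(r[0] - q[0], r[1] - q[1]) for r, q in zip(residual, quant)]
    return tokens, recon


# Toy data (hypothetical): 4 points and a 9x9 grid codebook with step 0.25.
codebook = [(x * 0.25, y * 0.25) for x in range(-4, 5) for y in range(-4, 5)]
pose = [(0.0, 0.0), (0.5, 0.0), (1.0, 1.0), (1.0, 0.5)]
tokens, recon = tokenize(pose, codebook, scales=[1, 2, 4])
```

In this sketch, `tokens` holds one list of token ids per scale, from a single coarse token up to one token per point, and `recon` is the sum of the quantized contributions from all scales. A learned codebook and encoder would replace the fixed grid in any real system.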
Problem

Research questions and friction points this paper is trying to address.

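Addresses occlusion in 2D-to-3D human pose estimation, which is worsened by sparse 2D skeleton inputs
Generates hierarchical dense 2D poses from the sparse 2D input pose
Improves robustness under occlusion while reducing parameter count and computational complexity relative to multi-frame methods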
Innovation

Methods, ideas, or system contributions that make the work stand out.

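Hierarchical Pose AutoRegressive Transformer (HiPART), which generates hierarchical dense 2D poses to handle occlusion
Multi-scale skeleton tokenization that quantizes dense 2D poses into hierarchical tokens
Skeleton-aware Alignment to strengthen token connections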
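The control flow of hierarchical autoregressive generation can be illustrated with a short sketch. It assumes that each scale's tokens are predicted from the sparse pose tokens plus all coarser scales generated so far. The abstract does not spell out this conditioning scheme, so treat it as one plausible reading. `generate_hierarchy` and `toy_predict` are hypothetical names, and `toy_predict` is a deterministic stand-in for a trained Transformer.

```python
from typing import Callable, List, Sequence

# Hypothetical predictor signature: (sparse tokens, coarser scales so far, size of
# the next scale) -> token ids for that next scale.
Predictor = Callable[[Sequence[int], List[List[int]], int], List[int]]


def generate_hierarchy(
    sparse_tokens: Sequence[int], scales: Sequence[int], predict: Predictor
) -> List[List[int]]:
    """Generate token levels coarse to fine, conditioning on every earlier level."""
    generated: List[List[int]] = []
    for n in scales:
        ids = predict(sparse_tokens, generated, n)
        if len(ids) != n:
            raise ValueError(f"predictor returned {len(ids)} tokens, expected {n}")
        generated.append(list(ids))
    return generated


def toy_predict(
    sparse_tokens: Sequence[int],
    previous: List[List[int]],
    n: int,
    vocab_size: int = 16,
) -> List[int]:
    """Illustration only: derive tokens from a checksum of the context."""
    context = sum(sparse_tokens) + sum(sum(level) for level in previous)
    return [(context + i) % vocab_size for i in range(n)]


levels = generate_hierarchy([3, 7, 1], scales=[1, 2, 4], predict=toy_predict)
print(levels)  # [[11], [6, 7], [3, 4, 5, 6]]
```

Because every level depends on the levels before it, changing the sparse input or any coarse token changes all finer levels, which is the property an autoregressive hierarchy relies on.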