TCPFormer: Learning Temporal Correlation with Implicit Pose Proxy for 3D Human Pose Estimation

📅 2025-01-03
🤖 AI Summary
Existing 3D human pose estimation methods typically model only a single temporal dependency, which limits their ability to capture complex long-range temporal correlations in 2D pose sequences and constrains accuracy and robustness. To address this, we propose TCPFormer, a Transformer-based method built around an implicit pose proxy: a learnable intermediate representation that enables multi-path, fine-grained spatiotemporal dependency modeling. TCPFormer comprises three modules, the Proxy Update Module (PUM), Proxy Invocation Module (PIM), and Proxy Attention Module (PAM), which jointly refine temporal dynamics and spatial relationships. The method significantly improves resilience to rapid motion and occlusion. Evaluated on Human3.6M and MPI-INF-3DHP, it achieves state-of-the-art performance, reducing mean per-joint position error by 3.2% and 4.7%, respectively. These results demonstrate both effectiveness and strong generalization across diverse motion patterns and challenging scenarios.
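For readers unfamiliar with the metric mentioned above, mean per-joint position error is the Euclidean distance between predicted and ground-truth 3D joints, averaged over all joints and frames. The snippet below is a minimal sketch of that definition only; the function name `mpjpe`, the array layout, and the toy values are illustrative assumptions and do not come from the paper or its evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (frames, joints, 3) in the same unit
    (this layout is an assumption made for the example).
    """
    # Euclidean distance per joint, then the mean over joints and frames.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy data (hypothetical): every joint is displaced by (3, 4, 0),
# so each per-joint distance is 5.
gt = np.zeros((2, 17, 3))
pred = gt.copy()
pred[..., 0] += 3.0
pred[..., 1] += 4.0

print(mpjpe(pred, gt))  # 5.0
```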

📝 Abstract
Recent multi-frame lifting methods have dominated 3D human pose estimation. However, previous methods ignore the intricate dependence within the 2D pose sequence and learn only a single temporal correlation. To alleviate this limitation, we propose TCPFormer, which leverages an implicit pose proxy as an intermediate representation. Each proxy within the implicit pose proxy can build one temporal correlation, which helps the model learn a more comprehensive temporal correlation of human motion. Specifically, our method consists of three key components: Proxy Update Module (PUM), Proxy Invocation Module (PIM), and Proxy Attention Module (PAM). PUM first uses pose features to update the implicit pose proxy, enabling it to store representative information from the pose sequence. PIM then invokes the pose proxy and integrates it with the pose sequence to enhance the motion semantics of each pose. Finally, PAM leverages the above mapping between the pose sequence and the pose proxy to enhance the temporal correlation of the whole pose sequence. Experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that our proposed TCPFormer outperforms the previous state-of-the-art methods.
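The three steps described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it assumes each module can be approximated by single-head scaled dot-product attention with no learned projections, and it assumes the "mapping" used by PAM is the composition of the pose-to-proxy and proxy-to-pose attention maps. All names (`cross_attention`, `pose_feats`, `proxy`) and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections
    (a simplification assumed for this sketch)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values, weights

rng = np.random.default_rng(0)
T, P, D = 27, 4, 16                    # frames, proxies, feature width (hypothetical)
pose_feats = rng.normal(size=(T, D))   # stand-in for per-frame pose features
proxy = rng.normal(size=(P, D))        # stand-in for the implicit pose proxy

# PUM-like step: proxies attend to the pose features and are updated.
update, proxy_to_pose = cross_attention(proxy, pose_feats)      # weights: (P, T)
proxy = proxy + update

# PIM-like step: each pose attends to the updated proxies.
invoked, pose_to_proxy = cross_attention(pose_feats, proxy)     # weights: (T, P)
pose_feats = pose_feats + invoked

# PAM-like step (assumed form): compose the two maps into a frame-to-frame
# correlation that is routed through the proxies.
temporal_corr = pose_to_proxy @ proxy_to_pose                   # (T, T)
enhanced = temporal_corr @ pose_feats                           # (T, D)

print(temporal_corr.shape, enhanced.shape)
```

Under these assumptions, each of the P proxies contributes one path between frames, so the composed map mixes several temporal correlations instead of one. The real model presumably adds learned projections, multiple heads, and normalization, none of which are shown here.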
Problem

Research questions and friction points this paper is trying to address.

3D Human Pose Estimation
Temporal Relationships
2D Pose Sequence
Innovation

Methods, ideas, or system contributions that make the work stand out.

TCPFormer
Implicit Pose Representation
Temporal Pose Estimation
Jiajie Liu
Peking University
Computer Vision
Mengyuan Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Hong Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Wenhao Li
Nanyang Technological University