SCJD: Sparse Correlation and Joint Distillation for Efficient 3D Human Pose Estimation

📅 2025-03-18
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing 3D human pose estimation methods suffer from high computational cost and slow inference, while conventional knowledge distillation struggles to capture joint-wise spatial geometry and multi-frame temporal dependencies. To address these challenges, we propose a novel knowledge distillation framework for efficient and accurate 3D pose estimation. Our key contributions are: (1) sparse correlation-based input downsampling to reduce computational redundancy; (2) dynamic joint embedding distillation to explicitly model geometric constraints among joints; (3) adjacency-aware joint attention distillation to enhance local structural awareness; and (4) temporal consistency distillation to preserve motion coherence across frames. Combined with sparse sequence sampling, multi-frame contextual feature distillation, and global upsampling supervision, our method achieves state-of-the-art performance on standard benchmarks (e.g., Human3.6M, MPI-INF-3DHP), improving inference speed by over 2× while maintaining pose accuracy, a favorable trade-off between efficiency and precision.
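
To make contribution (3) more concrete, the sketch below shows one way an adjacency-aware attention distillation term could be computed. It is an illustration under stated assumptions, not the authors' implementation: the 5-joint chain skeleton, the mean-squared-error objective, and the names `adjacency_mask` and `adjacent_attention_loss` are all hypothetical, and the random attention maps merely stand in for real teacher and student outputs.

```python
import numpy as np

# Hypothetical 5-joint chain skeleton (0-1-2-3-4). A real setup would use the
# dataset's own edge list (e.g. the commonly used 17-joint Human3.6M layout).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
NUM_JOINTS = 5


def adjacency_mask(num_joints, edges):
    """Boolean (J, J) mask: True for each joint itself and its direct neighbours."""
    mask = np.eye(num_joints, dtype=bool)
    for i, j in edges:
        mask[i, j] = True
        mask[j, i] = True
    return mask


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def adjacent_attention_loss(student_attn, teacher_attn, mask):
    """MSE between two (J, J) attention maps, counted only on adjacent joint pairs."""
    diff = (student_attn - teacher_attn) ** 2
    return float(diff[mask].mean())


# Random attention maps stand in for real teacher/student outputs (assumption).
rng = np.random.default_rng(0)
teacher_attn = softmax(rng.normal(size=(NUM_JOINTS, NUM_JOINTS)))
student_attn = softmax(rng.normal(size=(NUM_JOINTS, NUM_JOINTS)))
mask = adjacency_mask(NUM_JOINTS, EDGES)
loss = adjacent_attention_loss(student_attn, teacher_attn, mask)
```

Masking the loss means the student is only penalised for attention mismatches between a joint and its skeletal neighbours, which is one plausible reading of "local structural awareness"; the paper may weight or define adjacency differently.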

📝 Abstract
Existing 3D Human Pose Estimation (HPE) methods achieve high accuracy but suffer from computational overhead and slow inference, while knowledge distillation methods fail to address spatial relationships between joints and temporal correlations in multi-frame inputs. In this paper, we propose Sparse Correlation and Joint Distillation (SCJD), a novel framework that balances efficiency and accuracy for 3D HPE. SCJD introduces Sparse Correlation Input Sequence Downsampling to reduce redundancy in student network inputs while preserving inter-frame correlations. For effective knowledge transfer, we propose Dynamic Joint Spatial Attention Distillation, which includes Dynamic Joint Embedding Distillation to enhance the student's feature representation using the teacher's multi-frame context feature, and Adjacent Joint Attention Distillation to improve the student network's focus on adjacent joint relationships for better spatial understanding. Additionally, Temporal Consistency Distillation aligns the temporal correlations between teacher and student networks through upsampling and global supervision. Extensive experiments demonstrate that SCJD achieves state-of-the-art performance. Code is available at https://github.com/wileychan/SCJD.
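
The abstract describes downsampling the student's input sequence and then aligning temporal correlations with the teacher after upsampling. A minimal sketch of that pipeline follows, assuming uniform-stride frame selection, linear interpolation for upsampling, and a cosine-similarity frame correlation matrix; none of these choices is confirmed by the abstract, and all function names are hypothetical.

```python
import numpy as np


def sparse_downsample(seq, stride):
    """Keep every `stride`-th frame; return the kept frames and their indices."""
    idx = np.arange(0, seq.shape[0], stride)
    return seq[idx], idx


def upsample_linear(sparse_seq, idx, num_frames):
    """Linearly interpolate a sparse sequence back to `num_frames` frames.

    Frames past the last kept index are held at the last kept value.
    """
    flat = sparse_seq.reshape(len(idx), -1)
    t = np.arange(num_frames)
    cols = [np.interp(t, idx, flat[:, c]) for c in range(flat.shape[1])]
    return np.stack(cols, axis=1).reshape((num_frames,) + sparse_seq.shape[1:])


def temporal_correlation(seq):
    """Frame-to-frame cosine similarity matrix of shape (T, T)."""
    flat = seq.reshape(seq.shape[0], -1)
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    return unit @ unit.T


def temporal_consistency_loss(student_seq, teacher_seq):
    """MSE between the temporal correlation matrices of two sequences."""
    gap = temporal_correlation(student_seq) - temporal_correlation(teacher_seq)
    return float(np.mean(gap ** 2))


# Toy 2D pose sequence with linear motion: 9 frames, 17 joints, (x, y).
T, J = 9, 17
t = np.arange(1, T + 1, dtype=float)
poses_2d = t[:, None, None] * np.ones((T, J, 2))

sparse, idx = sparse_downsample(poses_2d, stride=4)   # keeps frames 0, 4, 8
restored = upsample_linear(sparse, idx, T)            # back to 9 frames
tc_loss = temporal_consistency_loss(restored, poses_2d)
```

Because the toy motion is linear, interpolation recovers the full sequence and the consistency term vanishes; on real motion the term would measure how much inter-frame structure the sparse student loses relative to the dense teacher.
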
Problem

Research questions and friction points this paper is trying to address.

High computational overhead and slow inference in 3D human pose estimation.
Distillation methods that overlook spatial relationships between joints and temporal correlations across frames.
The difficulty of balancing efficiency and accuracy in pose estimation frameworks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Correlation Input Sequence Downsampling reduces redundancy.
Dynamic Joint Spatial Attention Distillation enhances feature representation.
Temporal Consistency Distillation aligns temporal correlations.
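
As a rough illustration of how the student's features might be pulled toward the teacher's multi-frame context features, the sketch below compares student features on the sparse frames with teacher features at the same frame indices. The feature shapes, the linear projection `proj`, the MSE objective, and the function name `embedding_distillation_loss` are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np


def embedding_distillation_loss(student_feat, teacher_feat, kept_idx, proj):
    """MSE between projected student features and teacher features at kept frames.

    student_feat: (T_s, J, C_s) features computed on the sparse frames
    teacher_feat: (T, J, C_t)   features computed on the full sequence
    kept_idx:     (T_s,)        indices of the sparse frames in the full sequence
    proj:         (C_s, C_t)    linear map aligning channel widths
                                (would be learned in practice; random here)
    """
    projected = student_feat @ proj      # (T_s, J, C_t)
    target = teacher_feat[kept_idx]      # (T_s, J, C_t)
    return float(np.mean((projected - target) ** 2))


# Random features stand in for real network activations (assumption).
rng = np.random.default_rng(0)
T, J, C_t, C_s, stride = 27, 17, 8, 4, 3
teacher_feat = rng.normal(size=(T, J, C_t))
kept_idx = np.arange(0, T, stride)                    # 9 sparse frames
student_feat = rng.normal(size=(len(kept_idx), J, C_s))
proj = rng.normal(size=(C_s, C_t))

embed_loss = embedding_distillation_loss(student_feat, teacher_feat, kept_idx, proj)
```

In a full training setup a term like this would typically be added, with a weighting coefficient, to the pose regression loss and the other distillation terms.
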
Authors

Weihong Chen, South China University of Technology
Xuemiao Xu, South China University of Technology
Haoxin Yang, South China University of Technology
Yi Xie, South China University of Technology
Peng Xiao, South China University of Technology
Cheng Xu, The Hong Kong Polytechnic University
Huaidong Zhang, South China University of Technology (Computer Vision)
P. Heng, The Chinese University of Hong Kong