🤖 AI Summary
Existing 3D human pose estimation methods suffer from high computational cost and slow inference, while conventional knowledge distillation struggles to capture joint-wise spatial geometry and multi-frame temporal dependencies. To address these challenges, we propose a novel knowledge distillation framework for efficient and accurate 3D pose estimation. Our key contributions are: (1) sparse correlation-based input downsampling to reduce computational redundancy; (2) dynamic joint embedding distillation to explicitly model geometric constraints among joints; (3) adjacency-aware joint attention distillation to enhance local structural awareness; and (4) temporal consistency distillation to preserve motion coherence across frames. Integrated with sparse sequence sampling, multi-frame contextual feature distillation, and global upsampling supervision, our method achieves state-of-the-art performance on standard benchmarks (e.g., Human3.6M, MPI-INF-3DHP), improving inference speed by over 2× while maintaining superior pose accuracy, thereby achieving a favorable trade-off between efficiency and precision.
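The sparse correlation-based input downsampling in contribution (1) can be illustrated with a minimal sketch. The function name, the redundancy criterion (cosine similarity between consecutive frames), and the `keep_ratio` parameter are all assumptions for illustration; the paper's actual sampling scheme may differ.

```python
import numpy as np

def sparse_downsample(seq, keep_ratio=0.5):
    """Hypothetical sparse, correlation-based downsampling of a 2D pose
    sequence of shape (T, J, 2): frames most similar to their predecessor
    are treated as redundant and dropped first, so frames that carry new
    motion information are preferentially kept."""
    T = seq.shape[0]
    flat = seq.reshape(T, -1)
    # cosine similarity between each frame and its predecessor
    a, b = flat[1:], flat[:-1]
    sim = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    # always keep the first frame; rank the rest by informativeness
    n_keep = max(1, int(round(T * keep_ratio)))
    order = np.argsort(sim)  # least-similar (most informative) first
    kept = np.sort(np.concatenate([[0], 1 + order[:n_keep - 1]]))
    return seq[kept], kept
```

The student network would consume the shorter subsequence, cutting computation roughly in proportion to `keep_ratio` while preserving the frames that drive inter-frame correlation.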
📝 Abstract
Existing 3D Human Pose Estimation (HPE) methods achieve high accuracy but suffer from computational overhead and slow inference, while knowledge distillation methods fail to address spatial relationships between joints and temporal correlations in multi-frame inputs. In this paper, we propose Sparse Correlation and Joint Distillation (SCJD), a novel framework that balances efficiency and accuracy for 3D HPE. SCJD introduces Sparse Correlation Input Sequence Downsampling to reduce redundancy in student network inputs while preserving inter-frame correlations. For effective knowledge transfer, we propose Dynamic Joint Spatial Attention Distillation, which includes Dynamic Joint Embedding Distillation to enhance the student's feature representation using the teacher's multi-frame context features, and Adjacent Joint Attention Distillation to improve the student network's focus on adjacent joint relationships for better spatial understanding. Additionally, Temporal Consistency Distillation aligns the temporal correlations between the teacher and student networks through upsampling and global supervision. Extensive experiments demonstrate that SCJD achieves state-of-the-art performance. Code is available at https://github.com/wileychan/SCJD.
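The three distillation terms described above can be sketched as simple losses. This is a minimal illustration, not the paper's implementation: the loss forms (MSE on embeddings, masked MSE on attention maps over skeleton-adjacent joint pairs, and motion matching after linear temporal upsampling) and all function names are assumptions.

```python
import numpy as np

def embedding_distill_loss(f_s, f_t):
    """Dynamic Joint Embedding Distillation (sketch): pull student joint
    embeddings (B, J, C) toward the teacher's multi-frame context features."""
    return np.mean((f_s - f_t) ** 2)

def adjacent_attention_loss(att_s, att_t, adj):
    """Adjacent Joint Attention Distillation (sketch): match teacher and
    student attention maps (J, J) only where the 0/1 skeleton adjacency
    mask `adj` marks joints as neighbors."""
    mask = adj.astype(bool)
    return np.mean((att_s[mask] - att_t[mask]) ** 2)

def temporal_consistency_loss(pose_s, pose_t):
    """Temporal Consistency Distillation (sketch): upsample the student's
    sparse prediction (T_s, J, 3) to the teacher's frame count T_t by
    linear interpolation, then align frame-to-frame motion."""
    T_s, T_t = pose_s.shape[0], pose_t.shape[0]
    src = np.linspace(0, T_s - 1, T_t)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T_s - 1)
    w = (src - lo)[:, None, None]
    up = (1 - w) * pose_s[lo] + w * pose_s[hi]
    motion_s = np.diff(up, axis=0)      # first differences = per-frame motion
    motion_t = np.diff(pose_t, axis=0)
    return np.mean((motion_s - motion_t) ** 2)
```

In training, these terms would be added (with weighting hyperparameters) to the student's supervised pose loss; each vanishes when the student exactly reproduces the corresponding teacher signal.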