🤖 AI Summary
Existing high-precision human pose estimation models are computationally expensive, which makes it hard to achieve accuracy and efficiency at the same time. To address this, we propose a coarse-to-fine two-stage knowledge distillation framework. Our key contributions are: (1) a structure-aware joint loss that explicitly models geometric and semantic contextual relationships among keypoints; (2) an image-guided progressive graph convolutional network (IGP-GCN) that fuses visual features for fine-grained pose refinement; and (3) a progressive supervision training strategy to enhance the student model’s representational capacity and generalization. Evaluated on the COCO and CrowdPose benchmarks, our method compares favorably with state-of-the-art lightweight approaches, particularly on the challenging CrowdPose dataset with its severe occlusion and high crowd density, and achieves a favorable trade-off between accuracy and inference efficiency.
📝 Abstract
Human pose estimation has been widely applied in human-centric understanding and generation, but most existing state-of-the-art methods require heavy computational resources to make accurate predictions. One feasible way to obtain an accurate, robust, yet lightweight human pose estimator is to transfer pose knowledge from a powerful teacher model to a student model with fewer parameters through knowledge distillation. However, the traditional knowledge distillation framework does not fully exploit the contextual information among human joints. In this paper, we therefore propose a novel coarse-to-fine two-stage knowledge distillation framework for human pose estimation. In the first-stage distillation, we introduce a human joint structure loss that mines the structural information among human joints in order to transfer high-level semantic knowledge from the teacher model to the student model. In the second-stage distillation, we use an Image-Guided Progressive Graph Convolutional Network (IGP-GCN) to refine the initial human pose obtained from the first stage, and we supervise the training of the IGP-GCN progressively using the final output pose of the teacher model. Extensive experiments on two benchmark datasets, COCO keypoint and CrowdPose, show that the proposed method performs favorably against many existing state-of-the-art human pose estimation methods, with a more significant performance improvement on the more complex CrowdPose dataset.
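The abstract does not give the formula for the first-stage joint structure loss, so the following is only a minimal sketch of one plausible form, not the authors' definition. It assumes that teacher and student both output per-joint heatmaps, that joint coordinates are read out with a soft-argmax, and that "structure" is captured by the matrix of pairwise joint distances. The function names and the `structure_weight` parameter are hypothetical.

```python
import numpy as np


def soft_argmax(heatmaps):
    """Map heatmaps of shape (K, H, W) to expected (x, y) coordinates, shape (K, 2)."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(k, h, w)
    xs = (probs.sum(axis=1) * np.arange(w)).sum(axis=1)  # marginal over rows
    ys = (probs.sum(axis=2) * np.arange(h)).sum(axis=1)  # marginal over columns
    return np.stack([xs, ys], axis=1)


def pairwise_distances(coords):
    """Euclidean distance between every pair of joints, shape (K, K)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def distillation_loss(student_hm, teacher_hm, structure_weight=0.1):
    """Heatmap MSE plus an assumed structure term on pairwise joint distances."""
    # Standard per-pixel distillation: student heatmaps mimic teacher heatmaps.
    heatmap_term = np.mean((student_hm - teacher_hm) ** 2)
    # Assumed structure term: match inter-joint distances, which depend on the
    # joints jointly rather than on each joint in isolation.
    d_student = pairwise_distances(soft_argmax(student_hm))
    d_teacher = pairwise_distances(soft_argmax(teacher_hm))
    structure_term = np.mean((d_student - d_teacher) ** 2)
    return heatmap_term + structure_weight * structure_term


# Illustrative call with random data: 17 joints (as in COCO), 64x48 heatmaps.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(17, 64, 48))
student = teacher + 0.5 * rng.normal(size=(17, 64, 48))
loss_value = distillation_loss(student, teacher)
```

In this sketch the pairwise-distance term penalizes a student whose joints are individually close to the teacher's but whose overall skeleton shape differs, which is one way to read "structural information among human joints". A real training setup would use a differentiable framework rather than NumPy.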