Self-Supervised Keypoint Detection with Distilled Depth Keypoint Representation

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unsupervised keypoint detection methods rely on image reconstruction, lack depth awareness, and suffer from spurious keypoint detections in background regions. To address these limitations, we propose Distill-DKP, a novel cross-modal knowledge distillation framework that transfers supervision from depth maps (teacher) to RGB images (student). Distill-DKP performs embedding-level depth semantic transfer with explicit background suppression, integrating depth-map encoding, RGB-depth dual-stream feature alignment, and embedding-level loss constraints. Evaluated on Human3.6M, Distill-DKP reduces mean L2 error by 47.15%; on Taichi, it lowers mean average error by 5.67%; and on DeepFashion, it improves keypoint accuracy by 1.3%. Crucially, inference requires only the student RGB model; no depth maps or manual annotations are needed at test time. This work establishes a depth-aware paradigm for unsupervised keypoint learning via cross-modal distillation at the embedding level.

📝 Abstract
Existing unsupervised keypoint detection methods apply artificial deformations to images, such as masking a significant portion of an image, and use reconstruction of the original image as a learning objective to detect keypoints. However, this approach lacks depth information and often detects keypoints on the background. To address this, we propose Distill-DKP, a novel cross-modal knowledge distillation framework that leverages depth maps and RGB images for keypoint detection in a self-supervised setting. During training, Distill-DKP extracts embedding-level knowledge from a depth-based teacher model to guide an image-based student model, with inference restricted to the student. Experiments show that Distill-DKP significantly outperforms previous unsupervised methods, reducing mean L2 error by 47.15% on Human3.6M and mean average error by 5.67% on Taichi, and improving keypoint accuracy by 1.3% on the DeepFashion dataset. Detailed ablation studies demonstrate the sensitivity of knowledge distillation across different layers of the network. Project Page: https://23wm13.github.io/distill-dkp/
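The page does not specify the exact form of the embedding-level distillation objective. As a minimal sketch, one plausible choice is a cosine-distance loss between L2-normalized teacher (depth) and student (RGB) embeddings; the function name and formulation below are illustrative assumptions, not the paper's actual loss:

```python
import numpy as np

def embedding_distillation_loss(student_emb, teacher_emb):
    """Cosine-distance loss pulling student (RGB) embeddings toward
    frozen teacher (depth) embeddings. Both inputs have shape (N, D)."""
    s = student_emb / (np.linalg.norm(student_emb, axis=1, keepdims=True) + 1e-8)
    t = teacher_emb / (np.linalg.norm(teacher_emb, axis=1, keepdims=True) + 1e-8)
    # 1 - cosine similarity per sample, averaged over the batch
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))
```

Identical embeddings give a loss of 0 and orthogonal embeddings give 1, so minimizing this aligns the student's representation with the teacher's without requiring depth input at inference.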
Problem

Research questions and friction points this paper is trying to address.

Existing unsupervised keypoint detectors lack depth awareness and fire on background regions
Reducing keypoint localization error in unsupervised methods
Improving accuracy via cross-modal depth-to-RGB knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal knowledge distillation for keypoint detection
A depth-based teacher guides RGB representation learning at the embedding level
Self-supervised teacher-student training, with inference restricted to the student
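The teacher-student setup in the bullets above can be sketched with toy linear encoders. Everything here is an illustrative assumption (shapes, the shared input standing in for paired RGB/depth, and the plain L2 embedding-matching objective), not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the paper's encoders: a frozen depth-based
# teacher and a trainable image-based student, here just linear maps.
W_teacher = rng.standard_normal((16, 8))   # frozen after its own training
W_student = rng.standard_normal((16, 8))   # updated by distillation

X = rng.standard_normal((32, 16))          # stand-in for a paired RGB/depth batch
lr = 0.3

for step in range(1000):
    t = X @ W_teacher                      # teacher embeddings (no gradient)
    s = X @ W_student                      # student embeddings
    # Gradient of the L2 embedding-matching loss w.r.t. the student weights
    W_student -= lr * X.T @ (s - t) / len(X)

# After training, student embeddings track the teacher's closely
final_loss = float(np.mean((X @ W_student - X @ W_teacher) ** 2))
```

The key property mirrored here is that only the student is needed after training: the teacher's depth-derived knowledge has been baked into the student's weights.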