ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing knowledge distillation methods rely on teacher-driven static feature selection, which fails to adapt to the student model's dynamic learning state—limiting efficiency in time-sensitive applications such as autonomous driving that demand dense visual predictions (e.g., detection, segmentation). To address this, we propose ACAM-KD, a dynamic collaborative distillation framework featuring two novel components: Student–Teacher Cross-Attention Feature Fusion (STCA-FF) and Adaptive Spatial–Channel Masking (ASCM), enabling bidirectional interaction and personalized feature selection throughout the distillation process. On COCO2017 object detection, our method improves the ResNet-50 student's mAP by up to 1.4 points over the state of the art; on Cityscapes semantic segmentation, it boosts the MobileNetV2 student's mIoU by 3.09 points. The framework significantly enhances student model adaptability and distillation efficiency, establishing a new paradigm for real-time dense prediction tasks.

📝 Abstract
Dense visual prediction tasks, such as detection and segmentation, are crucial for time-critical applications (e.g., autonomous driving and video surveillance). While deep models achieve strong performance, their efficiency remains a challenge. Knowledge distillation (KD) is an effective model compression technique, but existing feature-based KD methods rely on static, teacher-driven feature selection, failing to adapt to the student's evolving learning state or leverage dynamic student-teacher interactions. To address these limitations, we propose Adaptive student-teacher Cooperative Attention Masking for Knowledge Distillation (ACAM-KD), which introduces two key components: (1) Student-Teacher Cross-Attention Feature Fusion (STCA-FF), which adaptively integrates features from both models for a more interactive distillation process, and (2) Adaptive Spatial-Channel Masking (ASCM), which dynamically generates importance masks to enhance both spatial and channel-wise feature selection. Unlike conventional KD methods, ACAM-KD adapts to the student's evolving needs throughout the entire distillation process. Extensive experiments on multiple benchmarks validate its effectiveness. For instance, on COCO2017, ACAM-KD improves object detection performance by up to 1.4 mAP over the state-of-the-art when distilling a ResNet-50 student from a ResNet-101 teacher. For semantic segmentation on Cityscapes, it boosts mIoU by 3.09 over the baseline with DeepLabV3-MobileNetV2 as the student model.
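The two components described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function names, shapes, softmax cross-attention, and the mean-magnitude/quantile scoring are not from the paper, whose exact formulation may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stca_ff(student_feat, teacher_feat):
    """Student-Teacher Cross-Attention Feature Fusion (hypothetical sketch).

    Student features act as queries over teacher features (keys/values),
    so the fused map reflects what the student currently attends to.
    Inputs are token-flattened feature maps of shape (N, C).
    """
    d = student_feat.shape[-1]
    attn = softmax(student_feat @ teacher_feat.T / np.sqrt(d))  # (N, N)
    return attn @ teacher_feat  # fused features, (N, C)

def ascm(fused_feat, h, w, keep_ratio=0.5):
    """Adaptive Spatial-Channel Masking (hypothetical sketch).

    Scores spatial positions and channels by mean feature magnitude of the
    fused map, then keeps the top fraction of each, yielding a binary mask
    that could weight the distillation loss.
    """
    fmap = fused_feat.reshape(h, w, -1)
    spatial_score = np.abs(fmap).mean(axis=-1)       # (h, w)
    channel_score = np.abs(fmap).mean(axis=(0, 1))   # (C,)
    s_thr = np.quantile(spatial_score, 1 - keep_ratio)
    c_thr = np.quantile(channel_score, 1 - keep_ratio)
    # Broadcast spatial and channel selections into one (h, w, C) mask.
    return (spatial_score[..., None] >= s_thr) & (channel_score >= c_thr)
```

Because the mask is recomputed from the student-conditioned fused features at each step, feature selection tracks the student's evolving state rather than being fixed by the teacher up front, which is the core contrast the abstract draws with static KD.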
Problem

Research questions and friction points this paper is trying to address.

Existing feature-based KD relies on static, teacher-driven feature selection.
Distillation does not adapt to the student's evolving learning state.
Deep dense-prediction models remain too costly for time-critical applications.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Student-Teacher Cross-Attention Feature Fusion (STCA-FF) for interactive, bidirectional distillation
Adaptive Spatial-Channel Masking (ASCM) that dynamically generates importance masks
Feature selection that follows the student's needs throughout the distillation process