🤖 AI Summary
To address the high computational cost and deployment challenges of Transformer-based image restoration models, this paper proposes a soft knowledge distillation framework for lightweight image restoration. The method introduces three key innovations: (1) a Multi-dimensional Cross-net Attention (MCA) mechanism that jointly models implicit cross-channel and cross-spatial attention relationships between the teacher and student networks; (2) a Gaussian kernel distance metric that measures student-teacher feature discrepancy in kernel space, improving the fidelity of attention alignment; and (3) an image-level contrastive learning loss that replaces the conventional L1 or KL-divergence loss to enhance semantic consistency. Evaluated on deraining, deblurring, and denoising tasks, the distilled student model achieves average reductions of 62% in parameter count and 58% in FLOPs, with only marginal PSNR/SSIM degradation (0.15–0.32 dB / 0.002–0.005), closely matching teacher performance.
📝 Abstract
Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity, reflected in elevated FLOPs and parameter counts, limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL divergence loss with a contrastive learning loss at the image level. Experiments on three tasks (image deraining, deblurring, and denoising) demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.
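To make the two loss ideas concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the exact kernel bandwidth, feature shapes, and contrastive formulation are assumptions. The first function measures student-teacher feature distance in the reproducing kernel space of a Gaussian kernel, using the identity ||φ(x) − φ(y)||² = 2 − 2k(x, y). The second illustrates one common image-level contrastive formulation for restoration, pulling the student output toward the teacher's reconstruction (positive) while pushing it away from the degraded input (negative).

```python
import numpy as np

def gaussian_kernel_distance(f_student, f_teacher, sigma=1.0):
    """Squared distance between features mapped into the RKHS of a
    Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    Since k(x, x) = 1, the kernel-space distance is 2 - 2 * k(x, y),
    which is bounded in [0, 2] and saturates for very distant features."""
    sq_dist = np.sum((f_student - f_teacher) ** 2)
    k = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return 2.0 - 2.0 * k

def contrastive_restoration_loss(student_out, teacher_out, degraded, eps=1e-8):
    """Illustrative image-level contrastive loss (assumed form): minimize
    the ratio of the student's distance to the positive (teacher output)
    over its distance to the negative (degraded input)."""
    pos = np.mean(np.abs(student_out - teacher_out))   # pull toward teacher
    neg = np.mean(np.abs(student_out - degraded))      # push from degraded input
    return pos / (neg + eps)
```

The bounded kernel-space distance is one plausible reading of the "stable and efficient feature learning" claim: unlike a raw L2 distance, its gradient shrinks for outlier feature pairs, damping unstable updates early in training.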