🤖 AI Summary
In knowledge distillation (KD), insufficient knowledge transfer arises from the learning-capacity gap between teacher and student models. To address this, the paper proposes Expandable Residual Approximation (ERA), a framework inspired by the progressive approximation principle behind the Stone–Weierstrass theorem. ERA employs a Multi-Branched Residual Network (MBRNet) to decompose the approximation of residual knowledge into multiple steps, reducing the difficulty of imitating the teacher's representation in a single step. In addition, a Teacher Weight Integration (TWI) strategy mitigates the capacity disparity by reusing the teacher's head weights. This divide-and-conquer, hierarchical knowledge-transfer paradigm yields consistent improvements: +1.41% Top-1 accuracy on ImageNet classification and +1.40 AP on MS COCO object detection, with leading performance across multiple vision benchmarks.
📝 Abstract
Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone–Weierstrass theorem, we propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher's representation through a divide-and-conquer approach. Specifically, ERA employs a Multi-Branched Residual Network (MBRNet) to implement this residual knowledge decomposition. Additionally, a Teacher Weight Integration (TWI) strategy is introduced to mitigate the capacity disparity by reusing the teacher's head weights. Extensive experiments show that ERA improves Top-1 accuracy on the ImageNet classification benchmark by 1.41% and AP on the MS COCO object detection benchmark by 1.40, while also achieving leading performance across other computer vision tasks. Code and models are available at https://github.com/Zhaoyi-Yan/ERA.
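The core idea of decomposing residual knowledge into multiple approximation steps can be illustrated with a minimal toy sketch. Note that this is an assumption-laden illustration of the divide-and-conquer principle only: the `residual_approximation` function, the `rate` parameter, and the fractional-absorption behaviour are hypothetical stand-ins, not the paper's actual MBRNet or training procedure.

```python
def residual_approximation(teacher, student, num_branches=3, rate=0.5):
    """Toy sketch of multi-step residual approximation.

    Each hypothetical branch is asked to mimic only the *remaining*
    teacher-student residual, so the gap shrinks step by step instead
    of being closed in one hard single-step imitation.
    """
    outputs = list(student)
    gap_history = []
    for _ in range(num_branches):
        # Residual the next branch is asked to approximate.
        residual = [t - s for t, s in zip(teacher, outputs)]
        # Toy branch with limited capacity: captures only a fraction
        # (`rate`) of the residual, mirroring the capacity gap.
        outputs = [s + rate * r for s, r in zip(outputs, residual)]
        # Track the worst-case remaining gap after this branch.
        gap_history.append(max(abs(t - s) for t, s in zip(teacher, outputs)))
    return outputs, gap_history


# Example: the gap to the teacher halves with each added branch.
teacher_logits = [1.0, 2.0, 3.0]
student_logits = [0.0, 0.0, 0.0]
_, gaps = residual_approximation(teacher_logits, student_logits)
print(gaps)  # → [1.5, 0.75, 0.375]
```

Even with identical per-branch capacity, stacking branches drives the cumulative output toward the teacher, which is the intuition the Stone–Weierstrass-style progressive approximation argument formalizes.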