🤖 AI Summary
Multi-teacher knowledge distillation improves students by exposing them to diverse teacher perspectives, but maintaining multiple teachers incurs high computational overhead and deployment complexity. Method: This paper proposes a cost-efficient, single-teacher knowledge augmentation framework for distillation: it attaches multiple parallel branches to one teacher model to generate diverse multi-view knowledge, without requiring auxiliary teacher networks. Two angular diversity objectives shape the augmented views: a constrained inter-angle diversify loss, which maximizes the angles between views while keeping each view close to the original teacher output, and an intra-angle diversify loss, which encourages the views to distribute evenly around that output. The ensemble of these angularly diverse views, together with the original teacher output, is distilled into the student. Contribution/Results: Theoretical analysis shows that the objectives increase diversity among ensemble members and thereby reduce the upper bound of the ensemble's expected loss. Experiments demonstrate consistent gains over an existing knowledge augmentation method across diverse student architectures and benchmarks. Moreover, the framework is plug-and-play: it integrates into mainstream distillation pipelines while improving generalization performance.
📝 Abstract
Knowledge Distillation (KD) aims to train a lightweight student model by transferring knowledge from a large, high-capacity teacher. Recent studies have shown that leveraging diverse teacher perspectives can significantly improve distillation performance; however, achieving such diversity typically requires multiple teacher networks, leading to high computational costs. In this work, we propose a novel cost-efficient knowledge augmentation method for KD that generates diverse multi-views by attaching multiple branches to a single teacher. To ensure meaningful semantic variation across multi-views, we introduce two angular diversity objectives: 1) constrained inter-angle diversify loss, which maximizes angles between augmented views while preserving proximity to the original teacher output, and 2) intra-angle diversify loss, which encourages an even distribution of views around the original output. The ensembled knowledge from these angularly diverse views, along with the original teacher, is distilled into the student. We further theoretically demonstrate that our objectives increase the diversity among ensemble members and thereby reduce the upper bound of the ensemble's expected loss, leading to more effective distillation. Experimental results show that our method surpasses an existing knowledge augmentation method across diverse configurations. Moreover, the proposed method is compatible with other KD frameworks in a plug-and-play fashion, providing consistent improvements in generalization performance.
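The two angular objectives described in the abstract can be sketched numerically. The toy implementation below is our own illustration, not the paper's definition: the function names, the hinge-style proximity constraint (`margin`), and the variance-of-consecutive-angles form of the intra-angle term are all assumptions. It treats each augmented view's deviation from the original teacher output as a vector, rewards large pairwise angles between deviations (inter-angle), and penalizes uneven angular spacing around the teacher output (intra-angle).

```python
import numpy as np

def angle(u, v):
    # Angle between two vectors in radians, clipped for numerical safety.
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def inter_angle_diversify_loss(teacher, views, margin=1.0):
    """Sketch of the constrained inter-angle objective (our formulation):
    minimize the negative mean pairwise angle between view deviations
    (i.e., maximize angular diversity), plus a hinge penalty that keeps
    each view within `margin` of the original teacher output."""
    devs = [v - teacher for v in views]
    n = len(devs)
    pair_term = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            pair_term -= angle(devs[i], devs[j])  # larger angles -> lower loss
    prox_term = sum(max(0.0, np.linalg.norm(d) - margin) for d in devs)
    return pair_term / max(1, n * (n - 1) // 2) + prox_term

def intra_angle_diversify_loss(teacher, views):
    """Sketch of the intra-angle objective (our formulation): penalize the
    variance of angles between consecutive deviations, so the augmented
    views spread evenly around the original teacher output."""
    devs = [v - teacher for v in views]
    n = len(devs)
    angles = [angle(devs[i], devs[(i + 1) % n]) for i in range(n)]
    return float(np.var(angles))
```

As a sanity check, three views placed 120 degrees apart around the teacher output yield a near-zero intra-angle loss and a lower (better) inter-angle loss than three tightly clustered views, matching the intuition that evenly spread, widely separated views carry more complementary knowledge.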