Revisiting Cross-Architecture Distillation: Adaptive Dual-Teacher Transfer for Lightweight Video Models

📅 2025-11-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In cross-architecture knowledge distillation (CAKD), structural mismatch between Vision Transformer (ViT) teachers and lightweight CNN students impedes efficient knowledge transfer. To address this, we propose a dual-teacher collaborative distillation framework that jointly leverages heterogeneous ViT and homogeneous CNN teachers. Our method introduces a prediction-difference-driven dynamic weighting mechanism, structural-discrepancy-aware residual feature distillation, and a lightweight auxiliary branch. By explicitly modeling and transferring architecture-agnostic discrepancy knowledge, it mitigates feature-space misalignment between teacher and student. Extensive experiments on HMDB51, EPIC-KITCHENS-100, and Kinetics-400 demonstrate state-of-the-art performance: our approach outperforms existing CAKD methods across all benchmarks, achieving up to a 5.95% absolute accuracy gain on HMDB51, significantly narrowing the performance gap for lightweight CNNs in video action recognition.

๐Ÿ“ Abstract
Vision Transformers (ViTs) have achieved strong performance in video action recognition, but their high computational cost limits their practicality. Lightweight CNNs are more efficient but suffer from accuracy gaps. Cross-Architecture Knowledge Distillation (CAKD) addresses this by transferring knowledge from ViTs to CNNs, yet existing methods often struggle with architectural mismatch and overlook the value of stronger homogeneous CNN teachers. To tackle these challenges, we propose a Dual-Teacher Knowledge Distillation framework that leverages both a heterogeneous ViT teacher and a homogeneous CNN teacher to collaboratively guide a lightweight CNN student. We introduce two key components: (1) Discrepancy-Aware Teacher Weighting, which dynamically fuses the predictions from ViT and CNN teachers by assigning adaptive weights based on teacher confidence and prediction discrepancy with the student, enabling more informative and effective supervision; and (2) a Structure Discrepancy-Aware Distillation strategy, where the student learns the residual features between ViT and CNN teachers via a lightweight auxiliary branch, focusing on transferable architectural differences without mimicking all of ViT's high-dimensional patterns. Extensive experiments on benchmarks including HMDB51, EPIC-KITCHENS-100, and Kinetics-400 demonstrate that our method consistently outperforms state-of-the-art distillation approaches, achieving notable performance improvements with a maximum accuracy gain of 5.95% on HMDB51.
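The Discrepancy-Aware Teacher Weighting described above can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the use of max-softmax probability as the confidence term, and per-sample KL divergence as the student-teacher discrepancy term are all assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def fuse_teacher_logits(vit_logits, cnn_logits, student_logits, tau=1.0):
    """Hypothetical sketch of discrepancy-aware teacher weighting.

    Each teacher's per-sample weight grows with its confidence (max
    softmax probability) and with its KL discrepancy from the student's
    current prediction, so the more informative teacher dominates the
    fused soft target for that sample.
    """
    log_s = F.log_softmax(student_logits / tau, dim=-1)
    weights, probs = [], []
    for t_logits in (vit_logits, cnn_logits):
        p = F.softmax(t_logits / tau, dim=-1)
        confidence = p.max(dim=-1).values                        # (B,)
        # KL(teacher || student), summed over classes, per sample
        discrepancy = F.kl_div(log_s, p, reduction="none").sum(-1)
        weights.append(confidence * discrepancy)
        probs.append(p)
    w = torch.stack(weights, dim=0)                              # (2, B)
    w = w / w.sum(dim=0, keepdim=True).clamp_min(1e-8)           # normalize
    fused = w[0].unsqueeze(-1) * probs[0] + w[1].unsqueeze(-1) * probs[1]
    return fused  # soft target for the student's distillation loss
```

The fused distribution would then replace a single teacher's softened output in a standard KD loss; the normalization keeps the result a valid probability distribution per sample.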
Problem

Research questions and friction points this paper is trying to address.

Transferring knowledge from Vision Transformers to lightweight CNNs for video action recognition
Addressing architectural mismatch between heterogeneous teacher-student models in distillation
Improving lightweight CNN accuracy without increasing computational costs significantly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-teacher framework with ViT and CNN teachers
Dynamic teacher weighting based on confidence and discrepancy
Residual feature learning for architectural differences
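The residual-feature idea in the list above could be sketched as below. Everything here is illustrative: the class name, the feature dimensions, the linear projections into a shared space, and the MSE objective are assumptions for the sketch, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualFeatureDistiller(nn.Module):
    """Hypothetical sketch of structure-discrepancy-aware distillation.

    Rather than forcing the student to mimic the full high-dimensional
    ViT feature map, a lightweight auxiliary branch on the student
    predicts the *residual* between (projected) ViT and CNN teacher
    features, isolating the architecture-specific knowledge.
    """

    def __init__(self, student_dim=256, vit_dim=768, cnn_dim=512, hidden=128):
        super().__init__()
        # Project both teachers into a shared space before differencing.
        self.vit_proj = nn.Linear(vit_dim, hidden)
        self.cnn_proj = nn.Linear(cnn_dim, hidden)
        # Lightweight auxiliary branch on top of student features.
        self.aux = nn.Sequential(
            nn.Linear(student_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, student_feat, vit_feat, cnn_feat):
        # Architecture-discrepancy signal: ViT minus CNN teacher features.
        residual = self.vit_proj(vit_feat) - self.cnn_proj(cnn_feat)
        pred = self.aux(student_feat)
        # Teachers are frozen targets, hence the detach().
        return F.mse_loss(pred, residual.detach())
```

In training, this auxiliary loss would be added to the weighted-logit distillation loss and the task loss; the auxiliary branch can be discarded at inference, keeping the deployed student lightweight.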
Ying Peng, South China University of Technology
Hongsen Ye, South China University of Technology
Changxin Huang, Shenzhen University, Assistant Professor (Robotics, Reinforcement Learning)
Xiping Hu, Professor, Beijing Institute of Technology (Cyber-Physical Systems, Crowd Computing, Affective Computing)
Jian Chen, South China University of Technology
Runhao Zeng, Artificial Intelligence Research Institute, Shenzhen MSU-BIT University