Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

📅 2026-04-10
🤖 AI Summary
This work addresses the challenge that existing memory-efficient transfer learning methods incur additional overhead during inference due to retained lightweight subnetworks, making it difficult to simultaneously optimize fine-tuning and inference efficiency. To resolve this, the authors propose a Masked Dual-Path Distillation framework: during fine-tuning, the backbone network is frozen while a learnable subnetwork engages in bidirectional feature-level knowledge distillation with it; at inference time, the subnetwork is entirely removed, enabling lossless acceleration. This approach is the first to unify high efficiency in both fine-tuning and inference, supporting multi-layer encoders and cross-modal adaptation. Evaluated across vision, language, and multimodal tasks, it achieves at least a 25.2% speedup in inference, matches state-of-the-art methods in parameter and memory overhead, and significantly improves accuracy.

📝 Abstract
Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid gradient backpropagation through large backbones, thus significantly reducing both the number of trainable parameters and the memory consumed during fine-tuning. However, since they typically employ a lightweight, learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address this issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD), which uses fading side networks to accelerate inference while retaining parameter and memory efficiency in fine-tuning. Specifically, MDPD develops a framework that enhances performance by mutually distilling the frozen backbones and learnable side networks during fine-tuning, and discards the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for encoder structures with multiple layers. Extensive experiments on distinct backbones across vision-only, language-only, and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2% while keeping parameter and memory consumption comparable, but also markedly improves accuracy compared to SOTA approaches. The source code is available at https://github.com/Zhang-VKk/MDPD.
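
The abstract mentions feature-based distillation over a multi-layer encoder but gives no formula. As a rough illustration only, the sketch below computes a generic masked feature-matching loss in plain Python: a mean squared error between the per-layer features of two networks, skipping randomly masked token positions. The function name `masked_feature_mse`, the `mask_ratio` parameter, the random token masking, and the toy features are all assumptions made for illustration; none of them is taken from the paper or its repository, and the actual MDPD objective may differ substantially.

```python
import random


def masked_feature_mse(feats_a, feats_b, mask_ratio=0.5, seed=0):
    """Average squared difference between two stacks of per-layer token features.

    feats_a, feats_b: list of layers; each layer is a list of token vectors
    (lists of floats) with matching shapes. Each token position is skipped
    with probability `mask_ratio`. Hypothetical helper, not the MDPD loss.
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for layer_a, layer_b in zip(feats_a, feats_b):
        for tok_a, tok_b in zip(layer_a, layer_b):
            if rng.random() < mask_ratio:
                continue  # masked token: excluded from the loss
            for a, b in zip(tok_a, tok_b):
                total += (a - b) ** 2
                count += 1
    # If every token was masked, report zero rather than dividing by zero.
    return total / count if count else 0.0


# Toy features (assumed values): 2 layers, 3 tokens per layer, 2 dimensions.
backbone_feats = [[[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]],
                  [[0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]]
side_feats = [[[0.0, 0.0], [0.5, 0.5], [3.0, 0.0]],
              [[0.0, 1.0], [1.0, 2.0], [1.0, 3.0]]]

full_loss = masked_feature_mse(backbone_feats, side_feats, mask_ratio=0.0)
masked_loss = masked_feature_mse(backbone_feats, side_feats, mask_ratio=0.5)
print(full_loss, masked_loss)
```

In a real training loop a loss of this kind would be computed on tensors in an autodiff framework, with gradients flowing only into the learnable side network while the backbone stays frozen. How the two distillation paths are combined is not specified in the text above, so it is left out of this sketch.
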
Problem

Research questions and friction points this paper is trying to address.

Memory-efficient transfer learning
Side networks
Inference overhead
Parameter efficiency
Knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Dual Path Distillation
Memory-Efficient Transfer Learning
Fading Side Networks
Feature-Based Knowledge Distillation
Inference Acceleration

Authors

Yutong Zhang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China; School of Computer Science and Engineering, Beihang University, China
Jiaxin Chen
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China; School of Computer Science and Engineering, Beihang University, China
Honglin Chen
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China; School of Computer Science and Engineering, Beihang University, China
Kaiqi Zheng
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China; School of Computer Science and Engineering, Beihang University, China
Shengcai Liao
College of Information Technology, United Arab Emirates University, United Arab Emirates
Hanwen Zhong
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China; School of Computer Science and Engineering, Beihang University, China
Weixin Li
Associate Professor, Beihang University
Computer Vision, Big Data Analytics
Yunhong Wang
Professor, School of Computer Science and Engineering, Beihang University
Biometrics, Pattern Recognition, Image Processing, Computer Vision