SLAD: Shared LoRA Adapters for Task Specific Distillation

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses ineffective knowledge transfer in task-specific distillation, which the authors attribute to misaligned feature representations between a fine-tuned teacher model and its student. They propose a joint training framework that adapts both encoders with shared low-rank adapters (LoRA) in place of conventional fine-tuning or linear probing, which improves feature alignment between teacher and student. The approach improves the student's performance and also the teacher's accuracy, while training 2x faster than fine-tuning. Evaluated on multiple image classification and segmentation benchmarks, the method achieves state-of-the-art results in task-specific distillation.
📝 Abstract
In the context of resource-constrained environments such as embedded systems, adapting reduced-size foundation models to downstream tasks has become increasingly popular. This has recently motivated the emerging setting of task-specific distillation, where a larger and a smaller version of the same foundation model are both adapted to the same downstream task, with the goal of transferring knowledge from the former to the latter. Recent work has demonstrated the benefits of using a larger version of the same foundation model to assist the adaptation of a smaller one. Typically, the larger model (teacher) is first adapted via fine-tuning or linear probing before its knowledge is distilled into the smaller model (student). While fine-tuning the teacher often increases its performance, recent work showed that probing it leads to better knowledge distillation to the student. Our findings show that this is mainly due to a misalignment in feature representation between the teacher and the student, which occurs during the teacher's fine-tuning. Inspired by existing efforts to preserve previously learned knowledge, we first propose leveraging low-rank adaptation, resulting in better feature alignment and therefore better knowledge transfer. Drawing from this insight, we further enhance the feature alignment through a parameter-sharing strategy of the adapters between the two encoders during joint training. Our proposed method, SLAD, shows better feature alignment between the teacher and student, which results in increased performance for not only the student but also the teacher model, while being 2x faster to train than fine-tuning. Through extensive experiments on multiple classification and segmentation datasets, we demonstrate the improved accuracy and transfer efficiency of our method, achieving state-of-the-art performance in the task-specific distillation framework.
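The adapter-sharing idea in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation. It assumes, purely for simplicity, that one teacher layer and one student layer have the same width `d` (the abstract does not say how adapters are shared across encoders of different sizes), and the names `W_teacher`, `W_student`, `A`, `B`, and `adapted_forward` are hypothetical. It only shows the standard LoRA form, a frozen weight plus a low-rank update `B @ A`, with a single pair of factors reused by both layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # hypothetical layer width and LoRA rank

# Frozen pretrained weights of one layer in each encoder (toy stand-ins).
W_teacher = rng.normal(size=(d, d))
W_student = rng.normal(size=(d, d))

# One pair of low-rank factors, shared by both encoders (assumption: equal widths).
# Common LoRA initialization: A random, B zero, so the update B @ A starts at zero.
A = rng.normal(scale=0.01, size=(r, d))
B = np.zeros((d, r))

def adapted_forward(W_frozen, x, A, B):
    """Frozen linear layer plus the low-rank update: x @ (W + B A)^T."""
    return x @ (W_frozen + B @ A).T

x = rng.normal(size=(4, d))  # toy batch of 4 inputs

# At initialization the shared adapter leaves both layers unchanged.
noop_at_init = np.allclose(adapted_forward(W_teacher, x, A, B), x @ W_teacher.T)

# Stand-in for a training step: perturb the shared factor B.
B = B + rng.normal(scale=0.1, size=B.shape)

# Because A and B are shared, both layers receive the same additive feature shift.
delta_teacher = adapted_forward(W_teacher, x, A, B) - x @ W_teacher.T
delta_student = adapted_forward(W_student, x, A, B) - x @ W_student.T
shared_update_matches = np.allclose(delta_teacher, delta_student)

# Trainable parameters in the adapter versus one full weight matrix.
trainable_params = A.size + B.size
full_params = W_teacher.size
```

In this toy setting only `A` and `B` would receive gradients, so the pretrained weights stay fixed and both encoders move through the same low-rank subspace. That is one way to read the abstract's claim that sharing adapters improves feature alignment.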
Problem

Research questions and friction points this paper is trying to address.

task-specific distillation
feature alignment
knowledge distillation
foundation models
resource-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA
knowledge distillation
feature alignment
parameter sharing
task-specific distillation