🤖 AI Summary
To address the challenge of balancing inference latency and accuracy when deploying large models under resource constraints—where full-model loading incurs high startup overhead and slow initial inference—this paper proposes a progressive weight loading method. At startup, only a compact subnetwork is loaded to enable rapid initial inference; subsequently, weights are dynamically replaced layer by layer to converge toward the teacher model’s performance. Unlike conventional knowledge distillation, this approach tightly couples distillation objectives with the loading process: intermediate feature alignment guides incremental weight replacement, enabling continuous accuracy improvement without compromising low-latency initialization. The method is architecture-agnostic, supporting VGG, ResNet, ViT, and others. Experiments demonstrate that initial inference speed matches that of lightweight models (e.g., MobileNet), while final accuracy approaches that of the full teacher model. In dynamic loading scenarios, it significantly improves the latency–accuracy trade-off.
📝 Abstract
Deep learning models have become increasingly large and complex, resulting in higher memory consumption and computational demands. Consequently, model loading times and initial inference latency have increased, posing significant challenges in mobile and latency-sensitive environments where frequent model loading and unloading are required, which directly impacts user experience. While Knowledge Distillation (KD) offers a solution by compressing large teacher models into smaller student ones, it often comes at the cost of reduced performance. To address this trade-off, we propose Progressive Weight Loading (PWL), a novel technique that enables fast initial inference by first deploying a lightweight student model, then incrementally replacing its layers with those of a pre-trained teacher model. To support seamless layer substitution, we introduce a training method that not only aligns intermediate feature representations between student and teacher layers, but also improves the overall output performance of the student model. Our experiments on VGG, ResNet, and ViT architectures demonstrate that models trained with PWL maintain competitive distillation performance and gradually improve accuracy as teacher layers are loaded, ultimately matching the final accuracy of the full teacher model without compromising initial inference speed. This makes PWL particularly suited for dynamic, resource-constrained deployments where both responsiveness and performance are critical.
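The core serving mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and method names (`ProgressiveModel`, `load_next_teacher_layer`) are hypothetical, layers are stand-in callables rather than real trained weights, and the feature-alignment training that makes mixed student/teacher stacks compatible is assumed to have already happened.

```python
# Illustrative sketch of Progressive Weight Loading (PWL).
# Assumption: a model is a pipeline of per-layer callables, and student/teacher
# layers are interchangeable because training aligned their intermediate features.

class ProgressiveModel:
    def __init__(self, student_layers, teacher_layers):
        # Serve immediately with the lightweight student layers.
        self.active = list(student_layers)
        self.teacher = list(teacher_layers)
        self.loaded = 0  # number of teacher layers swapped in so far

    def load_next_teacher_layer(self):
        # Incrementally replace one student layer with the corresponding
        # teacher layer (e.g., as its weights finish loading from disk).
        if self.loaded < len(self.teacher):
            self.active[self.loaded] = self.teacher[self.loaded]
            self.loaded += 1

    def forward(self, x):
        for layer in self.active:
            x = layer(x)
        return x

# Toy stand-ins for layers with different weights.
student = [lambda x: x * 0.5, lambda x: x * 0.5]
teacher = [lambda x: x * 2.0, lambda x: x * 2.0]

m = ProgressiveModel(student, teacher)
print(m.forward(1.0))          # all-student model: 0.25
m.load_next_teacher_layer()
print(m.forward(1.0))          # mixed model: 1.0
m.load_next_teacher_layer()
print(m.forward(1.0))          # full teacher: 4.0
```

Inference is available from the first call, and each swap moves output quality toward the teacher's, which is the latency–accuracy trade-off PWL targets.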