🤖 AI Summary
In low-bit quantization-aware training (QAT) combined with knowledge distillation (KD), optimization conflicts arise from the heterogeneous gradient magnitudes of the task and distillation losses. Method: We propose the Game of Regularizer (GoR), a lightweight dynamic regularization mechanism with only two learnable parameters that adaptively balances the two supervision signals to mitigate gradient conflict. Building on GoR, we introduce QAT-EKD-GoR, a unified framework supporting ensemble KD from multiple heterogeneous teachers. Contribution/Results: QAT-EKD-GoR achieves state-of-the-art performance across image classification, object detection, and large language model compression. In several cases it even surpasses full-precision baselines in accuracy while significantly improving inference efficiency on edge devices, effectively reconciling high accuracy with ultra-low power consumption.
📝 Abstract
Quantization-aware training (QAT) combined with knowledge distillation (KD) is a promising strategy for compressing Artificial Intelligence (AI) models for deployment on resource-constrained hardware. However, existing QAT-KD methods often struggle to balance task-specific (TS) and distillation losses due to heterogeneous gradient magnitudes, especially under low-bit quantization. We propose Game of Regularizer (GoR), a novel learnable regularization method that adaptively balances TS and KD objectives using only two trainable parameters for dynamic loss weighting. GoR reduces conflict between supervision signals, improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large language model (LLM) compression show that GoR consistently outperforms state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster inference while maintaining full-precision accuracy. We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models. Under optimal conditions, QAT-EKD-GoR can even outperform full-precision models, providing a robust solution for real-world deployment.
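The abstract does not give GoR's exact formulation, only that two trainable scalars dynamically weight the TS and KD losses. As a rough illustration of why such a scheme can work, the sketch below uses a generic uncertainty-style weighting (each loss scaled by `exp(-s)` for a learnable scalar `s`, plus a penalty that keeps weights from collapsing to zero); the function names and the toy loss magnitudes are assumptions for illustration, not the paper's method.

```python
import math

# Hypothetical two-parameter dynamic loss weighting (NOT the published
# GoR formulation, which the abstract does not specify). Each loss is
# scaled by exp(-s) for a learnable scalar s; the additive "+ s" terms
# prevent the trivial solution of driving both weights to zero.

def combined_loss(l_ts: float, l_kd: float, a: float, b: float) -> float:
    return math.exp(-a) * l_ts + math.exp(-b) * l_kd + a + b

def grads(l_ts: float, l_kd: float, a: float, b: float):
    # d/da [exp(-a) * l_ts + a] = 1 - exp(-a) * l_ts, likewise for b
    return 1.0 - math.exp(-a) * l_ts, 1.0 - math.exp(-b) * l_kd

# Toy run: task loss and KD loss with a 100x magnitude gap, mimicking
# the heterogeneous gradient scales described in the abstract.
a = b = 0.0
for _ in range(5000):
    ga, gb = grads(10.0, 0.1, a, b)
    a -= 0.01 * ga
    b -= 0.01 * gb

# At the optimum exp(-a) -> 1/l_ts and exp(-b) -> 1/l_kd, so the two
# weighted contributions end up equalized despite the magnitude gap.
```

The point of the toy run is that the learned weights settle at values inversely proportional to each loss's magnitude, so neither supervision signal dominates the shared gradients, which is the conflict the paper targets under low-bit quantization.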