🤖 AI Summary
In low-bit quantization-aware training (QAT) combined with knowledge distillation (KD), optimization conflicts arise from the heterogeneous gradient magnitudes of the task and distillation losses. Method: We propose the Game of Regularizer (GoR), a lightweight dynamic regularization mechanism with only two learnable parameters that adaptively balances the two supervision signals to mitigate gradient conflict. Building on GoR, we introduce QAT-EKD-GoR, a unified framework supporting ensemble KD from multiple heterogeneous teachers. Contribution/Results: QAT-EKD-GoR achieves state-of-the-art performance across image classification, object detection, and large language model compression. In several cases it even surpasses full-precision baselines in accuracy while significantly improving inference efficiency on edge devices, effectively reconciling high accuracy with ultra-low power consumption.
📝 Abstract
Quantization-aware training (QAT) combined with knowledge distillation (KD) is a promising strategy for compressing Artificial Intelligence (AI) models for deployment on resource-constrained hardware. However, existing QAT-KD methods often struggle to balance task-specific (TS) and distillation losses due to heterogeneous gradient magnitudes, especially under low-bit quantization. We propose Game of Regularizer (GoR), a novel learnable regularization method that adaptively balances TS and KD objectives using only two trainable parameters for dynamic loss weighting. GoR reduces conflict between supervision signals, improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large language model (LLM) compression show that GoR consistently outperforms state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster inference while maintaining full-precision accuracy. We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models. Under optimal conditions, QAT-EKD-GoR can even outperform full-precision models, providing a robust solution for real-world deployment.
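The abstract does not give GoR's exact formulation, only that two trainable scalars dynamically weight the TS and KD losses. As a rough illustration of why such a scheme can work, the sketch below uses a generic uncertainty-style weighting (each loss scaled by `exp(-s)` for a learnable scalar `s`, plus a penalty that keeps weights from collapsing to zero); the function names and the toy loss magnitudes are assumptions for illustration, not the paper's method.

```python
import math

# Hypothetical two-parameter dynamic loss weighting (NOT the published
# GoR formulation, which the abstract does not specify). Each loss is
# scaled by exp(-s) for a learnable scalar s; the additive "+ s" terms
# prevent the trivial solution of driving both weights to zero.

def combined_loss(l_ts: float, l_kd: float, a: float, b: float) -> float:
    return math.exp(-a) * l_ts + math.exp(-b) * l_kd + a + b

def grads(l_ts: float, l_kd: float, a: float, b: float):
    # d/da [exp(-a) * l_ts + a] = 1 - exp(-a) * l_ts, likewise for b
    return 1.0 - math.exp(-a) * l_ts, 1.0 - math.exp(-b) * l_kd

# Toy run: task loss and KD loss with a 100x magnitude gap, mimicking
# the heterogeneous gradient scales described in the abstract.
a = b = 0.0
for _ in range(5000):
    ga, gb = grads(10.0, 0.1, a, b)
    a -= 0.01 * ga
    b -= 0.01 * gb

# At the optimum exp(-a) -> 1/l_ts and exp(-b) -> 1/l_kd, so the two
# weighted contributions end up equalized despite the magnitude gap.
```

The point of the toy run is that the learned weights settle at values inversely proportional to each loss's magnitude, so neither supervision signal dominates the shared gradients, which is the conflict the paper targets under low-bit quantization.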