🤖 AI Summary
To address the challenge of balancing performance and efficiency in spoken language understanding (SLU) models under resource-constrained settings, this paper proposes the first multi-stage training framework that jointly optimizes knowledge distillation and neural network quantization. Departing from conventional two-stage, decoupled paradigms, the approach deeply integrates distillation with fine-grained quantization—including 1–2-bit asymmetric and layer-wise calibrated quantization—and introduces an SLU-specific loss function to enhance robustness and generalization under extreme low-bit regimes. Evaluated on the SLURP and FSC benchmarks, the method achieves accuracies of 71.13% and 99.20%, respectively, while reducing computational cost by 60–73× and model size by 83–700×, with accuracy degradation of at most 5.56%. This work unifies high-fidelity inference with aggressive model compression for edge-deployable SLU systems.
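To make the "1–2-bit asymmetric and layer-wise calibrated quantization" concrete, here is a minimal sketch of per-layer asymmetric uniform quantization. This is an illustration of the general technique, not the paper's implementation; the function name `quantize_layer` and the toy layer dictionary are invented for the example.

```python
import numpy as np

def quantize_layer(w, bits):
    # Asymmetric uniform quantization calibrated per layer: each weight
    # tensor gets its own scale and zero-point from its observed min/max,
    # so the integer grid covers exactly that layer's value range.
    levels = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((w - lo) / scale), 0, levels)  # integer codes
    return q * scale + lo  # dequantized ("fake-quantized") weights

# Hypothetical per-layer calibration over a toy two-layer model.
rng = np.random.default_rng(0)
layers = {"encoder": rng.normal(size=(8, 8)), "classifier": rng.normal(size=(8, 4))}
quantized = {name: quantize_layer(w, bits=2) for name, w in layers.items()}
```

At 2 bits each layer's weights collapse onto at most four distinct values, which is why calibrating the range per layer (rather than globally) matters for accuracy.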
📝 Abstract
Spoken Language Understanding (SLU) systems must balance performance and efficiency, particularly in resource-constrained environments. Existing methods apply distillation and quantization separately, leading to suboptimal compression as distillation ignores quantization constraints. We propose QUADS, a unified framework that optimizes both through multi-stage training with a pre-tuned model, enhancing adaptability to low-bit regimes while maintaining accuracy. QUADS achieves 71.13% accuracy on SLURP and 99.20% on FSC, with only minor degradations of up to 5.56% compared to state-of-the-art models. Additionally, it reduces computational complexity by 60–73× (GMACs) and model size by 83–700×, demonstrating strong robustness under extreme quantization. These results establish QUADS as a highly efficient solution for real-world, resource-constrained SLU applications.
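The abstract's core idea — training the compressed student against both hard labels and a teacher, rather than distilling first and quantizing afterward — can be sketched as a combined objective. This is a generic distillation loss (hard-label cross-entropy plus a temperature-scaled KL term), not QUADS's SLU-specific loss; the name `distill_quant_loss` and the weights `alpha`, `T` are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_quant_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Combined objective: hard-label cross-entropy on the ground truth
    # plus a temperature-scaled KL term pulling the (quantized) student
    # toward the full-precision teacher. During training the student's
    # forward pass would run through fake-quantized weights, so both
    # terms see the low-bit constraint jointly.
    p_s = softmax(student_logits)
    ce = -np.mean(np.log(p_s[np.arange(len(labels)), labels] + 1e-12))
    p_t = softmax(teacher_logits / T)
    log_ratio = np.log(p_t + 1e-12) - np.log(softmax(student_logits / T) + 1e-12)
    kl = np.mean(np.sum(p_t * log_ratio, axis=-1)) * T * T  # KL(teacher || student)
    return (1 - alpha) * ce + alpha * kl
```

Because the KL term is computed on the quantized student's outputs, gradient updates account for the low-bit constraint instead of treating compression as a separate post-hoc step.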