🤖 AI Summary
To address model lightweighting for speech enhancement under resource-constrained conditions, this paper proposes a knowledge distillation framework that integrates time-frequency calibration across intra-set and inter-set domains. Methodologically, it introduces a dual-stream time-frequency cross-calibrated weighted distillation mechanism that combines intra-set pairwise matching with inter-set residual fusion, enabling fine-grained, cross-layer, and cross-set knowledge transfer. The approach is adapted to the DPDCRN architecture via multi-level interactive distillation and residual-fusion-based feature aggregation. Distilled from the champion model of the L3DAS23 SE track, the student model achieves a 47% reduction in computational complexity while significantly outperforming mainstream distillation methods in PESQ and STOI, nearly matching teacher performance. The core contribution is a distillation paradigm that jointly models time-frequency dynamics and structural characteristics, validated for efficiency in low-latency, resource-constrained deployment scenarios.
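The cross-calibrated weighting described above can be sketched as follows. This is a minimal illustrative reading, not the paper's exact formulation: it assumes teacher and student expose aligned layer features of shape `(batch, channels, time, freq)`, derives per-layer time-domain and frequency-domain teacher-student similarities by pooling over the opposite axis, and cross-combines the two resulting weight vectors so that layers with larger mismatch receive more distillation weight. The pooling choice, temperature `tau`, and MSE feature loss are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def tf_cross_calibrated_weights(teacher_feats, student_feats, tau=1.0):
    """Sketch of dual-stream time-frequency cross-calibration weights.

    teacher_feats, student_feats: lists of aligned feature maps, each of
    shape (B, C, T, Freq). For each layer, a time-domain and a
    frequency-domain teacher-student similarity is computed, converted to
    per-layer weights, and the two views are cross-combined.
    """
    time_sims, freq_sims = [], []
    for t, s in zip(teacher_feats, student_feats):
        t_time, s_time = t.mean(dim=3), s.mean(dim=3)  # pool freq -> (B, C, T)
        t_freq, s_freq = t.mean(dim=2), s.mean(dim=2)  # pool time -> (B, C, Freq)
        time_sims.append(F.cosine_similarity(
            t_time.flatten(1), s_time.flatten(1), dim=1).mean())
        freq_sims.append(F.cosine_similarity(
            t_freq.flatten(1), s_freq.flatten(1), dim=1).mean())
    # Lower similarity -> larger weight: distill harder where mismatch is big.
    w_time = torch.softmax(-torch.stack(time_sims) / tau, dim=0)
    w_freq = torch.softmax(-torch.stack(freq_sims) / tau, dim=0)
    # Cross-weighting: combine the two views and renormalise over layers.
    w = w_time * w_freq
    return w / w.sum()


def calibrated_distill_loss(teacher_feats, student_feats):
    """Weighted multi-layer feature distillation loss (illustrative)."""
    w = tf_cross_calibrated_weights(teacher_feats, student_feats)
    losses = torch.stack([F.mse_loss(s, t.detach())
                          for t, s in zip(teacher_feats, student_feats)])
    return (w * losses).sum()
```

Multiplying the two weight vectors is one plausible way to "cross-weight": a layer is emphasised only if both its time-domain and frequency-domain views indicate a teacher-student gap.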
📝 Abstract
In recent years, complexity compression of neural network (NN)-based speech enhancement (SE) models has gradually attracted the attention of researchers, especially in scenarios with limited hardware resources or strict latency requirements. The main difficulty lies in balancing complexity against performance according to the characteristics of the task. In this paper, we propose an intra-inter set knowledge distillation (KD) framework with time-frequency calibration (I$^2$S-TFCKD) for SE. Unlike previous distillation strategies for SE, the proposed framework fully utilizes the time-frequency differential information of speech while promoting global knowledge flow. First, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates teacher-student similarity calibration weights in the time and frequency domains respectively and performs cross-weighting, thus enabling refined allocation of distillation contributions across different layers according to speech characteristics. Second, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. Subsequently, we generate a representative feature from each correlated set through residual fusion to form the fused feature set that enables inter-set knowledge interaction. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN) that ranked first in the SE track of the L3DAS23 challenge. Objective evaluations demonstrate that the proposed KD strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.
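The residual fusion step, which condenses each correlated set of layer features into one representative feature for inter-set interaction, might look roughly like the sketch below. This is an assumed reading, not the paper's exact module: each layer feature is projected to a common channel width with a hypothetical 1x1 convolution and accumulated residually in layer order; the teacher's and student's fused features can then be matched for inter-set distillation.

```python
import torch
import torch.nn as nn


class ResidualSetFusion(nn.Module):
    """Fuse a correlated set of layer features into one representative
    feature via residual accumulation (illustrative sketch; projection
    layers and fusion order are assumptions, not the paper's design)."""

    def __init__(self, in_channels, fused_channels):
        super().__init__()
        # One 1x1 projection per layer in the correlated set.
        self.projs = nn.ModuleList(
            nn.Conv2d(c, fused_channels, kernel_size=1) for c in in_channels)

    def forward(self, feats):
        # feats: list of (B, C_i, T, Freq) maps with matching T and Freq.
        fused = self.projs[0](feats[0])
        for proj, f in zip(self.projs[1:], feats[1:]):
            fused = fused + proj(f)  # residual accumulation across the set
        return fused
```

Applying one such module to the teacher's sets and one to the student's yields a fused feature per set on each side, and a feature-matching loss between corresponding fused features would realise the inter-set knowledge interaction the abstract describes.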