I$^2$S-TFCKD: Intra-Inter Set Knowledge Distillation with Time-Frequency Calibration for Speech Enhancement

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To enable lightweight speech enhancement models under resource-constrained conditions, this paper proposes a knowledge distillation framework that integrates time-frequency calibration across intra- and inter-set domains. Methodologically, it introduces a novel dual-stream time-frequency cross-calibrated weighted distillation mechanism, combining intra-set paired matching with inter-set residual fusion to enable fine-grained, cross-layer, and cross-set knowledge transfer. The approach is adapted to the DPDCRN architecture via multi-level interactive distillation and residual-fusion-based feature aggregation. Applied to DPDCRN, the champion model of the L3DAS23 SE track, the student model reduces computational complexity by 47% while significantly outperforming mainstream distillation methods in PESQ and STOI, nearly matching the teacher's performance. The core contribution is a distillation paradigm that jointly models time-frequency dynamics and structural characteristics, validated for efficiency in real-world low-latency deployment scenarios.

📝 Abstract
In recent years, complexity compression of neural network (NN)-based speech enhancement (SE) models has gradually attracted the attention of researchers, especially in scenarios with limited hardware resources or strict latency requirements. The main difficulties and challenges lie in achieving a balance between complexity and performance according to the characteristics of the task. In this paper, we propose an intra-inter set knowledge distillation (KD) framework with time-frequency calibration (I$^2$S-TFCKD) for SE. Different from previous distillation strategies for SE, the proposed framework fully utilizes the time-frequency differential information of speech while promoting global knowledge flow. Firstly, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates the teacher-student similarity calibration weights in the time and frequency domains respectively and performs cross-weighting, thus enabling refined allocation of distillation contributions across different layers according to speech characteristics. Secondly, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. Subsequently, we generate representative features from each correlated set through residual fusion to form the fused feature set that enables inter-set knowledge interaction. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN) that ranked first in the SE track of the L3DAS23 challenge. Objective evaluations demonstrate that the proposed KD strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.
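The abstract's first component, dual-stream time-frequency cross-calibration, computes teacher-student similarity separately in the time and frequency domains and cross-weights the results to allocate per-layer distillation contributions. The sketch below illustrates one plausible form of this idea in plain Python; the shapes, the use of cosine similarity, the softmax temperature `tau`, and the "lower similarity gets larger weight" convention are all assumptions for illustration, not the paper's exact formulation.

```python
import math

def _cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def _softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_calibrated_weights(teacher, student, tau=1.0):
    """One scalar distillation weight per layer, built from time- and
    frequency-domain teacher-student similarities, then cross-combined.
    teacher/student: lists of T x F feature maps (list-of-lists).
    Illustrative sketch only, not the paper's exact math."""
    sims_t, sims_f = [], []
    for t, s in zip(teacher, student):
        T, F = len(t), len(t[0])
        # pool over frequency -> time-stream vectors of length T
        t_time = [sum(row) / F for row in t]
        s_time = [sum(row) / F for row in s]
        # pool over time -> frequency-stream vectors of length F
        t_freq = [sum(t[i][j] for i in range(T)) / T for j in range(F)]
        s_freq = [sum(s[i][j] for i in range(T)) / T for j in range(F)]
        sims_t.append(_cos(t_time, s_time))
        sims_f.append(_cos(t_freq, s_freq))
    # assumption: lower similarity = bigger gap = larger distillation weight
    w_t = _softmax([-x / tau for x in sims_t])
    w_f = _softmax([-x / tau for x in sims_f])
    # cross-weighting: multiply the two streams per layer and renormalise
    prod = [a * b for a, b in zip(w_t, w_f)]
    z = sum(prod)
    return [p / z for p in prod]

# toy usage: three layers of 2 x 2 teacher/student features
teacher = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.2, 0.1], [0.4, 0.3]],
    [[1.0, 1.0], [1.0, 1.0]],
]
student = [
    [[1.1, 1.9], [2.8, 4.2]],
    [[0.1, 0.2], [0.3, 0.5]],
    [[0.9, 1.2], [1.1, 0.8]],
]
weights = cross_calibrated_weights(teacher, student)
```

The returned weights would then scale each layer's feature-matching loss, so layers where the student lags the teacher most (under this assumed convention) receive the largest share of the distillation signal.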
Problem

Research questions and friction points this paper is trying to address.

Balancing complexity and performance in neural network-based speech enhancement
Utilizing time-frequency differential information for refined knowledge distillation
Improving low-complexity student models via intra-inter set knowledge interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream time-frequency cross-calibration for distillation
Intra-inter set feature fusion for knowledge interaction
Multi-layer interactive distillation based on speech characteristics
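The second innovation, intra-inter set fusion, builds one representative feature per correlated set via residual fusion and then distills between the teacher's and student's representatives. The following is a minimal sketch of what residual fusion and an inter-set loss could look like; the update rule `fused + alpha * (layer - fused)`, the mixing factor `alpha`, and the MSE objective are hypothetical choices, not taken from the paper.

```python
def residual_fuse(features, alpha=0.5):
    """Fuse a correlated set of same-length layer features into one
    representative feature by residual accumulation (assumed form:
    fused <- fused + alpha * (layer - fused)). Each feature is a
    flat list of floats."""
    fused = list(features[0])
    for layer in features[1:]:
        fused = [f + alpha * (x - f) for f, x in zip(fused, layer)]
    return fused

def inter_set_loss(teacher_sets, student_sets, alpha=0.5):
    """Mean squared error between the residual-fused representatives
    of matched teacher/student sets (hypothetical objective)."""
    total, n = 0.0, 0
    for t_set, s_set in zip(teacher_sets, student_sets):
        ft = residual_fuse(t_set, alpha)
        fs = residual_fuse(s_set, alpha)
        total += sum((a - b) ** 2 for a, b in zip(ft, fs)) / len(ft)
        n += 1
    return total / n

# toy usage: two correlated sets, each holding two layer features
t_sets = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [1.5, 0.5]]]
s_sets = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [1.5, 0.5]]]
loss_same = inter_set_loss(t_sets, s_sets)      # identical sets -> 0.0
s_sets2 = [[[0.9, 2.1], [3.2, 3.8]], [[0.4, 0.6], [1.6, 0.4]]]
loss_diff = inter_set_loss(t_sets, s_sets2)
```

In this sketch, intra-set distillation would match the individual layer features pairwise, while the fused representatives carry the inter-set knowledge interaction described in the abstract.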
Jiaming Cheng
Arizona State University
Ruiyu Liang
UNSW
Chao Xu
School of Computer Science, Nanjing Audit University, Nanjing 211815, China
Ye Ni
School of Information Science and Engineering, Southeast University, Nanjing 210096, China
Wei Zhou
Cardiff University, CF10 3AT, United Kingdom
Bjorn W. Schuller
CHI – the Chair of Health Informatics, TUM University Hospital, Germany, and also with GLAM – the Group on Language, Audio, & Music, Imperial College London, UK
Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)