🤖 AI Summary
Traditional speech enhancement methods rely on handcrafted or pretrained feature-based losses, limiting their ability to model fine-grained signal characteristics. To address this, we propose a “model-as-loss” self-consistency training paradigm: the encoder of an end-to-end differentiable encoder-decoder network serves as a dynamic, task-aware loss function, constructing a self-supervised objective in a discriminative feature space that enforces intrinsic consistency between enhanced outputs and clean speech. Crucially, this loss is fully internal—requiring no external pretrained models (e.g., WavLM or wav2vec)—and emerges solely from the current network’s own representations. Experiments demonstrate that our approach surpasses state-of-the-art deep feature losses based on WavLM/wav2vec on standard benchmarks, yields significant improvements in objective perceptual metrics (e.g., PESQ and STOI) as well as subjective MOS ratings, and generalizes better both in-domain and cross-domain.
📝 Abstract
Conventional methods for speech enhancement rely on handcrafted loss functions (e.g., time- or frequency-domain losses) or deep feature losses (e.g., using WavLM or wav2vec), which often fail to capture subtle signal properties essential for optimal performance. To address this, we propose Model as Loss, a novel training paradigm that uses the encoder of the same model as a loss function to guide training. The Model as Loss paradigm leverages the encoder's task-specific feature space, optimizing the decoder to produce output consistent with the perceptual and task-relevant characteristics of the clean signal. By using the encoder's learned features as a loss function, this framework enforces self-consistency between the clean reference speech and the enhanced model output. Our approach outperforms pretrained deep feature losses on standard speech enhancement benchmarks, offering better perceptual quality and robust generalization to both in-domain and out-of-domain datasets.
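The core mechanism described in the abstract—comparing the enhanced output and the clean reference in the model's own encoder feature space—can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the encoder here is a toy linear-plus-ReLU stand-in, and all names (`encoder`, `model_as_loss`, `W`) are hypothetical; in the actual paradigm the encoder would be the (differentiable) encoder of the enhancement network itself, so gradients flow back to the decoder.

```python
import numpy as np

def encoder(x, W):
    """Toy stand-in for the model's own encoder (hypothetical):
    a linear projection followed by ReLU. No external pretrained
    model (WavLM/wav2vec) is involved."""
    return np.maximum(W @ x, 0.0)

def model_as_loss(enhanced, clean, W):
    """Self-consistency loss: mean L1 distance between encoder
    features of the enhanced output and the clean reference."""
    return np.abs(encoder(enhanced, W) - encoder(clean, W)).mean()

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))       # toy encoder weights
clean = rng.standard_normal(16)        # clean reference signal
noisy = clean + 0.1 * rng.standard_normal(16)  # imperfect output

# The loss vanishes when the output matches the clean reference
# in feature space, and is positive for a mismatched output.
print(model_as_loss(clean, clean, W))      # 0.0
print(model_as_loss(noisy, clean, W) > 0)  # True
```

In a real training loop, `enhanced` would be the decoder's output for a noisy input, and minimizing this feature-space distance pushes the enhanced speech toward the clean signal along the dimensions the encoder has learned to care about for the task.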