Model as Loss: A Self-Consistent Training Paradigm

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional speech enhancement methods rely on handcrafted or pretrained feature-based losses, limiting their ability to model fine-grained signal characteristics. To address this, we propose a "model-as-loss" self-consistency training paradigm: the encoder of an end-to-end differentiable encoder-decoder network serves as a dynamic, task-aware loss function, constructing a self-supervised objective in a discriminative feature space that enforces intrinsic consistency between enhanced outputs and clean speech. Crucially, this loss is fully internal—requiring no external pretrained models (e.g., WavLM or wav2vec)—and emerges solely from the current network's own representation. Experiments demonstrate that our approach surpasses state-of-the-art deep feature losses based on WavLM/wav2vec on standard benchmarks, yields significant improvements in objective and subjective speech quality metrics (e.g., PESQ, STOI, and MOS), and exhibits superior generalization both in-domain and cross-domain.

📝 Abstract
Conventional methods for speech enhancement rely on handcrafted loss functions (e.g., time or frequency domain losses) or deep feature losses (e.g., using WavLM or wav2vec), which often fail to capture subtle signal properties essential for optimal performance. To address this, we propose Model as Loss, a novel training paradigm that utilizes the encoder from the same model as a loss function to guide the training. The Model as Loss paradigm leverages the encoder's task-specific feature space, optimizing the decoder to produce output consistent with perceptual and task-relevant characteristics of the clean signal. By using the encoder's learned features as a loss function, this framework enforces self-consistency between the clean reference speech and the enhanced model output. Our approach outperforms pre-trained deep feature losses on standard speech enhancement benchmarks, offering better perceptual quality and robust generalization to both in-domain and out-of-domain datasets.
Problem

Research questions and friction points this paper is trying to address.

Replaces handcrafted losses with self-consistent model-based loss
Improves speech enhancement via encoder-guided task-specific features
Enhances perceptual quality and generalization across datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses the model's own encoder as the loss function
Enforces self-consistency in the encoder's feature space
Improves perceptual quality and cross-domain generalization
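The core idea can be sketched in a few lines of PyTorch: the enhanced output and the clean reference are both passed back through the network's own encoder, and the loss is the distance between their feature maps. This is a minimal illustration, not the paper's implementation; the `Enhancer` architecture, layer sizes, the L1 feature distance, and the choice to detach the clean-reference features are all assumptions for the sketch.

```python
import torch
import torch.nn as nn

class Enhancer(nn.Module):
    """Hypothetical minimal encoder-decoder speech enhancement network."""
    def __init__(self, dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(dim, dim, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(dim, 1, kernel_size=8, stride=4, padding=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def model_as_loss(net, noisy, clean):
    """Self-consistency loss: the current encoder acts as its own critic."""
    enhanced = net(noisy)
    # Re-encode the enhanced output with the *same* encoder being trained.
    feat_enhanced = net.encoder(enhanced)
    # Treat the clean-reference features as a fixed target (a design choice;
    # one could also let gradients flow through this path).
    with torch.no_grad():
        feat_clean = net.encoder(clean)
    # Distance in the encoder's task-specific feature space.
    return nn.functional.l1_loss(feat_enhanced, feat_clean)
```

A training step would then simply call `model_as_loss(net, noisy, clean).backward()`; because the loss lives in the encoder's own feature space, no external pretrained model (WavLM, wav2vec) is needed.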
Saisamarth Rajesh Phaye
Audio Machine Learning, Logitech
Milos Cernak
Logitech, EPFL - Quartier de l'Innovation
Meeting Speech, Speech Analysis-Synthesis and Coding, Pathological Speech Processing, Artificial Intelligence
Andrew Harper
Audio Machine Learning, Logitech