🤖 AI Summary
This work investigates what speech enhancement and neural vocoding have in common, and finds that both exhibit a shared low-rank degradation property. To exploit this insight, we propose the first joint modeling framework in which a single deep neural network performs both speech denoising/enhancement and high-fidelity waveform synthesis. Methodologically, our approach combines low-rank feature representation, multi-objective loss design, and end-to-end joint optimization, so that the consistent low-rank behavior of both tasks in the spectral rank space can be handled by one model. Experiments show that the unified model performs on par with dedicated task-specific models across standard metrics (PESQ, STOI, STFT-MSE), supporting the hypothesis that speech restoration tasks admit a unified formulation. The work thus connects two traditionally separate subfields of speech processing through a low-rank perspective.
📝 Abstract
Speech enhancement (SE) and neural vocoding are traditionally viewed as separate tasks. In this work, we observe them under a common thread: the rank behavior of these processes. This observation prompts two key questions: "Can a model designed for one task's rank degradation be adapted for the other?" and "Is it possible to address both tasks using a unified model?" Our empirical findings demonstrate that existing speech enhancement models can be successfully trained to perform vocoding tasks, and a single model, when jointly trained, can effectively handle both tasks with performance comparable to separately trained models. These results suggest that speech enhancement and neural vocoding can be unified under a broader framework of speech restoration. Code: https://github.com/Andong-Li-speech/Neural-Vocoders-as-Speech-Enhancers.
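
The abstract does not define "rank behavior" precisely, so the following is only an illustrative sketch under a stated assumption: that rank refers to the matrix rank of a time-frequency representation. It shows how compressing a spectrogram into a small number of frequency bands, as a vocoder's acoustic input typically is, bounds the rank of any linear back-projection. The random matrix, the sizes, the `pooling` matrix (a crude stand-in for a mel filterbank), and the helper `numerical_rank` are all hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def numerical_rank(matrix, rel_tol=1e-8):
    """Count singular values larger than rel_tol times the largest one."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(singular_values > rel_tol * singular_values[0]))


rng = np.random.default_rng(0)
n_bins, n_frames, n_bands = 257, 100, 20  # hypothetical sizes

# Stand-in for a magnitude spectrogram (random and non-negative, not real speech).
spectrogram = np.abs(rng.standard_normal((n_bins, n_frames)))

# Hypothetical band-pooling matrix: each row averages one contiguous group of bins.
pooling = np.zeros((n_bands, n_bins))
for band, bins in enumerate(np.array_split(np.arange(n_bins), n_bands)):
    pooling[band, bins] = 1.0 / len(bins)

# Compress to n_bands rows, then map back linearly with the pseudo-inverse.
compressed = pooling @ spectrogram                 # shape (n_bands, n_frames)
restored = np.linalg.pinv(pooling) @ compressed    # shape (n_bins, n_frames)

rank_original = numerical_rank(spectrogram)
rank_restored = numerical_rank(restored)
print("rank before compression:", rank_original)
print("rank after linear back-projection:", rank_restored)
```

Because `restored` factors through an `n_bands`-dimensional space, its rank can never exceed `n_bands`, however many frequency bins it has. Recovering the missing detail therefore requires a nonlinear model, which is the kind of restoration problem the abstract groups together with enhancement.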