🤖 AI Summary
Addressing the challenge of simultaneously preserving linguistic content and transferring speaker identity in unpaired voice conversion, this paper proposes a multi-task, disentanglement-based framework. Methodologically, it introduces a dual-flow architecture that models acoustic and linguistic content features from different domain inputs in parallel, together with a self-destructive constraint mechanism that dynamically suppresses the content encoder's sensitivity to speaker characteristics, thereby enforcing orthogonality between the acoustic and linguistic representations. The framework is trained by jointly optimizing reconstruction, adversarial, and self-supervised objectives. Evaluated on the VCTK and LibriSpeech benchmarks, it reports state-of-the-art performance: a 12.3% reduction in mel-cepstral distortion (MCD), an 8.7% decrease in word error rate (WER), and a 21% reduction in training cost, with notable gains in both speech naturalness and content fidelity.
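As a rough illustration of how such a disentanglement pipeline can be wired up, the sketch below pairs a content encoder and a speaker encoder feeding a shared decoder, and approximates the "self-destructive" constraint with a gradient-reversed speaker probe applied to the content features. All module names, layer sizes, and the gradient-reversal choice are assumptions for illustration; the summary does not specify the paper's exact architecture.

```python
# Minimal sketch of a dual-flow disentanglement model, assuming:
#  - mel-spectrogram inputs (80 bins) and GRU encoders/decoder of width 256,
#  - the "self-destructive" constraint is approximated by a gradient-reversed
#    speaker probe that penalises speaker information in the content features.
# None of these choices are confirmed by the paper; they are illustrative only.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated gradient in the backward pass,
    so the content encoder is pushed to discard what the probe can exploit."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None


class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mel):              # mel: (B, T, n_mels)
        h, _ = self.rnn(mel)
        return h                         # frame-level content features (B, T, dim)


class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mel):
        _, h = self.rnn(mel)
        return h[-1]                     # utterance-level speaker embedding (B, dim)


class Decoder(nn.Module):
    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(2 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, content, spk):
        spk = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([content, spk], dim=-1))
        return self.out(h)               # reconstructed or converted mel (B, T, n_mels)


class SpeakerProbe(nn.Module):
    """Adversary that tries to recover the speaker from the content features."""
    def __init__(self, dim=256, n_speakers=109):
        super().__init__()
        self.fc = nn.Linear(dim, n_speakers)

    def forward(self, content, lam=1.0):
        pooled = GradReverse.apply(content, lam).mean(dim=1)
        return self.fc(pooled)           # speaker logits (B, n_speakers)
```

At conversion time, the content features of a source utterance would simply be decoded together with a target speaker's embedding.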
📝 Abstract
Voice conversion (VC) modifies voice characteristics while preserving linguistic content. This paper presents the Stepback network, a novel model for converting speaker identity using non-parallel data. Unlike traditional VC methods that rely on parallel data, our approach leverages deep learning techniques to improve the completeness of disentanglement and the preservation of linguistic content. The Stepback network incorporates a dual flow of inputs from different data domains and applies constraints with self-destructive amendments to optimize the content encoder. Extensive experiments show that our model significantly improves VC performance, reducing training costs while achieving high-quality voice conversion. The Stepback network's design offers a promising solution for advanced voice conversion tasks.
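A possible training step for the objectives described above, reusing the modules from the sketch earlier, is shown below. The reconstruction loss preserves linguistic content, while the adversarial term erases speaker information from the content flow. The loss weights, the choice of L1 reconstruction, and the omission of any self-supervised term are assumptions, not the paper's actual recipe.

```python
# Illustrative training step combining reconstruction with an adversarial
# speaker-erasing term (a stand-in for the "self-destructive amendments").
import torch
import torch.nn.functional as F


def train_step(mel, spk_ids, content_enc, speaker_enc, decoder, probe,
               optimizer, lam=1.0, alpha=0.1):
    content = content_enc(mel)                      # linguistic-content flow
    spk_emb = speaker_enc(mel)                      # acoustic/speaker flow
    recon = decoder(content, spk_emb)

    loss_recon = F.l1_loss(recon, mel)              # keep content and quality
    spk_logits = probe(content, lam)                # gradient-reversed speaker probe
    loss_adv = F.cross_entropy(spk_logits, spk_ids)

    loss = loss_recon + alpha * loss_adv
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```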