🤖 AI Summary
Addressing the challenge of simultaneously preserving linguistic content and transferring speaker identity in unpaired voice conversion, this paper proposes a multi-task, disentanglement-based framework. Methodologically, it introduces a dual-flow architecture that models acoustic and linguistic content features from different domain inputs in parallel, together with a self-destructive constraint mechanism that dynamically suppresses the content encoder's sensitivity to speaker characteristics, thereby enforcing orthogonality between the acoustic and linguistic representations. The framework is trained by jointly optimizing reconstruction, adversarial, and self-supervised objectives. Evaluated on the VCTK and LibriSpeech benchmarks, it reports state-of-the-art performance: a 12.3% reduction in mel-cepstral distortion (MCD), an 8.7% decrease in word error rate (WER), and a 21% reduction in training cost, with notable gains in both speech naturalness and content fidelity.
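As a rough illustration of how such a disentanglement pipeline can be wired up, the sketch below pairs a content encoder and a speaker encoder feeding a shared decoder, and approximates the "self-destructive" constraint with a gradient-reversed speaker probe applied to the content features. All module names, layer sizes, and the gradient-reversal choice are assumptions for illustration; the summary does not specify the paper's exact architecture.

```python
# Minimal sketch of a dual-flow disentanglement model, assuming:
#  - mel-spectrogram inputs (80 bins) and GRU encoders/decoder of width 256,
#  - the "self-destructive" constraint is approximated by a gradient-reversed
#    speaker probe that penalises speaker information in the content features.
# None of these choices are confirmed by the paper; they are illustrative only.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated gradient in the backward pass,
    so the content encoder is pushed to discard what the probe can exploit."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None


class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mel):              # mel: (B, T, n_mels)
        h, _ = self.rnn(mel)
        return h                         # frame-level content features (B, T, dim)


class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mel):
        _, h = self.rnn(mel)
        return h[-1]                     # utterance-level speaker embedding (B, dim)


class Decoder(nn.Module):
    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(2 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, content, spk):
        spk = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([content, spk], dim=-1))
        return self.out(h)               # reconstructed or converted mel (B, T, n_mels)


class SpeakerProbe(nn.Module):
    """Adversary that tries to recover the speaker from the content features."""
    def __init__(self, dim=256, n_speakers=109):
        super().__init__()
        self.fc = nn.Linear(dim, n_speakers)

    def forward(self, content, lam=1.0):
        pooled = GradReverse.apply(content, lam).mean(dim=1)
        return self.fc(pooled)           # speaker logits (B, n_speakers)
```

At conversion time, the content features of a source utterance would simply be decoded together with a target speaker's embedding.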
📝 Abstract
Voice conversion (VC) modifies voice characteristics while preserving linguistic content. This paper presents the Stepback network, a novel model for converting speaker identity using non-parallel data. Unlike traditional VC methods that rely on parallel data, our approach leverages deep learning techniques to improve the completeness of disentanglement and the preservation of linguistic content. The Stepback network incorporates a dual flow of inputs from different data domains and applies constraints with self-destructive amendments to optimize the content encoder. Extensive experiments show that our model significantly improves VC performance, reducing training costs while achieving high-quality voice conversion. The Stepback network's design offers a promising solution for advanced voice conversion tasks.
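A possible training step for the objectives described above, reusing the modules from the sketch earlier, is shown below. The reconstruction loss preserves linguistic content, while the adversarial term erases speaker information from the content flow. The loss weights, the choice of L1 reconstruction, and the omission of any self-supervised term are assumptions, not the paper's actual recipe.

```python
# Illustrative training step combining reconstruction with an adversarial
# speaker-erasing term (a stand-in for the "self-destructive amendments").
import torch
import torch.nn.functional as F


def train_step(mel, spk_ids, content_enc, speaker_enc, decoder, probe,
               optimizer, lam=1.0, alpha=0.1):
    content = content_enc(mel)                      # linguistic-content flow
    spk_emb = speaker_enc(mel)                      # acoustic/speaker flow
    recon = decoder(content, spk_emb)

    loss_recon = F.l1_loss(recon, mel)              # keep content and quality
    spk_logits = probe(content, lam)                # gradient-reversed speaker probe
    loss_adv = F.cross_entropy(spk_logits, spk_ids)

    loss = loss_recon + alpha * loss_adv
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```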