AI Summary
This study addresses the challenge of effectively disentangling speaker timbre from speech content in open-set zero-shot voice conversion while preserving linguistic information. To this end, the authors propose a linear, invertible method that requires no additional training: a universal low-rank mapping from speech to content representations is learned via least-squares optimization, and a speaker-specific transformation is constructed using only a few seconds of target speech. This approach extends closed-set content decomposition to the open domain, enabling efficient timbre disentanglement and content preservation. Experimental results demonstrate that the method achieves intelligibility, naturalness, and speaker similarity comparable to existing approaches that rely on extensive target speech data or extra training, and it successfully facilitates speaker-conditioned text-to-speech model training.
Abstract
We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that USCF features, being training-efficient and timbre-disentangled, can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.
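The core recipe described above — a universal least-squares mapping from speech features to low-rank content, plus a per-speaker linear transform estimated from a few seconds of target speech — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all dimensions, variable names, and the random stand-in data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8       # speech feature dim, low content rank (assumed values)
n_train = 5000     # pooled frames from many training speakers

# Stand-in training data: speech features X and content targets C.
X = rng.standard_normal((n_train, d))
C = X @ rng.standard_normal((d, r))

# Universal speech-to-content mapping via least squares:
#   W = argmin_W ||X W - C||_F^2
W, *_ = np.linalg.lstsq(X, C, rcond=None)        # shape (d, r)

# Speaker-specific transform from a short target-speaker sample:
# fit a content-to-speech map for that speaker by least squares,
#   B_s = argmin_B ||C_s B - X_s||_F^2
n_spk = 300                                      # ~ a few seconds of frames
X_s = rng.standard_normal((n_spk, d))
C_s = X_s @ W
B_s, *_ = np.linalg.lstsq(C_s, X_s, rcond=None)  # shape (r, d)

# Zero-shot conversion sketch: source speech -> content -> target speaker.
X_src = rng.standard_normal((10, d))
X_converted = (X_src @ W) @ B_s                  # shape (10, d)
print(X_converted.shape)
```

Because both mappings are plain linear least-squares fits, fitting `B_s` needs no gradient training and only a small number of target-speaker frames, which is what makes the open-set, few-seconds setting tractable in this sketch.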