StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion

📅 2025-06-03
🤖 AI Summary
Traditional voice conversion (VC) methods model linguistic content only implicitly, which makes it hard to preserve speaker identity and semantic meaning at the same time. StarVC addresses this with a unified autoregressive VC framework built on a two-stage “text-prediction-first” architecture: (1) explicit linguistic modeling via ASR-guided discrete tokens, and (2) text-conditioned autoregressive generation of acoustic features, which allows semantics and timbre to be optimized in a disentangled way. The authors present StarVC as the first VC approach to integrate explicit text modeling with acoustic generation under a joint optimization training strategy. Experiments are reported to show that StarVC significantly outperforms state-of-the-art methods in word error rate (WER) and character error rate (CER), achieves a speaker embedding cosine similarity (SECS) of 0.87, and attains a mean opinion score (MOS) of 4.12, supporting the claim that language-aware modeling improves VC quality.
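
The objective metrics named in this summary can be illustrated with a minimal Python sketch. It assumes the standard textbook definitions: WER and CER as Levenshtein edit distance divided by reference length (over words and characters respectively), and SECS as the cosine similarity between two speaker embedding vectors. The function names, the whitespace tokenization, and the toy embeddings are assumptions made for illustration; the paper's actual evaluation pipeline (ASR model, speaker encoder, text normalization) is not described in this section.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def secs(emb_a, emb_b):
    """Speaker embedding cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm = math.sqrt(sum(a * a for a in emb_a)) * math.sqrt(sum(b * b for b in emb_b))
    return dot / norm


if __name__ == "__main__":
    print(wer("the cat sat", "the bat sat"))
    print(cer("the cat sat", "the bat sat"))
    print(secs([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))
```

In practice the hypothesis text would come from an ASR system run on the converted speech and the embeddings from a pretrained speaker encoder; lower WER and CER indicate better content preservation, and higher SECS indicates closer speaker similarity.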

📝 Abstract
Voice conversion (VC) modifies speech to match a target speaker while preserving linguistic content. Traditional methods usually extract speaker information directly from speech and neglect the explicit use of linguistic content. Since VC fundamentally involves disentangling speaker identity from linguistic content, leveraging structured semantic features could enhance conversion performance. However, previous attempts to incorporate semantic features into VC have shown limited effectiveness, which motivates the integration of explicit text modeling. We propose StarVC, a unified autoregressive VC framework that first predicts text tokens and then synthesizes acoustic features. Experiments demonstrate that StarVC outperforms conventional VC methods in preserving both linguistic content (measured by WER and CER) and speaker characteristics (measured by SECS and MOS). Audio demos are available at: https://thuhcsi.github.io/StarVC/.

Problem

Research questions and friction points this paper is trying to address.

Improving voice conversion by leveraging explicit text modeling
Enhancing speaker identity and linguistic content preservation
Unifying text and speech generation in a single autoregressive framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified autoregressive framework for VC
Predicts text tokens before synthesizing acoustic features
Improves preservation of both linguistic content and speaker identity
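
The text-first ordering listed above can be pictured with a toy decoding loop. This is only a sketch under stated assumptions, not the StarVC implementation: the marker tokens `<eot>` and `<eos>`, the function `decode_text_first`, and the scripted stand-in model are all hypothetical names introduced here. The only point it illustrates is that a single autoregressive sequence emits text tokens first, so every acoustic token is generated with the predicted text already in its context.

```python
from typing import Callable, List, Sequence, Tuple

EOT = "<eot>"  # hypothetical marker: end of the predicted text segment
EOS = "<eos>"  # hypothetical marker: end of the acoustic segment


def decode_text_first(
    prefix: Sequence[str],
    next_token: Callable[[List[str]], str],
    max_steps: int = 64,
) -> Tuple[List[str], List[str]]:
    """Decode one sequence in two phases: text tokens, then acoustic tokens.

    `next_token` stands in for an autoregressive model: it receives the full
    sequence so far (prefix + everything generated) and returns one token.
    """
    seq = list(prefix)
    text: List[str] = []
    acoustic: List[str] = []
    in_text_phase = True
    for _ in range(max_steps):
        tok = next_token(seq)
        seq.append(tok)  # every later step is conditioned on this token
        if in_text_phase:
            if tok == EOT:
                in_text_phase = False  # switch to acoustic generation
            else:
                text.append(tok)
        else:
            if tok == EOS:
                break
            acoustic.append(tok)
    return text, acoustic


def make_scripted_model(script: Sequence[str], prefix_len: int):
    """Toy stand-in for a trained model: replays a fixed token script."""
    def next_token(seq: List[str]) -> str:
        return script[len(seq) - prefix_len]
    return next_token


# Hypothetical source-speech prefix and a scripted output for demonstration.
prefix = ["<src_0>", "<src_1>"]
script = ["hello", "world", EOT, "a_17", "a_03", "a_42", EOS]
text, acoustic = decode_text_first(prefix, make_scripted_model(script, len(prefix)))
print(text)
print(acoustic)
```

A real system would replace the scripted function with a trained model's sampling step and would map the acoustic tokens or features to a waveform with a vocoder or codec decoder, neither of which is specified in this section.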