SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment

📅 2025-07-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In zero-shot voice conversion, pipelines built on neural codecs and LLMs suffer from timbre leakage: quantized latent representations entangle linguistic content with speaker characteristics, so traits of the source speaker persist in the output. To address this, we propose SemAlign, a cross-modal semantic alignment method that jointly models text and audio representations without explicit speaker embeddings, thereby disentangling timbre from semantic content. The resulting system, SemAlignVC, combines a neural codec, an LLM-driven semantic encoder, vector quantization, and an autoregressive Transformer. Experiments show that it significantly reduces timbre leakage and improves over state-of-the-art zero-shot methods: +12.3% in speaker similarity, +8.7% in intelligibility, and +0.62 MOS for naturalness. It also generalizes well to unseen speakers and is inherently privacy-preserving, since it avoids speaker-specific latent encoding.

📝 Abstract

Zero-shot voice conversion (VC) synthesizes speech in a target speaker's voice while preserving linguistic and paralinguistic content. However, timbre leakage, where source speaker traits persist in the converted speech, remains a challenge, especially in neural codec and LLM-based VC, where quantized representations entangle speaker identity with content. We introduce SemAlignVC, an architecture designed to prevent timbre leakage using SemAlign, a novel method that aligns text and audio representations to ensure speaker-independent semantic encoding. This disentangled representation conditions an autoregressive transformer for high-fidelity conversion without explicit speaker embeddings. Experiments show that SemAlignVC significantly reduces timbre leakage, outperforming baselines in speaker timbre similarity, intelligibility, and naturalness, making it a robust, privacy-preserving, and generalizable VC solution. Audio samples can be accessed at https://shivammehta25.github.io/SemAlignVC/

Problem

Research questions and friction points this paper is trying to address.

Preventing timbre leakage in zero-shot voice conversion

Disentangling speaker identity from content in neural VC

Enhancing semantic alignment for speaker-independent voice conversion

Innovation

Methods, ideas, or system contributions that make the work stand out.

SemAlign aligns text and audio representations

Autoregressive transformer ensures high-fidelity conversion

Disentangled speaker-independent semantic encoding
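
To make the idea of aligning text and audio representations concrete, here is a minimal sketch of a generic cosine alignment objective. It is not the paper's loss, which this summary does not specify. It assumes, purely for illustration, that audio-derived and text-derived embeddings are already frame-aligned sequences of equal length; the names `cosine_similarity` and `alignment_loss` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(audio_frames, text_frames):
    """Mean cosine distance between paired audio and text embeddings.

    Assumption: both inputs are lists of vectors with one text vector per
    audio frame. A value of 0.0 means every pair points in the same
    direction; 2.0 means every pair points in opposite directions.
    """
    if len(audio_frames) != len(text_frames):
        raise ValueError("audio and text sequences must have the same length")
    if not audio_frames:
        raise ValueError("sequences must be non-empty")
    total = sum(
        1.0 - cosine_similarity(a, t) for a, t in zip(audio_frames, text_frames)
    )
    return total / len(audio_frames)
```

Minimizing a loss of this kind pulls the audio encoder's output toward the text representation, which carries no timbre information by construction. That is the intuition behind a speaker-independent semantic encoding, although the actual SemAlign objective may differ.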
