AccentBox: Towards High-Fidelity Zero-Shot Accent Generation

📅 2024-09-13
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing zero-shot text-to-speech (TTS) methods suffer from low fidelity and poor controllability in unseen accent modeling. To address this, we propose a two-stage unified framework integrating foreign-accent conversion, accented TTS, and zero-shot TTS. Our key contributions are: (1) the first accent-aware zero-shot generation paradigm; (2) a speaker-independent pre-trained accent embedding space enabling zero-shot synthesis and cross-accent generation for previously unseen accents; and (3) accent representation extraction using a state-of-the-art accent recognition model, coupled with multi-task learning and disentangled representation learning for conditional generation. Experiments demonstrate an accent recognition F1 score of 0.56 on unseen speakers, significant improvements in both intrinsic and cross-accent synthesis fidelity, and, crucially, the first realization of genuine zero-shot speech synthesis for entirely unseen accents.

๐Ÿ“ Abstract
While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS, with a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) results on Accent Identification (AID) with an F1 score of 0.56 on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent/cross-accent generation, and enables unseen accent generation.
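The two-stage pipeline in the abstract can be illustrated with a minimal sketch: a stage-1 AID model whose pooled hidden state serves as a speaker-agnostic accent embedding, and a stage-2 TTS module conditioned on that (frozen) embedding. All module names, dimensions, and the GRU/linear architecture below are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of the two-stage pipeline: stage 1 trains an accent
# identification (AID) model; stage 2 conditions a zero-shot TTS decoder on
# the AID model's pooled accent embedding. Dimensions are made up.
import torch
import torch.nn as nn


class AccentEncoder(nn.Module):
    """Stage 1: AID backbone. The pooled hidden state doubles as a
    speaker-agnostic accent embedding (an assumption for this sketch)."""

    def __init__(self, feat_dim=80, embed_dim=256, n_accents=12):
        super().__init__()
        self.backbone = nn.GRU(feat_dim, embed_dim, batch_first=True)
        self.classifier = nn.Linear(embed_dim, n_accents)

    def forward(self, mel):                    # mel: (B, T, feat_dim)
        _, h = self.backbone(mel)              # h: (1, B, embed_dim)
        accent_emb = h.squeeze(0)              # (B, embed_dim)
        logits = self.classifier(accent_emb)   # AID training head
        return accent_emb, logits


class ConditionedTTS(nn.Module):
    """Stage 2: toy ZS-TTS decoder that conditions on the accent embedding
    by concatenating it onto every text-encoder frame."""

    def __init__(self, text_dim=128, accent_dim=256, mel_dim=80):
        super().__init__()
        self.proj = nn.Linear(text_dim + accent_dim, mel_dim)

    def forward(self, text_hidden, accent_emb):
        # text_hidden: (B, L, text_dim); accent_emb: (B, accent_dim)
        cond = accent_emb.unsqueeze(1).expand(-1, text_hidden.size(1), -1)
        return self.proj(torch.cat([text_hidden, cond], dim=-1))


# Usage: extract an accent embedding from reference audio, then synthesize.
aid = AccentEncoder()
tts = ConditionedTTS()
ref_mel = torch.randn(2, 100, 80)              # batch of reference mels
accent_emb, _ = aid(ref_mel)
text_hidden = torch.randn(2, 50, 128)          # fake text-encoder output
mel_out = tts(text_hidden, accent_emb.detach())  # detach: AID stays frozen
print(mel_out.shape)                           # torch.Size([2, 50, 80])
```

Detaching the embedding mirrors the paper's use of a *pretrained* AID model in stage 2: gradients from TTS training do not update the accent encoder.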
Problem

Research questions and friction points this paper is trying to address.

Accent Variation
Speech Synthesis
Unsupervised Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Accent Mimicry
Unseen Accent Learning
Text-to-Speech Adaptation
Jinzuomu Zhong
Centre for Speech Technology Research, University of Edinburgh, UK
Korin Richmond
Centre for Speech Technology Research, University of Edinburgh
Speech synthesis · articulatory modelling · articulatory-acoustic relationship · lexicography
Zhiba Su
Department of AI Technology, Transsion, China
Siqi Sun
Centre for Speech Technology Research, University of Edinburgh, UK