AccentBox: Towards High-Fidelity Zero-Shot Accent Generation

📅 2024-09-13
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing zero-shot text-to-speech (TTS) methods suffer from low fidelity and poor controllability in unseen accent modeling. To address this, we propose a two-stage unified framework integrating foreign-accent conversion, accented TTS, and zero-shot TTS. Our key contributions are: (1) the first accent-aware zero-shot generation paradigm; (2) a speaker-independent pre-trained accent embedding space enabling zero-shot synthesis and cross-accent generation for previously unseen accents; and (3) accent representation extraction using a state-of-the-art accent recognition model, coupled with multi-task learning and disentangled representation learning for conditional generation. Experiments demonstrate an accent recognition F1 score of 0.56 on unseen speakers, significant improvements in both intrinsic and cross-accent synthesis fidelity, and, crucially, the first realization of genuine zero-shot speech synthesis for entirely unseen accents.

๐Ÿ“ Abstract
While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS, with a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) results on Accent Identification (AID) with an F1 score of 0.56 on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent/cross-accent generation, and enables unseen accent generation.
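The two-stage pipeline in the abstract can be illustrated with a minimal sketch: a stage-1 AID model whose pooled hidden state serves as a speaker-agnostic accent embedding, and a stage-2 TTS module conditioned on that (frozen) embedding. All module names, dimensions, and the GRU/linear architecture below are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of the two-stage pipeline: stage 1 trains an accent
# identification (AID) model; stage 2 conditions a zero-shot TTS decoder on
# the AID model's pooled accent embedding. Dimensions are made up.
import torch
import torch.nn as nn


class AccentEncoder(nn.Module):
    """Stage 1: AID backbone. The pooled hidden state doubles as a
    speaker-agnostic accent embedding (an assumption for this sketch)."""

    def __init__(self, feat_dim=80, embed_dim=256, n_accents=12):
        super().__init__()
        self.backbone = nn.GRU(feat_dim, embed_dim, batch_first=True)
        self.classifier = nn.Linear(embed_dim, n_accents)

    def forward(self, mel):                    # mel: (B, T, feat_dim)
        _, h = self.backbone(mel)              # h: (1, B, embed_dim)
        accent_emb = h.squeeze(0)              # (B, embed_dim)
        logits = self.classifier(accent_emb)   # AID training head
        return accent_emb, logits


class ConditionedTTS(nn.Module):
    """Stage 2: toy ZS-TTS decoder that conditions on the accent embedding
    by concatenating it onto every text-encoder frame."""

    def __init__(self, text_dim=128, accent_dim=256, mel_dim=80):
        super().__init__()
        self.proj = nn.Linear(text_dim + accent_dim, mel_dim)

    def forward(self, text_hidden, accent_emb):
        # text_hidden: (B, L, text_dim); accent_emb: (B, accent_dim)
        cond = accent_emb.unsqueeze(1).expand(-1, text_hidden.size(1), -1)
        return self.proj(torch.cat([text_hidden, cond], dim=-1))


# Usage: extract an accent embedding from reference audio, then synthesize.
aid = AccentEncoder()
tts = ConditionedTTS()
ref_mel = torch.randn(2, 100, 80)              # batch of reference mels
accent_emb, _ = aid(ref_mel)
text_hidden = torch.randn(2, 50, 128)          # fake text-encoder output
mel_out = tts(text_hidden, accent_emb.detach())  # detach: AID stays frozen
print(mel_out.shape)                           # torch.Size([2, 50, 80])
```

Detaching the embedding mirrors the paper's use of a *pretrained* AID model in stage 2: gradients from TTS training do not update the accent encoder.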
Problem

Research questions and friction points this paper is trying to address.

Accent Variation
Speech Synthesis
Unsupervised Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Accent Mimicry
Unseen Accent Learning
Text-to-Speech Adaptation
Jinzuomu Zhong
Centre for Speech Technology Research, University of Edinburgh, UK
Korin Richmond
Centre for Speech Technology Research, University of Edinburgh
Speech synthesis · articulatory modelling · articulatory-acoustic relationship · lexicography
Zhiba Su
Department of AI Technology, Transsion, China
Siqi Sun
Centre for Speech Technology Research, University of Edinburgh, UK