🤖 AI Summary
This survey examines core challenges in accent conversion (data alignment, insufficient disentanglement of representations, data scarcity, and speaker identity preservation) by tracing the field's evolution from early rule-based signal processing techniques, such as spectral warping and formant analysis, to modern reference-free neural voice conversion architectures. It combines sociolinguistic perspectives with technical analysis in a problem-driven framework, clarifying the constraints that different application scenarios impose and highlighting the trade-off between accent modification and speaker identity preservation. The work also reviews commonly used datasets and evaluation methodologies, and outlines directions toward high-fidelity, identity-preserving, and controllable accent conversion, offering a conceptual framework and practical guidance for future research.
📝 Abstract
Accent conversion has rapidly progressed alongside growing interest in improving global cross-cultural communication. This survey presents an overview of the evolution of accent conversion methodologies, analyzing how the field has developed in response to fundamental challenges related to data alignment, representation disentanglement, and resource scarcity. We trace the progression from early rule-based digital signal processing approaches such as spectral manipulation and formant-based analysis to modern neural architectures capable of flexible and reference-free accent transformation. In addition, the survey situates accent conversion within its linguistic foundations and examines how different application requirements impose varying constraints on the balance between accent modification and speaker identity preservation. Finally, it reviews commonly used speech datasets and evaluation methodologies, identifies persistent challenges, and outlines directions for future research aimed at achieving more controllable and perceptually consistent accent conversion.