🤖 AI Summary
This work addresses the challenge of simultaneously preserving speaker identity, prosodic continuity, and phonemic accuracy in real-time non-native accent conversion. To this end, we propose the first streaming non-autoregressive accent conversion framework. Methodologically: (1) we design a low-latency streaming encoder based on Emformer to enable incremental acoustic modeling; (2) we introduce a joint TTS-auxiliary training paradigm, leveraging high-fidelity native-TTS outputs to provide differentiable supervision that enhances pronunciation naturalness and accuracy; (3) we optimize the non-autoregressive inference mechanism to achieve stable, end-to-end latency under 300 ms. Experiments demonstrate that our system significantly outperforms existing streaming methods in pronunciation quality while maintaining speaker identity and prosody, matching the performance of state-of-the-art offline models. It is the first streaming accent conversion system that jointly achieves practical deployability and academic novelty.
📝 Abstract
We propose a first streaming accent conversion (AC) model that transforms non-native speech into a native-like accent while preserving speaker identity, prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves comparable performance to the top AC models while maintaining stable latency, making it the first AC system capable of streaming.