🤖 AI Summary
Existing accent conversion (AC) methods, including foreign accent conversion (FAC), offer no explicit control over conversion strength, which makes it hard to modify accent accurately while also preserving speaker identity. To address this, we propose the first controllable zero-shot FAC framework. Our approach uses a factorized speech codec to disentangle speech into three components: linguistic content, prosody (pitch contour and phoneme duration), and speaker identity. An explicit, user-controllable accent modification parameter allows targeted adjustment of pronunciation while preserving suprasegmental prosodic cues and speaker-specific characteristics. Experiments show conversion quality comparable to recent AC systems and stronger preservation of speaker identity. The framework also supports fine-grained, user-adjustable control over accent strength, without requiring parallel or speaker-specific training data.
📝 Abstract
Previous accent conversion (AC) methods, including foreign accent conversion (FAC), lack explicit control over the degree of modification. Because accent modification can alter the perceived speaker identity, balancing conversion strength and identity preservation is crucial. We present an AC framework that provides an explicit, user-controllable parameter for accent modification. The method targets pronunciation while preserving suprasegmental cues such as intonation and phoneme durations. Results show performance comparable to recent AC systems, stronger preservation of speaker identity, and unique support for controllable accent conversion.
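To make the idea of a user-controllable strength parameter concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes that speech has already been split into the three factors named in the summary (content, prosody, speaker) and that accent strength can be approximated by linearly interpolating the content factor toward a target-accent version. The names `FactorizedSpeech`, `convert_accent`, and `strength`, and the toy vectors, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FactorizedSpeech:
    """Hypothetical container for the three factors named in the summary."""
    content: List[float]  # pronunciation-related features (where accent is modified)
    prosody: List[float]  # e.g. pitch contour and phoneme durations
    speaker: List[float]  # speaker identity embedding


def convert_accent(source: FactorizedSpeech,
                   target_content: List[float],
                   strength: float) -> FactorizedSpeech:
    """Blend only the content factor toward the target accent.

    strength = 0.0 returns the source content unchanged; strength = 1.0
    returns the target content. Linear interpolation is an illustrative
    assumption, not the method described in the paper.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    if len(source.content) != len(target_content):
        raise ValueError("content vectors must have the same length")
    blended = [(1.0 - strength) * s + strength * t
               for s, t in zip(source.content, target_content)]
    # Prosody and speaker factors are copied through untouched.
    return FactorizedSpeech(content=blended,
                            prosody=list(source.prosody),
                            speaker=list(source.speaker))


# Toy usage with made-up numbers.
utt = FactorizedSpeech(content=[0.0, 2.0], prosody=[120.0, 0.08], speaker=[0.5, -0.5])
half = convert_accent(utt, target_content=[1.0, 4.0], strength=0.5)
print(half.content)  # [0.5, 3.0]
```

In this sketch only the content factor changes with `strength`, which mirrors the stated goal of adjusting pronunciation while leaving intonation, phoneme durations, and speaker identity as they were.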