🤖 AI Summary
This work tackles two challenges in zero-shot voice imitation: the scarcity of high-quality parallel data, and the audio-quality ceiling imposed when synthetic speech serves as the training target. The authors reverse the conventional roles of synthetic and real speech during training (synthetic utterances become the source; natural recordings become the target) and combine autoregressive generation with interleaved text-audio modeling. Without relying on complex disentanglement architectures, the method uses pseudo-parallel corpora and a preference-based post-training stage to align outputs with human preferences. Experiments show that the approach significantly outperforms existing methods in speech naturalness while preserving speaker identity, accent, and emotional expression with high fidelity.
📝 Abstract
Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where the source and target share the same content but the target matches the reference's voice characteristics; such data, however, is extremely scarce. Existing approaches either employ carefully designed disentanglement architectures to bypass this data scarcity or leverage external systems to synthesize pseudo-parallel training data. However, the former requires intricate model design, and the latter faces a quality ceiling when synthetic speech is used as the training target. To address these limitations, we propose MimicLM, which takes a novel approach by using synthetic speech as training sources while retaining real recordings as targets. This design enables the model to learn directly from real speech distributions, breaking the synthetic quality ceiling. Building on this data construction approach, we incorporate interleaved text-audio modeling to guide the generation of content-accurate speech and apply post-training with preference alignment to mitigate the inherent distributional mismatch introduced by training on synthetic sources. Experiments demonstrate that MimicLM achieves superior voice imitation quality with a simple yet effective architecture, significantly outperforming existing methods in naturalness while maintaining competitive similarity scores across speaker identity, accent, and emotion dimensions.
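To make the reversed-role data construction concrete, the sketch below illustrates how pseudo-parallel triplets might be assembled with synthetic speech as the source and real recordings as the target. This is a minimal illustration based only on the abstract; the `synthesize` function, the `Triplet` fields, and the corpus layout are all hypothetical stand-ins, not the paper's actual pipeline or API.

```python
# Hypothetical sketch of the data construction described in the abstract:
# real recordings are kept as training TARGETS, while an external TTS system
# re-synthesizes each transcript to serve as the training SOURCE.
from dataclasses import dataclass
from typing import List, Dict


@dataclass
class Triplet:
    source_audio: str     # synthetic utterance carrying the same content
    reference_audio: str  # real clip defining the target timbre/style
    target_audio: str     # real recording: the model learns from real speech


def synthesize(text: str) -> str:
    """Stand-in for an external TTS system (assumed, not specified in the paper)."""
    return f"tts({text})"


def build_pseudo_parallel(corpus: List[Dict[str, str]]) -> List[Triplet]:
    """Pair a synthetic rendition of each transcript (source) with the real
    recording itself (target), plus another real clip of the same speaker
    as the timbre/style reference."""
    return [
        Triplet(
            source_audio=synthesize(item["text"]),
            reference_audio=item["reference_clip"],
            target_audio=item["real_audio"],
        )
        for item in corpus
    ]


corpus = [{"text": "hello world",
           "real_audio": "spk1_utt1.wav",
           "reference_clip": "spk1_utt2.wav"}]
triplets = build_pseudo_parallel(corpus)
print(triplets[0].target_audio)  # the real recording, not synthetic speech
```

Because the target side is always a natural recording, a model trained on such triplets regresses toward the real-speech distribution rather than toward TTS artifacts, which is the claimed route past the synthetic quality ceiling.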