MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work tackles two obstacles in zero-shot voice imitation: the scarcity of high-quality parallel data, and the audio-quality ceiling imposed by using synthetic speech as the training target. The authors reverse the conventional roles of synthetic and real speech during training (treating synthetic utterances as sources and natural recordings as targets) and combine autoregressive generation with interleaved text-audio modeling. Without relying on complex disentanglement architectures, the method leverages pseudo-parallel corpora and a preference-based post-training stage to align outputs with human preferences. Experiments show that the approach significantly outperforms existing methods in speech naturalness while preserving speaker identity, accent, and emotional expression with high fidelity.

📝 Abstract
Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but target matches the reference's voice characteristics, yet such data is extremely scarce. Existing approaches either employ carefully designed disentanglement architectures to bypass this data scarcity or leverage external systems to synthesize pseudo-parallel training data. However, the former requires intricate model design, and the latter faces a quality ceiling when synthetic speech is used as training targets. To address these limitations, we propose MimicLM, which takes a novel approach by using synthetic speech as training sources while retaining real recordings as targets. This design enables the model to learn directly from real speech distributions, breaking the synthetic quality ceiling. Building on this data construction approach, we incorporate interleaved text-audio modeling to guide the generation of content-accurate speech and apply post-training with preference alignment to mitigate the inherent distributional mismatch when training on synthetic data. Experiments demonstrate that MimicLM achieves superior voice imitation quality with a simple yet effective architecture, significantly outperforming existing methods in naturalness while maintaining competitive similarity scores across speaker identity, accent, and emotion dimensions.
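The abstract's central idea, using synthetic speech as the source and real recordings as the target of each (source, reference, target) triplet, can be sketched as a simple data-construction step. The sketch below is illustrative only: every function and field name is a hypothetical stand-in, not the paper's actual pipeline.

```python
# Hypothetical sketch of pseudo-parallel triplet construction as described
# in the abstract: a TTS system re-synthesizes the transcript of a real
# recording; the synthetic utterance becomes the SOURCE, while the real
# recording remains the TARGET. All names here are illustrative.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Triplet:
    source_audio: list     # synthetic speech (placeholder for samples/tokens)
    reference_audio: list  # a different real clip by the target speaker
    target_audio: list     # the real recording itself (ground truth)

def build_pseudo_parallel(
    real_utterances: List[dict],
    synthesize: Callable[[str], list],
) -> List[Triplet]:
    """Reverse the conventional roles: synthetic -> source, real -> target."""
    triplets = []
    for utt in real_utterances:
        synthetic = synthesize(utt["text"])  # same content, arbitrary voice
        triplets.append(
            Triplet(
                source_audio=synthetic,
                reference_audio=utt["reference"],
                target_audio=utt["audio"],  # model learns from REAL speech
            )
        )
    return triplets

# Toy usage with a dummy TTS stand-in:
dummy_tts = lambda text: [0.0] * len(text)
corpus = [{"text": "hello", "audio": [0.1, 0.2], "reference": [0.3]}]
data = build_pseudo_parallel(corpus, dummy_tts)
```

Because the target side is always a natural recording, the model's output distribution is supervised by real speech, which is how the paper argues the synthetic-quality ceiling is avoided.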
Problem

Research questions and friction points this paper is trying to address.

voice imitation
zero-shot
pseudo-parallel data
speech synthesis
timbre transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot voice imitation
pseudo-parallel speech
autoregressive modeling
preference alignment
text-audio interleaving