🤖 AI Summary
To address the high entry barrier for beginners in multimodal audio generation and the scarcity of high-quality multilingual data and unified toolchains, this paper introduces Amphion v0.2, an open-source audio generation toolkit. The release constructs the first 100K-hour open-source multilingual speech dataset; designs a unified preprocessing pipeline and novel model architectures for text-to-speech (TTS), neural audio coding, and non-parallel voice conversion; and provides end-to-end PyTorch implementations. Key contributions include significantly improved cross-lingual TTS naturalness; state-of-the-art audio reconstruction fidelity in neural audio coding; voice conversion MOS scores ≥ 4.0; and comprehensive, modular APIs with systematic tutorials, which together substantially lower research and deployment costs. Amphion v0.2 has gained widespread adoption in the research and engineering communities.
📝 Abstract
Amphion is an open-source toolkit for Audio, Music, and Speech Generation, designed to lower the entry barrier for junior researchers and engineers in these fields. It provides a versatile framework that supports a variety of generation tasks and models. In this report, we introduce Amphion v0.2, the second major release developed in 2024. This release features a 100K-hour open-source multilingual dataset, a robust data preparation pipeline, and novel models for tasks such as text-to-speech, audio coding, and voice conversion. Furthermore, the report includes multiple tutorials that guide users through the functionalities and usage of the newly released models.