🤖 AI Summary
This work addresses key limitations of existing audio editing models: insufficient expressiveness, limited iterative controllability, and weak zero-shot text-to-speech (TTS) capability. We propose the first open-source, end-to-end audio editing system built on a large language model (LLM) architecture. Methodologically, we abandon conventional embedding priors and auxiliary modules in favor of a novel learning paradigm that relies solely on large-margin synthetic data. This enables fine-grained, multi-turn control over cross-speaker emotional prosody, intonation, and paralinguistic features (including pauses, stress, and vocal attitude) without explicit representation disentanglement, significantly enhancing speech expressiveness and generalization. Experiments demonstrate substantial improvements over MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 on emotion editing and paralinguistic control tasks, alongside strong zero-shot TTS performance. Our approach establishes a new paradigm for controllable speech generation.
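To make the large-margin idea concrete, the following is a minimal Python sketch of how such training pairs might be filtered: only edits that move a target attribute by a clearly large amount are kept. The `EditPair` fields, the precomputed scores, and the `MARGIN` threshold are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EditPair:
    instruction: str     # e.g. "make this sound happier"
    source_audio: str    # path to the original clip
    edited_audio: str    # path to the synthesized edit
    score_before: float  # target-attribute score of the source (e.g. from an emotion classifier)
    score_after: float   # target-attribute score of the edit

MARGIN = 0.4  # illustrative threshold, not a value from the paper

def select_large_margin_pairs(candidates: List[EditPair],
                              margin: float = MARGIN) -> List[EditPair]:
    """Keep only pairs whose edit shifts the attribute score by at least `margin`."""
    return [p for p in candidates if p.score_after - p.score_before >= margin]

pairs = [
    EditPair("sound happier", "a.wav", "a_edit.wav", 0.20, 0.85),  # kept: gap 0.65
    EditPair("sound happier", "b.wav", "b_edit.wav", 0.50, 0.60),  # dropped: gap 0.10
]
print(len(select_large_margin_pairs(pairs)))  # -> 1
```

Under this reading, a large margin gives the model an unambiguous training signal for each attribute, which is what lets it learn control without explicit representation disentanglement.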
📝 Abstract
We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing, encompassing emotion, speaking style, and paralinguistics, alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
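As a rough illustration of the iterative control described above, the sketch below chains edits turn by turn, feeding each result back in as the input for the next instruction. The `edit_fn` callable is a hypothetical stand-in for the model's editing interface; the real Step-Audio-EditX API may differ.

```python
def iterative_edit(audio, instructions, edit_fn):
    """Apply editing instructions one turn at a time, feeding each result back in."""
    for instruction in instructions:
        audio = edit_fn(audio, instruction)  # one model editing pass per turn
    return audio

# Usage with an identity stand-in; a real call would invoke the editing model.
result = iterative_edit(
    "input.wav",
    [
        "speak in a sadder tone",
        "add a short pause after the first clause",
        "put stress on the word 'never'",
    ],
    edit_fn=lambda audio, text: audio,  # stand-in for demonstration only
)
print(result)  # -> "input.wav"
```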