🤖 AI Summary
This work addresses key limitations of existing audio editing models: insufficient expressiveness, limited iterative controllability, and weak zero-shot text-to-speech (TTS) capability. We propose the first open-source, end-to-end audio editing system built on a large language model (LLM) architecture. Methodologically, we abandon conventional embedding priors and auxiliary modules in favor of a novel learning paradigm that relies solely on large-margin synthetic data. This enables fine-grained, multi-turn control over cross-speaker emotional prosody, intonation, and paralinguistic features (including pauses, stress, and vocal attitude) without explicit representation disentanglement, significantly enhancing speech expressiveness and generalization. Experiments demonstrate substantial improvements over MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 on emotion editing and paralinguistic control tasks, alongside strong zero-shot TTS performance. Our approach establishes a new paradigm for controllable speech generation.
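To make the large-margin idea concrete, the following is a minimal Python sketch of how such training pairs might be filtered: only edits that move a target attribute by a clearly large amount are kept. The `EditPair` fields, the precomputed scores, and the `MARGIN` threshold are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EditPair:
    instruction: str     # e.g. "make this sound happier"
    source_audio: str    # path to the original clip
    edited_audio: str    # path to the synthesized edit
    score_before: float  # target-attribute score of the source (e.g. from an emotion classifier)
    score_after: float   # target-attribute score of the edit

MARGIN = 0.4  # illustrative threshold, not a value from the paper

def select_large_margin_pairs(candidates: List[EditPair],
                              margin: float = MARGIN) -> List[EditPair]:
    """Keep only pairs whose edit shifts the attribute score by at least `margin`."""
    return [p for p in candidates if p.score_after - p.score_before >= margin]

pairs = [
    EditPair("sound happier", "a.wav", "a_edit.wav", 0.20, 0.85),  # kept: gap 0.65
    EditPair("sound happier", "b.wav", "b_edit.wav", 0.50, 0.60),  # dropped: gap 0.10
]
print(len(select_large_margin_pairs(pairs)))  # -> 1
```

Under this reading, a large margin gives the model an unambiguous training signal for each attribute, which is what lets it learn control without explicit representation disentanglement.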
📝 Abstract
We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing, encompassing emotion, speaking style, and paralinguistics, alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
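As a rough illustration of the iterative control described above, the sketch below chains edits turn by turn, feeding each result back in as the input for the next instruction. The `edit_fn` callable is a hypothetical stand-in for the model's editing interface; the real Step-Audio-EditX API may differ.

```python
def iterative_edit(audio, instructions, edit_fn):
    """Apply editing instructions one turn at a time, feeding each result back in."""
    for instruction in instructions:
        audio = edit_fn(audio, instruction)  # one model editing pass per turn
    return audio

# Usage with an identity stand-in; a real call would invoke the editing model.
result = iterative_edit(
    "input.wav",
    [
        "speak in a sadder tone",
        "add a short pause after the first clause",
        "put stress on the word 'never'",
    ],
    edit_fn=lambda audio, text: audio,  # stand-in for demonstration only
)
print(result)  # -> "input.wav"
```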