MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model

📅 2025-12-23
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing text-guided audio editing methods face three key limitations: (1) training-free approaches suffer from audio quality degradation caused by diffusion inversion; (2) supervised methods are constrained by the scarcity of high-quality paired data and cover only a narrow set of edit types; and (3) modality-decoupled architectures struggle to achieve fine-grained alignment between natural language instructions and acoustic features. This work first formally defines a comprehensive audio editing benchmark covering five fundamental operations (addition, replacement, deletion, reordering, and attribute modification) and introduces an event-level fine-grained annotation paradigm for synthetic data generation. It then proposes a unified editing architecture built on deep audio-language alignment: a Qwen2-Audio encoder, an MMDiT-based generator, and a joint instruction-tuning strategy. Experiments demonstrate 98.7% fidelity preservation in unedited regions, a 12.4% average improvement in editing accuracy over state-of-the-art methods, and significant gains in edit localization precision and instruction-following robustness.
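The five operations plus the event-level annotation paradigm imply a structured instruction format for the synthetic pairs. Below is a minimal sketch of what such a schema could look like; all class names and fields here are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EditOp(Enum):
    """The five operations the benchmark defines."""
    ADD = "add"
    REPLACE = "replace"
    DELETE = "delete"
    REORDER = "reorder"
    MODIFY_ATTRIBUTE = "modify_attribute"


@dataclass
class AudioEvent:
    """Event-level annotation: a labeled sound event with its time span."""
    label: str       # e.g. "dog bark"
    onset_s: float   # event start, seconds
    offset_s: float  # event end, seconds


@dataclass
class EditInstruction:
    """One paired example: source audio + instruction -> target audio."""
    op: EditOp
    text: str                                # natural-language instruction
    target_event: AudioEvent                 # event the edit applies to
    new_event: Optional[AudioEvent] = None   # payload for add/replace


# Example: "replace the dog bark between 2 s and 3.5 s with a cat meow"
instr = EditInstruction(
    op=EditOp.REPLACE,
    text="Replace the dog bark with a cat meow.",
    target_event=AudioEvent("dog bark", 2.0, 3.5),
    new_event=AudioEvent("cat meow", 2.0, 3.5),
)
```

Under a schema like this, deletion simply drops `new_event`, reordering references two annotated events, and attribute modification keeps the span but changes a property such as loudness or pitch.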

📝 Abstract
Text-guided audio editing aims to modify specific acoustic events while strictly preserving non-target content. Despite recent progress, existing approaches remain fundamentally limited. Training-free methods often suffer from signal degradation caused by diffusion inversion, while training-based methods, although achieving higher generation quality, are severely constrained by the scarcity of high-quality paired data and task formulations that cover only a narrow subset of editing operations. In addition, standard architectures typically decouple text and audio processing, limiting the ability to align instructions with specific acoustic contexts. To address these challenges, we propose MMEdit, an audio-language-model-driven framework for unified audio editing. We systematically extend task definitions to cover a comprehensive range of editing operations, including addition, replacement, removal, reordering, and attribute modification. Furthermore, we design a scalable data synthesis pipeline to construct large-scale paired datasets with fine-grained event-level annotations. To capture complex editing semantics, we integrate a Qwen2-Audio encoder with an MMDiT-based generator, enabling precise cross-modal alignment and localized editing. Experimental results demonstrate that our method achieves superior editing localization accuracy, robust instruction following, and high fidelity in non-edited regions.
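The abstract pairs a Qwen2-Audio encoder with an MMDiT-based generator for cross-modal alignment. As a rough illustration of the MMDiT idea (modality-specific projection weights feeding one joint attention pass, so instruction tokens and audio latents can attend to each other), here is a minimal PyTorch sketch. Dimensions, names, and the reduced block layout are assumptions; real MMDiT blocks also carry timestep modulation and MLP sublayers, omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointAttentionBlock(nn.Module):
    """Minimal MMDiT-style block: per-modality weights, one joint attention.

    Instruction tokens and audio latent tokens are normalized and projected
    by modality-specific QKV layers, attended over the concatenated
    sequence, then routed back through modality-specific output projections.
    """

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.norm_txt, self.norm_aud = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv_txt, self.qkv_aud = nn.Linear(dim, 3 * dim), nn.Linear(dim, 3 * dim)
        self.proj_txt, self.proj_aud = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def _heads(self, qkv: torch.Tensor):
        # (B, L, 3*D) -> three (B, heads, L, head_dim) tensors
        B, L, _ = qkv.shape
        q, k, v = qkv.chunk(3, dim=-1)
        return [t.reshape(B, L, self.heads, self.head_dim).transpose(1, 2)
                for t in (q, k, v)]

    def forward(self, txt: torch.Tensor, aud: torch.Tensor):
        qt, kt, vt = self._heads(self.qkv_txt(self.norm_txt(txt)))
        qa, ka, va = self._heads(self.qkv_aud(self.norm_aud(aud)))
        Lt = txt.shape[1]
        # Joint attention: every head sees text and audio tokens together.
        out = F.scaled_dot_product_attention(
            torch.cat([qt, qa], dim=2),
            torch.cat([kt, ka], dim=2),
            torch.cat([vt, va], dim=2),
        )
        out = out.transpose(1, 2).flatten(2)  # (B, Lt+La, D)
        # Split back per modality; residual + modality-specific projection.
        return txt + self.proj_txt(out[:, :Lt]), aud + self.proj_aud(out[:, Lt:])


# Toy forward pass with illustrative shapes.
block = JointAttentionBlock()
txt = torch.randn(2, 16, 512)    # instruction-conditioning tokens
aud = torch.randn(2, 128, 512)   # audio latent tokens being denoised
txt, aud = block(txt, aud)       # each stream keeps its own shape
```

Keeping separate projections while sharing the attention pass is what lets the generator condition locally on the instruction rather than on a single pooled text embedding, which is one plausible reading of how the "localized editing" claim is achieved.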
Problem

Research questions and friction points this paper is trying to address.

Addresses signal degradation in training-free editing and the paired-data scarcity that limits supervised methods
Extends task definitions to five editing operations: addition, replacement, removal, reordering, and attribute modification
Enables precise cross-modal alignment between instructions and acoustic context for localized editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified audio-language model for multi-type editing operations
Scalable data synthesis pipeline with fine-grained event annotations (see the sketch after this list)
Qwen2-Audio encoder with MMDiT generator for cross-modal alignment
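The data synthesis bullet suggests pairs can be built by construction: start from a clip, apply a known edit, and record exactly what changed. A hedged numpy sketch of one such generator for the addition operation follows; the sample rate, gain, and instruction template are assumptions, not details from the paper:

```python
import numpy as np

SR = 16_000  # sample rate in Hz (an assumption; not stated in this summary)


def synthesize_addition_pair(base: np.ndarray, event: np.ndarray,
                             label: str, onset_s: float, gain: float = 0.8):
    """Build one 'addition' pair: the source lacks the event, the target has it.

    Returns (source, target, instruction, annotation). The event-level
    annotation records exactly what was inserted and where, which is what
    enables fine-grained supervision of edit localization.
    """
    source = base.copy()
    target = base.copy()
    start = int(onset_s * SR)
    end = min(start + len(event), len(target))
    target[start:end] += gain * event[: end - start]
    instruction = f"Add a {label} starting at {onset_s:.1f} seconds."
    annotation = {"op": "add", "label": label,
                  "onset_s": onset_s, "offset_s": end / SR}
    return source, target, instruction, annotation


# Toy usage with random noise standing in for real background and event clips.
rng = np.random.default_rng(0)
background = 0.1 * rng.standard_normal(10 * SR)  # 10 s background bed
bark = 0.5 * rng.standard_normal(2 * SR)         # 2 s event clip
src, tgt, instr, ann = synthesize_addition_pair(background, bark, "dog bark", 3.0)
```

Deletion and replacement pairs fall out of the same recipe by swapping the roles of source and target or substituting a different event clip, which is what lets a pipeline like this scale without manual annotation.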
🔎 Similar Papers
No similar papers found.
👥 Authors
Ye Tao
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Xuenan Xu
Shanghai Jiao Tong University
audio generation · audio understanding · speech synthesis
Wen Wu
Shanghai AI Laboratory
Shuai Wang
Nanjing University
Mengyue Wu
Shanghai Jiao Tong University
speech perception and production · affective computing · audio cognition
Chao Zhang
Shanghai AI Laboratory