MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model

📅 2025-12-23
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing text-guided audio editing methods face three key limitations: (1) training-free approaches suffer from audio quality degradation caused by diffusion inversion; (2) supervised methods are constrained by the scarcity of high-quality paired data and cover only a narrow set of edit types; and (3) modality-decoupled architectures struggle to achieve fine-grained alignment between natural language instructions and acoustic features. This work first formally defines a comprehensive audio editing benchmark covering five fundamental operations (addition, replacement, deletion, reordering, and attribute modification) and introduces an event-level fine-grained annotation paradigm for synthetic data generation. It then proposes a unified editing architecture built on deep audio-language alignment: a Qwen2-Audio encoder, an MMDiT-based generator, and a joint instruction-tuning strategy. Experiments demonstrate 98.7% fidelity preservation in unedited regions, a 12.4% average improvement in editing accuracy over state-of-the-art methods, and significant gains in edit localization precision and instruction-following robustness.
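The five operations plus the event-level annotation paradigm imply a structured instruction format for the synthetic pairs. Below is a minimal sketch of what such a schema could look like; all class names and fields here are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EditOp(Enum):
    """The five operations the benchmark defines."""
    ADD = "add"
    REPLACE = "replace"
    DELETE = "delete"
    REORDER = "reorder"
    MODIFY_ATTRIBUTE = "modify_attribute"


@dataclass
class AudioEvent:
    """Event-level annotation: a labeled sound event with its time span."""
    label: str       # e.g. "dog bark"
    onset_s: float   # event start, seconds
    offset_s: float  # event end, seconds


@dataclass
class EditInstruction:
    """One paired example: source audio + instruction -> target audio."""
    op: EditOp
    text: str                                # natural-language instruction
    target_event: AudioEvent                 # event the edit applies to
    new_event: Optional[AudioEvent] = None   # payload for add/replace


# Example: "replace the dog bark between 2 s and 3.5 s with a cat meow"
instr = EditInstruction(
    op=EditOp.REPLACE,
    text="Replace the dog bark with a cat meow.",
    target_event=AudioEvent("dog bark", 2.0, 3.5),
    new_event=AudioEvent("cat meow", 2.0, 3.5),
)
```

Under a schema like this, deletion simply drops `new_event`, reordering references two annotated events, and attribute modification keeps the span but changes a property such as loudness or pitch.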

📝 Abstract
Text-guided audio editing aims to modify specific acoustic events while strictly preserving non-target content. Despite recent progress, existing approaches remain fundamentally limited. Training-free methods often suffer from signal degradation caused by diffusion inversion, while training-based methods, although achieving higher generation quality, are severely constrained by the scarcity of high-quality paired data and task formulations that cover only a narrow subset of editing operations. In addition, standard architectures typically decouple text and audio processing, limiting the ability to align instructions with specific acoustic contexts. To address these challenges, we propose MMEdit, an audio-language-model-driven framework for unified audio editing. We systematically extend task definitions to cover a comprehensive range of editing operations, including addition, replacement, removal, reordering, and attribute modification. Furthermore, we design a scalable data synthesis pipeline to construct large-scale paired datasets with fine-grained event-level annotations. To capture complex editing semantics, we integrate a Qwen2-Audio encoder with an MMDiT-based generator, enabling precise cross-modal alignment and localized editing. Experimental results demonstrate that our method achieves superior editing localization accuracy, robust instruction following, and high fidelity in non-edited regions.
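The abstract pairs a Qwen2-Audio encoder with an MMDiT-based generator for cross-modal alignment. As a rough illustration of the MMDiT idea (modality-specific projection weights feeding one joint attention pass, so instruction tokens and audio latents can attend to each other), here is a minimal PyTorch sketch. Dimensions, names, and the reduced block layout are assumptions; real MMDiT blocks also carry timestep modulation and MLP sublayers, omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointAttentionBlock(nn.Module):
    """Minimal MMDiT-style block: per-modality weights, one joint attention.

    Instruction tokens and audio latent tokens are normalized and projected
    by modality-specific QKV layers, attended over the concatenated
    sequence, then routed back through modality-specific output projections.
    """

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.norm_txt, self.norm_aud = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv_txt, self.qkv_aud = nn.Linear(dim, 3 * dim), nn.Linear(dim, 3 * dim)
        self.proj_txt, self.proj_aud = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def _heads(self, qkv: torch.Tensor):
        # (B, L, 3*D) -> three (B, heads, L, head_dim) tensors
        B, L, _ = qkv.shape
        q, k, v = qkv.chunk(3, dim=-1)
        return [t.reshape(B, L, self.heads, self.head_dim).transpose(1, 2)
                for t in (q, k, v)]

    def forward(self, txt: torch.Tensor, aud: torch.Tensor):
        qt, kt, vt = self._heads(self.qkv_txt(self.norm_txt(txt)))
        qa, ka, va = self._heads(self.qkv_aud(self.norm_aud(aud)))
        Lt = txt.shape[1]
        # Joint attention: every head sees text and audio tokens together.
        out = F.scaled_dot_product_attention(
            torch.cat([qt, qa], dim=2),
            torch.cat([kt, ka], dim=2),
            torch.cat([vt, va], dim=2),
        )
        out = out.transpose(1, 2).flatten(2)  # (B, Lt+La, D)
        # Split back per modality; residual + modality-specific projection.
        return txt + self.proj_txt(out[:, :Lt]), aud + self.proj_aud(out[:, Lt:])


# Toy forward pass with illustrative shapes.
block = JointAttentionBlock()
txt = torch.randn(2, 16, 512)    # instruction-conditioning tokens
aud = torch.randn(2, 128, 512)   # audio latent tokens being denoised
txt, aud = block(txt, aud)       # each stream keeps its own shape
```

Keeping separate projections while sharing the attention pass is what lets the generator condition locally on the instruction rather than on a single pooled text embedding, which is one plausible reading of how the "localized editing" claim is achieved.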
Problem

Research questions and friction points this paper is trying to address.

Addresses signal degradation in training-free editing and the paired-data scarcity that limits supervised methods
Extends task definitions to five editing operations: addition, replacement, removal, reordering, and attribute modification
Enables precise cross-modal alignment between instructions and acoustic context for localized editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified audio-language model for multi-type editing operations
Scalable data synthesis pipeline with fine-grained event annotations (see the sketch after this list)
Qwen2-Audio encoder with MMDiT generator for cross-modal alignment
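The data synthesis bullet suggests pairs can be built by construction: start from a clip, apply a known edit, and record exactly what changed. A hedged numpy sketch of one such generator for the addition operation follows; the sample rate, gain, and instruction template are assumptions, not details from the paper:

```python
import numpy as np

SR = 16_000  # sample rate in Hz (an assumption; not stated in this summary)


def synthesize_addition_pair(base: np.ndarray, event: np.ndarray,
                             label: str, onset_s: float, gain: float = 0.8):
    """Build one 'addition' pair: the source lacks the event, the target has it.

    Returns (source, target, instruction, annotation). The event-level
    annotation records exactly what was inserted and where, which is what
    enables fine-grained supervision of edit localization.
    """
    source = base.copy()
    target = base.copy()
    start = int(onset_s * SR)
    end = min(start + len(event), len(target))
    target[start:end] += gain * event[: end - start]
    instruction = f"Add a {label} starting at {onset_s:.1f} seconds."
    annotation = {"op": "add", "label": label,
                  "onset_s": onset_s, "offset_s": end / SR}
    return source, target, instruction, annotation


# Toy usage with random noise standing in for real background and event clips.
rng = np.random.default_rng(0)
background = 0.1 * rng.standard_normal(10 * SR)  # 10 s background bed
bark = 0.5 * rng.standard_normal(2 * SR)         # 2 s event clip
src, tgt, instr, ann = synthesize_addition_pair(background, bark, "dog bark", 3.0)
```

Deletion and replacement pairs fall out of the same recipe by swapping the roles of source and target or substituting a different event clip, which is what lets a pipeline like this scale without manual annotation.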
🔎 Similar Papers
No similar papers found.
👥 Authors
Ye Tao
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Xuenan Xu
Shanghai Jiao Tong University
audio generation · audio understanding · speech synthesis
Wen Wu
Shanghai AI Laboratory
Shuai Wang
Nanjing University
Mengyue Wu
Shanghai Jiao Tong University
speech perception and production · affective computing · audio cognition
Chao Zhang
Shanghai AI Laboratory