🤖 AI Summary
Text-guided audio editing requires precise localization of the target content while preserving the rest of the audio, yet existing trained and zero-shot methods suffer from inaccurate localization, semantic misalignment, or reliance on auxiliary annotations, especially in complex scenarios such as overlapping multi-event audio. To address this, we propose an end-to-end rectified flow matching diffusion framework, the first to bring this efficient generative mechanism to text-driven audio editing, enabling fine-grained semantic alignment without subtitles, masks, or iterative optimization. We also construct the first audio-editing benchmark dataset featuring multi-event overlaps and design a text-conditioned flow calibration strategy. Experiments show that our method achieves state-of-the-art performance on the key metrics of editing fidelity, target consistency, and background preservation.
📝 Abstract
Diffusion models have shown remarkable progress in text-to-audio generation. However, text-guided audio editing remains in its early stages. This task modifies the target content within an audio signal while preserving the rest, and thus demands precise localization and faithful editing according to the text prompt. Existing training-based and zero-shot methods, which rely on full captions or costly optimization, often struggle with complex edits or lack practicality. In this work, we propose a novel, efficient end-to-end diffusion framework for audio editing based on rectified flow matching, and construct a dataset of overlapping multi-event audio to support training and benchmarking in complex scenarios. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks, while maintaining competitive editing quality across metrics.
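Neither the summary nor the abstract spells out the training objective, but rectified flow matching has a standard formulation. Below is a minimal PyTorch-style sketch of a text-conditioned rectified-flow loss; `velocity_net` (a velocity-prediction network over audio latents) and `text_emb` (conditioning text embeddings) are hypothetical names used for illustration, not the paper's implementation.

```python
import torch

def rectified_flow_loss(velocity_net, x1, text_emb):
    """Illustrative sketch of the standard rectified flow matching objective.

    x1:       target (edited-audio) latents, shape (B, ...)
    text_emb: conditioning text embeddings, shape (B, D)
    velocity_net(x_t, t, text_emb) -> predicted velocity, same shape as x1

    The model regresses the constant velocity (x1 - x0) along the straight
    path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1.
    """
    x0 = torch.randn_like(x1)                       # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)   # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))        # broadcast t over latent dims
    x_t = (1.0 - t_) * x0 + t_ * x1                 # point on the straight path
    v_target = x1 - x0                              # rectified-flow velocity target
    v_pred = velocity_net(x_t, t, text_emb)         # text-conditioned prediction
    return torch.mean((v_pred - v_target) ** 2)     # MSE on the velocity field
```

At inference, editing would then amount to integrating the learned velocity field under the editing prompt, e.g. with a few Euler steps; the straight transport paths are what make rectified-flow sampling efficient compared with standard diffusion.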