🤖 AI Summary
Text-guided audio editing requires precise localization of the target content while preserving the rest of the audio, yet existing trained and zero-shot methods suffer from inaccurate localization, semantic misalignment, or reliance on auxiliary annotations, especially in complex scenarios such as overlapping multi-event audio. To address this, we propose an end-to-end rectified flow matching diffusion framework, the first to bring this efficient generative mechanism to text-driven audio editing, enabling fine-grained semantic alignment without subtitles, masks, or iterative optimization. We also construct the first audio-editing benchmark dataset featuring multi-event overlaps and design a text-conditioned flow calibration strategy. Experiments show that our method achieves state-of-the-art performance on the key metrics of editing fidelity, target consistency, and background preservation.
📝 Abstract
Diffusion models have shown remarkable progress in text-to-audio generation. However, text-guided audio editing remains in its early stages. This task modifies the target content within an audio signal while preserving the rest, and thus demands precise localization and faithful editing according to the text prompt. Existing training-based and zero-shot methods, which rely on full captions or costly optimization, often struggle with complex edits or lack practicality. In this work, we propose a novel, efficient end-to-end diffusion framework for audio editing based on rectified flow matching, and construct a dataset of overlapping multi-event audio to support training and benchmarking in complex scenarios. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks, while maintaining competitive editing quality across metrics.
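Neither the summary nor the abstract spells out the training objective, but rectified flow matching has a standard formulation. Below is a minimal PyTorch-style sketch of a text-conditioned rectified-flow loss; `velocity_net` (a velocity-prediction network over audio latents) and `text_emb` (conditioning text embeddings) are hypothetical names used for illustration, not the paper's implementation.

```python
import torch

def rectified_flow_loss(velocity_net, x1, text_emb):
    """Illustrative sketch of the standard rectified flow matching objective.

    x1:       target (edited-audio) latents, shape (B, ...)
    text_emb: conditioning text embeddings, shape (B, D)
    velocity_net(x_t, t, text_emb) -> predicted velocity, same shape as x1

    The model regresses the constant velocity (x1 - x0) along the straight
    path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1.
    """
    x0 = torch.randn_like(x1)                       # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)   # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))        # broadcast t over latent dims
    x_t = (1.0 - t_) * x0 + t_ * x1                 # point on the straight path
    v_target = x1 - x0                              # rectified-flow velocity target
    v_pred = velocity_net(x_t, t, text_emb)         # text-conditioned prediction
    return torch.mean((v_pred - v_target) ** 2)     # MSE on the velocity field
```

At inference, editing would then amount to integrating the learned velocity field under the editing prompt, e.g. with a few Euler steps; the straight transport paths are what make rectified-flow sampling efficient compared with standard diffusion.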