RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing

📅 2025-09-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Text-guided audio editing requires precise localization of target content while preserving the rest of the audio, yet existing trained and zero-shot methods suffer from inaccurate localization, semantic misalignment, or reliance on auxiliary annotations—especially in complex scenarios such as overlapping multi-event audio. To address this, we propose an end-to-end rectified flow matching diffusion framework—the first to introduce this efficient generative mechanism into text-driven audio editing—enabling fine-grained semantic alignment without subtitles, masks, or iterative optimization. We construct the first benchmark dataset for audio editing featuring multi-event overlaps and design a text-conditioned flow calibration strategy. Experiments demonstrate that our method achieves state-of-the-art performance across key metrics: editing fidelity, target consistency, and background preservation.

Technology Category

Application Category

📝 Abstract
Diffusion models have shown remarkable progress in text-to-audio generation. However, text-guided audio editing remains in its early stages. This task focuses on modifying the target content within an audio signal while preserving the rest, thus demanding precise localization and faithful editing according to the text prompt. Existing training-based and zero-shot methods that rely on full-caption or costly optimization often struggle with complex editing or lack practicality. In this work, we propose a novel end-to-end efficient rectified flow matching-based diffusion framework for audio editing, and construct a dataset featuring overlapping multi-event audio to support training and benchmarking in complex scenarios. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks, while maintaining competitive editing quality across metrics.
Problem

Research questions and friction points this paper is trying to address.

Text-guided audio editing with precise localization
Faithful semantic alignment without auxiliary captions
Efficient diffusion framework for complex audio scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rectified flow matching diffusion framework
Dataset with overlapping multi-event audio
Semantic alignment without auxiliary captions
🔎 Similar Papers
No similar papers found.
L
Liting Gao
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, United Kingdom
Yi Yuan
Yi Yuan
NetEase Fuxi AI Lab
deep learningcomputer vision
Yaru Chen
Yaru Chen
Centre for Vision Speech and Signal Processing (CVSSP), University of Surrey
Multi-modal learningComputer vision
Y
Yuelan Cheng
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, United Kingdom
Z
Zhenbo Li
College of Information and Electrical Engineering, China Agricultural University, China
J
Juan Wen
College of Information and Electrical Engineering, China Agricultural University, China
S
Shubin Zhang
Fisheries College, Ocean University of China, China
Wenwu Wang
Wenwu Wang
Professor, University of Surrey, UK
signal processingmachine learningmachine listeningaudio/speech/audio-visualmultimodal fusion