SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio editing methods rely either on a full re-description of the target audio or on predefined instruction sets, lacking flexible, fine-grained editing driven by natural language. To address this, the authors propose SAO-Instruct, a free-form text-instruction audio editing model built on Stable Audio Open. They construct a dataset of audio editing triplets (original audio, natural language edit instruction, edited audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, the model generalizes well to real in-the-wild audio and unseen edit instructions, achieves competitive performance on objective metrics, and outperforms other audio editing approaches in a subjective listening study. The code and model weights are publicly released.

📝 Abstract
Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.
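The DDPM-inversion step used to build the editing triplets can be illustrated with a toy NumPy sketch: per-step noise maps are extracted so that replaying the reverse diffusion chain reproduces the input exactly; in the real pipeline, the conditioning is then changed to produce an edited output. Here an oracle denoiser that knows the clean signal stands in for the trained diffusion model, and the schedule, step count, and latent size are illustrative, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20
betas = np.concatenate([[0.0], np.linspace(1e-4, 0.05, T)])  # betas[1..T]
alphas = 1.0 - betas
abar = np.cumprod(alphas)  # \bar{alpha}_t, with abar[0] = 1

x0 = rng.standard_normal(64)  # stand-in for an audio latent

def eps_oracle(x_t, t):
    # Toy "denoiser": recovers the exact noise because it knows x0.
    # In SAO-Instruct this role is played by the text-conditioned model.
    return (x_t - np.sqrt(abar[t]) * x0) / np.sqrt(1.0 - abar[t])

def reverse_mean(x_t, t):
    # DDPM reverse-step mean given the predicted noise
    return (x_t - betas[t] / np.sqrt(1.0 - abar[t]) * eps_oracle(x_t, t)) / np.sqrt(alphas[t])

sigma = np.sqrt(betas)

# Inversion: sample each x_t independently from q(x_t | x0),
# then solve for the noise map z_t that links consecutive steps.
xs = [x0] + [np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * rng.standard_normal(64)
             for t in range(1, T + 1)]
z = [None] * (T + 1)
for t in range(T, 0, -1):
    z[t] = (xs[t - 1] - reverse_mean(xs[t], t)) / sigma[t]

# Replaying the reverse chain from x_T with the stored noise maps
# reproduces x0; editing swaps the conditioning while keeping z_t.
x = xs[T]
for t in range(T, 0, -1):
    x = reverse_mean(x, t) + sigma[t] * z[t]

print(np.allclose(x, x0))  # True: the stored noise maps invert exactly
```

The key property is that the noise maps, not the denoiser, carry the reconstruction: keeping them fixed while changing the text condition biases the edited output toward the structure of the original clip.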
Problem

Research questions and friction points this paper is trying to address.

Editing existing audio with free-form natural language instructions
Overcoming the inflexibility of predefined edit instructions
Building a model that generalizes to real audio and unseen edit instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model edits audio using free-form natural language instructions
Training uses audio triplets created with Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline
Generalizes well to real-world audio and unseen edit instructions
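The triplet format described above can be sketched as a minimal record; the field names and paths are illustrative, not the paper's actual dataset schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EditTriplet:
    """One training example: (input audio, edit instruction, output audio)."""
    input_audio: str   # path to the original clip
    instruction: str   # free-form edit, e.g. "add light rain in the background"
    output_audio: str  # path to the edited clip

triplet = EditTriplet(
    input_audio="dog_bark.wav",
    instruction="make the barking sound more distant",
    output_audio="dog_bark_edited.wav",
)
print(triplet.instruction)
```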