Guiding Audio Editing with Audio Language Model

📅 2025-09-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current audio editing models are constrained by templated instructions and mono-channel processing, so they cannot support declarative stereo editing, in which users simply specify the desired auditory outcome. To address this, we propose SmartDJ, a framework that combines the reasoning capability of an audio language model (ALM) with the generative capacity of a latent diffusion model (LDM). The ALM parses a high-level natural-language instruction and decomposes it into atomic operations such as sound addition, removal, or spatial repositioning; the LDM, trained on synthetically generated data, then executes each operation as a stereo edit. An automated pipeline constructs the paired instructions, atomic operations, and before/after audio needed for scalable data synthesis. Experiments show that SmartDJ outperforms prior audio editing methods in perceptual quality, spatial realism, and semantic alignment. By moving beyond template-based instructions, SmartDJ enables free-form audio editing for immersive applications such as VR/AR and virtual conferencing.

📝 Abstract

Audio editing plays a central role in VR/AR immersion, virtual conferencing, sound design, and other interactive media. However, recent generative audio editing models depend on template-like instruction formats and are restricted to mono-channel audio. These models cannot handle declarative audio editing, where the user declares what the desired outcome should be while leaving the details of the editing operations to the system. We introduce SmartDJ, a novel framework for stereo audio editing that combines the reasoning capability of audio language models with the generative power of latent diffusion. Given a high-level instruction, SmartDJ decomposes it into a sequence of atomic edit operations, such as adding, removing, or spatially relocating events. These operations are then executed by a diffusion model trained to manipulate stereo audio. To support this, we design a data synthesis pipeline that produces paired examples of high-level instructions, atomic edit operations, and audio before and after each edit operation. Experiments demonstrate that SmartDJ achieves superior perceptual quality, spatial realism, and semantic alignment compared to prior audio editing methods. Demos are available at https://zitonglan.github.io/project/smartdj/smartdj.html.
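To make the decompose-then-execute idea concrete, here is a minimal Python sketch. It is not the SmartDJ implementation: the `AtomicEdit` schema, the operation labels, the example instruction, and the hand-written plan are all hypothetical, and a symbolic "scene" (a mapping from sound event to direction) stands in for real stereo audio and for the diffusion editor. It only illustrates how a high-level instruction could map to an ordered list of atomic edits, and how each step yields a before/after pair of the kind the data pipeline is described as producing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical schema: field names and operation labels are illustrative,
# not taken from the SmartDJ paper.
@dataclass(frozen=True)
class AtomicEdit:
    op: str                          # "add", "remove", or "move"
    event: str                       # sound event the operation targets
    direction: Optional[str] = None  # e.g. "left" or "right"; unused for "remove"

Scene = Dict[str, str]  # symbolic stand-in for stereo audio: event -> direction


def apply_edit(scene: Scene, edit: AtomicEdit) -> Scene:
    """Return a new scene with one atomic edit applied (stand-in for the diffusion editor)."""
    updated = dict(scene)
    if edit.op == "add":
        updated[edit.event] = edit.direction or "center"
    elif edit.op == "remove":
        updated.pop(edit.event, None)
    elif edit.op == "move":
        if edit.event not in updated or edit.direction is None:
            raise ValueError(f"cannot move {edit.event!r}")
        updated[edit.event] = edit.direction
    else:
        raise ValueError(f"unknown operation {edit.op!r}")
    return updated


def run_plan(scene: Scene, plan: List[AtomicEdit]) -> Tuple[Scene, List[Tuple[Scene, AtomicEdit, Scene]]]:
    """Apply edits in order, recording a (before, edit, after) record for every step."""
    steps = []
    current = dict(scene)
    for edit in plan:
        after = apply_edit(current, edit)
        steps.append((current, edit, after))
        current = after
    return current, steps


# Made-up example: in the real system the language model would produce the plan.
instruction = "Make this sound like a quiet forest"
plan = [
    AtomicEdit("remove", "car engine"),
    AtomicEdit("add", "birdsong", "left"),
    AtomicEdit("move", "stream", "right"),
]
scene = {"car engine": "center", "stream": "left"}
final_scene, steps = run_plan(scene, plan)
```

Keeping each edit atomic means the executor only ever has to handle one well-defined change at a time, and every intermediate scene is available as a supervised before/after example.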

Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of template-dependent mono-channel audio editing models

Enables declarative audio editing through high-level user instructions

Develops stereo audio editing with spatial realism and semantic alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining audio language models with latent diffusion

Decomposing instructions into atomic stereo edit operations

Using a data synthesis pipeline to train the stereo editing model
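This summary does not say how the synthetic stereo scenes are rendered. As a rough illustration of how a before/after pair for a spatial-relocation edit could be synthesized, the sketch below uses constant-power panning, a standard and much simpler technique than whatever renderer the authors actually use. The function names and the azimuth convention (-90 for hard left, +90 for hard right) are assumptions made for this example.

```python
import math
from typing import List, Tuple

Stereo = List[Tuple[float, float]]  # list of (left, right) samples


def pan_mono_to_stereo(samples: List[float], azimuth_deg: float) -> Stereo:
    """Constant-power pan of a mono signal; -90 is hard left, +90 is hard right."""
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth must be within [-90, 90] degrees")
    # Map azimuth to an angle in [0, pi/2]; cos/sin gains keep L^2 + R^2 constant.
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    gain_left, gain_right = math.cos(theta), math.sin(theta)
    return [(gain_left * s, gain_right * s) for s in samples]


def mix(*tracks: Stereo) -> Stereo:
    """Sum stereo tracks sample by sample; shorter tracks are treated as silent past their end."""
    length = max(len(t) for t in tracks)
    mixed = []
    for i in range(length):
        left = sum(t[i][0] for t in tracks if i < len(t))
        right = sum(t[i][1] for t in tracks if i < len(t))
        mixed.append((left, right))
    return mixed


def relocation_pair(background: Stereo, event: List[float],
                    old_azimuth: float, new_azimuth: float) -> Tuple[Stereo, Stereo]:
    """Build (before, after) mixtures that differ only in where the event is placed."""
    before = mix(background, pan_mono_to_stereo(event, old_azimuth))
    after = mix(background, pan_mono_to_stereo(event, new_azimuth))
    return before, after
```

Because the background track is identical in both mixtures, the only difference between `before` and `after` is the position of the moved event, which is the property a relocation training pair needs.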
