🤖 AI Summary
To address fine-grained single-event audio editing in complex soundscapes where acoustic sources overlap in time, this paper proposes a generative editing method that operates at millisecond resolution. It introduces a “recomposition” paradigm that jointly leverages text instructions, event-level semantic categories, and millisecond-accurate temporal maps as multimodal editing guidance. The architecture is a SoundStream-based encoder-decoder Transformer trained on synthetic paired audio examples, with temporal guidance derived from an “event roll” transcription. Experiments demonstrate high-fidelity performance across three editing operations (deletion, insertion, and enhancement) in realistic, cluttered acoustic environments. Ablation studies confirm that editing quality depends on each component of the edit description (action type, semantic category, and temporal precision), validating the necessity and effectiveness of joint multimodal conditioning for audio editing.
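The summary describes edit guidance built from three parts: an action, an event class, and a millisecond-accurate temporal map. A minimal sketch of how such a description might be assembled is below; the function names, the dictionary layout, and the 10 ms frame grid are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

FRAME_MS = 10  # assumed frame hop for the temporal map (illustrative)

def make_temporal_map(total_ms: int, onset_ms: int, offset_ms: int) -> np.ndarray:
    """Binary activity map: 1.0 on frames where the target event is active."""
    n_frames = total_ms // FRAME_MS
    tmap = np.zeros(n_frames, dtype=np.float32)
    tmap[onset_ms // FRAME_MS : offset_ms // FRAME_MS] = 1.0
    return tmap

def make_edit_description(action: str, event_class: str,
                          total_ms: int, onset_ms: int, offset_ms: int) -> dict:
    """Bundle the three conditioning signals named in the summary."""
    assert action in {"delete", "insert", "enhance"}
    return {
        "text": f"{action} {event_class}",   # e.g. "enhance Door"
        "class": event_class,
        "timing": make_temporal_map(total_ms, onset_ms, offset_ms),
    }

edit = make_edit_description("enhance", "Door", total_ms=4000,
                             onset_ms=1200, offset_ms=1850)
print(edit["text"])               # enhance Door
print(int(edit["timing"].sum()))  # 65 active 10 ms frames
```

In the paper, the temporal map would come from an “event roll” transcription rather than hand-specified onsets; the binary frame grid here just shows the shape of that conditioning signal.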
📝 Abstract
Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes that can delete, insert, and enhance individual sound events based on textual edit descriptions (e.g., “enhance Door”) and a graphical representation of the event timing derived from an “event roll” transcription. The system is an encoder-decoder transformer operating on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit description: action, class, and timing. Our work demonstrates that “recomposition” is an important and practical application.
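The abstract's training data comes from adding isolated sound events to dense real-world backgrounds to form (input, desired output) pairs. A minimal sketch of that pair construction for the three actions is below; the gains, the `boost` factor for enhancement, and the use of noise as stand-in audio are all illustrative assumptions.

```python
import numpy as np

def make_pair(background: np.ndarray, event: np.ndarray,
              onset: int, action: str, boost: float = 2.0):
    """Return (input_mix, target) waveforms for one edit action.

    The isolated event is placed into the background at `onset`;
    input and target differ only in the presence/level of that event.
    """
    placed = np.zeros_like(background)
    placed[onset:onset + len(event)] = event

    if action == "delete":    # input contains the event; target removes it
        return background + placed, background
    if action == "insert":    # input lacks the event; target adds it
        return background, background + placed
    if action == "enhance":   # target amplifies the event relative to input
        return background + placed, background + boost * placed
    raise ValueError(f"unknown action: {action}")

sr = 16000
rng = np.random.default_rng(0)
bg = 0.1 * rng.standard_normal(sr)          # 1 s noisy "dense background"
door = 0.5 * rng.standard_normal(sr // 10)  # 100 ms isolated event
x, y = make_pair(bg, door, onset=4000, action="delete")
```

Each pair would then be encoded with SoundStream and paired with the corresponding edit description; this sketch covers only the waveform-level mixing step.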