EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited temporal controllability and text-audio alignment in instruction-driven autoregressive audio editing. To this end, we propose a cross-attention modulation framework featuring three attention-editing mechanisms—token replacement, attention reweighting, and diffusion-based refinement—integrating Prompt-to-Prompt principles with autoregressive generation. Our method operates atop the pre-trained MusicGen model to enable fine-grained, text-guided editing. We further introduce the first benchmark specifically designed for prompt-driven audio editing under the autoregressive paradigm. Experiments demonstrate that our approach significantly outperforms diffusion-based baselines in melodic, dynamic, and rhythmic control. Both automated metrics and human evaluation confirm superior text-audio alignment and enhanced audio fidelity.

📝 Abstract
In this study, we investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross- and self-attention mechanisms. Integrating a diffusion-based strategy, influenced by Auffusion, we extend the model's functionality to support refinement edits, establishing a baseline for prompt-guided audio editing. Additionally, we introduce an alternative approach by incorporating MUSICGEN, a pre-trained frozen auto-regressive model, and propose three editing mechanisms based on Replacement, Reweighting, and Refinement of the attention scores. We employ commonly used music-specific evaluation metrics and a human study to gauge time-varying controllability, adherence to global text cues, and overall audio realism. The automatic and human evaluations indicate that the proposed combination of prompt-to-prompt guidance with autoregressive generation models significantly outperforms the diffusion-based baseline in terms of melody, dynamics, and tempo of the generated audio. Our code is available at https://github.com/billsioros/EditGen
Problem

Research questions and friction points this paper is trying to address.

Leveraging cross-attention control for efficient audio editing
Extending model functionality to support prompt-guided refinement edits
Improving melody, dynamics, and tempo in generated audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-attention control for auto-regressive audio editing
Prompt-to-Prompt-like editing guided through cross- and self-attention mechanisms
Integration of MUSICGEN for attention-based editing
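As a rough illustration of the Reweighting mechanism listed above (a minimal sketch of the general idea, not the paper's actual implementation), editing can be seen as scaling one prompt token's column in the cross-attention map and renormalising, so generated audio steps attend more (or less) to that token. Function and variable names here are illustrative, not taken from the EditGen codebase:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v, reweight=None):
    """Scaled dot-product cross-attention with optional per-token reweighting.

    `reweight` maps a prompt-token index to a scale factor applied to that
    token's attention column before row renormalisation.
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))  # (T_audio, T_text)
    if reweight:
        for idx, scale in reweight.items():
            attn[:, idx] *= scale
        attn /= attn.sum(axis=-1, keepdims=True)  # rows sum to 1 again
    return attn @ v, attn

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))  # queries for 4 audio steps
k = rng.standard_normal((5, 8))  # keys for 5 prompt tokens
v = rng.standard_normal((5, 8))

_, base = cross_attention(q, k, v)
_, edited = cross_attention(q, k, v, reweight={2: 3.0})
```

After the edit, prompt token 2 receives a strictly larger share of attention at every audio step, which is the lever a reweighting edit uses to amplify a concept in the prompt.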
Vassilis Sioros
Department of Informatics and Telecommunications, UoA
Alexandros Potamianos
National Technical University of Athens
speech processing, natural language processing, signal processing, dialogue
Giorgos Paraskevopoulos
School of Electrical and Computer Engineering, NTUA