TADA! Tuning Audio Diffusion Models through Activation Steering

📅 2026-02-12
🤖 AI Summary
This work addresses the limited understanding of how audio diffusion models internally represent high-level musical semantics, such as instruments, vocals, and genre characteristics, and the resulting lack of precise control over generation. Using activation patching, the authors show that distinct musical concepts are controlled by a small, shared subset of attention layers. Applying Contrastive Activation Addition and Sparse Autoencoders in these layers then enables more precise steering of specific musical elements in the generated audio, such as tempo or mood, which improves both the interpretability and the controllability of diffusion-based audio synthesis.
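To make the idea of activation patching concrete, here is a minimal sketch on a toy residual model. It is not the paper's setup: the `run` function, the per-layer `weights`, and the scalar "prompt" are hypothetical stand-ins for a diffusion model's layers and text conditioning. The sketch caches each layer's output on a "clean" run, substitutes one cached output at a time into a "corrupted" run, and records how much of the clean result each patch restores.

```python
def run(weights, prompt, patch=None):
    """Toy residual model: each layer adds weight * prompt to the hidden state.

    `patch` maps a layer index to a cached layer output that replaces the
    layer's own output for this run (the patching step).
    """
    patch = patch or {}
    hidden, cache = 0.0, []
    for i, w in enumerate(weights):
        out = patch.get(i, w * prompt)  # use the cached output if this layer is patched
        cache.append(out)
        hidden += out
    return hidden, cache


# Hypothetical weights: layer 1 is assumed to carry most of the concept.
weights = [0.0, 2.0, 0.5]

clean, clean_cache = run(weights, prompt=1.0)  # run with the concept present
corrupt, _ = run(weights, prompt=0.0)          # run with the concept absent

# Patch one layer at a time and measure the fraction of the clean output restored.
effects = []
for i in range(len(weights)):
    patched, _ = run(weights, prompt=0.0, patch={i: clean_cache[i]})
    effects.append((patched - corrupt) / (clean - corrupt))

most_important = max(range(len(weights)), key=lambda i: effects[i])
```

In this toy, the layer with the largest restored fraction is the one a patching study would flag as critical for the concept; in a real model the comparison would be made on audio-level metrics rather than a scalar output.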

📝 Abstract
Audio diffusion models can synthesize high-fidelity music from text, yet their internal mechanisms for representing high-level concepts remain poorly understood. In this work, we use activation patching to demonstrate that distinct semantic musical concepts, such as the presence of specific instruments, vocals, or genre characteristics, are controlled by a small, shared subset of attention layers in state-of-the-art audio diffusion architectures. Next, we demonstrate that applying Contrastive Activation Addition and Sparse Autoencoders in these layers enables more precise control over the generated audio, indicating a direct benefit of the specialization phenomenon. By steering activations of the identified layers, we can alter specific musical elements with high precision, such as modulating tempo or changing a track's mood.
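As an illustration of the steering step, the following is a minimal sketch of Contrastive Activation Addition in its commonly described form: a steering vector is the difference between mean activations on concept-positive and concept-negative prompts, and it is added to a layer's activation with a scale factor. The function names, the scale `alpha`, and the toy three-dimensional activations are assumptions for illustration and are not taken from the paper.

```python
from statistics import fmean


def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations: concept-positive minus concept-negative."""
    dim = len(pos_acts[0])
    return [
        fmean(a[i] for a in pos_acts) - fmean(a[i] for a in neg_acts)
        for i in range(dim)
    ]


def steer(activation, vector, alpha):
    """Add the scaled steering vector to a single layer activation."""
    return [x + alpha * v for x, v in zip(activation, vector)]


# Toy activations (hypothetical values) collected at one attention layer.
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # e.g. prompts that mention a piano
neg = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]  # e.g. the same prompts without it

vec = steering_vector(pos, neg)
steered = steer([0.5, 0.5, 0.5], vec, alpha=2.0)
```

Only the first coordinate differs between the two prompt sets here, so the steering vector is nonzero only in that coordinate and the other coordinates of the activation pass through unchanged. In practice the activations would be read from, and written back to, the layers identified by patching, and the sign and size of `alpha` would set the direction and strength of the edit.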
Problem

Research questions and friction points this paper is trying to address.

audio diffusion models
semantic musical concepts
activation steering
high-level representation
precise control
Innovation

Methods, ideas, or system contributions that make the work stand out.

activation steering
audio diffusion models
activation patching
contrastive activation addition
sparse autoencoders