Diffusion Timbre Transfer Via Mutual Information Guided Inpainting

📅 2026-01-03
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the problem of timbre transfer in music audio, aiming to transform the timbre of one instrument into another while preserving the original melody and rhythmic structure. Building upon a pre-trained latent diffusion model, the authors propose a lightweight, training-free inference-time method featuring two key innovations: mutual information-guided dimension-wise noise injection in the latent space and an early reverse diffusion step clamping mechanism. The approach further integrates text/audio conditioning techniques such as CLAP for enhanced control. Experimental results demonstrate that the method achieves high-quality timbre conversion while effectively maintaining musical structure, thereby validating the efficacy of inference-time control strategies for style transfer in audio generation.
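The first of the two mechanisms described above selects latent dimensions by how informative they are of instrument identity and injects noise only there. The paper does not publish code here, so the following is a minimal hypothetical sketch: mutual information between each latent channel and instrument labels is estimated with a simple histogram estimator, and Gaussian noise is added only on the top-k channels. All function names, shapes, and the histogram estimator are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def mi_per_channel(latents, labels, bins=16):
    """Histogram-based estimate of mutual information between each
    latent channel and discrete instrument labels (illustrative only).

    latents: (n_samples, n_channels) array
    labels:  (n_samples,) integer instrument labels
    """
    n, c = latents.shape
    classes = np.unique(labels)
    mis = np.zeros(c)
    for ch in range(c):
        x = latents[:, ch]
        edges = np.histogram_bin_edges(x, bins=bins)
        # Joint histogram over (channel value bin, instrument class).
        joint = np.zeros((bins, len(classes)))
        for j, cls in enumerate(classes):
            joint[:, j], _ = np.histogram(x[labels == cls], bins=edges)
        joint /= joint.sum()
        px = joint.sum(axis=1, keepdims=True)   # marginal over bins
        py = joint.sum(axis=0, keepdims=True)   # marginal over classes
        nz = joint > 0
        mis[ch] = (joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum()
    return mis

def inject_noise(latent, mi, top_k=8, sigma=1.0, rng=None):
    """Add Gaussian noise only on the top-k most identity-informative
    channels of a (n_channels, time) latent, leaving the rest intact."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = np.argsort(mi)[-top_k:]
    noisy = latent.copy()
    noisy[idx] += sigma * rng.standard_normal((top_k,) + latent.shape[1:])
    return noisy
```

The intuition being sketched: channels carrying timbre information are perturbed so the diffusion model re-synthesizes them, while low-MI channels (assumed to carry structure) pass through untouched.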

📝 Abstract
We study timbre transfer as an inference-time editing problem for music audio. Starting from a strong pre-trained latent diffusion model, we introduce a lightweight procedure that requires no additional training: (i) a dimension-wise noise injection that targets latent channels most informative of instrument identity, and (ii) an early-step clamping mechanism that re-imposes the input's melodic and rhythmic structure during reverse diffusion. The method operates directly on audio latents and is compatible with text/audio conditioning (e.g., CLAP). We discuss design choices, analyze trade-offs between timbral change and structural preservation, and show that simple inference-time controls can meaningfully steer pre-trained models for style-transfer use cases.
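The abstract's second mechanism, early-step clamping, can be pictured as an inpainting-style reverse loop: during the earliest reverse-diffusion steps, structure-carrying latent positions are reset to the source latent noised to the current level. The sketch below is schematic and makes assumptions not stated in the abstract: `denoise_step` and `noise_to_level` are placeholder callables standing in for the model's reverse update and forward-noising operator, and `structure_mask` is a hypothetical boolean mask over latent positions.

```python
import numpy as np

def reverse_with_clamp(x_T, source_latent, denoise_step, noise_to_level,
                       num_steps=50, clamp_steps=10, structure_mask=None):
    """Schematic reverse-diffusion loop with early-step clamping.

    For the first `clamp_steps` of the reverse trajectory, latent positions
    selected by `structure_mask` (all positions if None) are reset to the
    source latent noised to the current level, re-imposing the input's
    melodic/rhythmic structure. Names are placeholders, not the paper's API.
    """
    x = x_T
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)                      # one reverse update
        if num_steps - t < clamp_steps:             # only the earliest steps
            ref = noise_to_level(source_latent, t - 1)  # source at step t-1
            x = ref if structure_mask is None else np.where(structure_mask, ref, x)
    return x
```

After the clamp window closes, the remaining steps run unconstrained, which is where the timbral change induced by the injected noise is free to emerge.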
Problem

Research questions and friction points this paper is trying to address.

timbre transfer
music audio editing
inference-time control
latent diffusion model
structural preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

timbre transfer
latent diffusion model
inference-time editing
mutual information
audio inpainting
Ching Ho Lee
Queen Mary University of London
J. Nistal
Sony Computer Science Laboratories, Paris, France
Stefan Lattner
Sony CSL Paris (Music Team)
Deep Learning, Audio Generation, AI-assisted Music Production, Music Information Retrieval
Marco Pasini
Queen Mary University of London, Sony Computer Science Laboratories, Paris, France
George Fazekas
Reader in Semantic Audio, Queen Mary University of London
semantic audio, music information retrieval, semantic web, music emotion recognition, deep learning