🤖 AI Summary
This work addresses the problem of timbre transfer in music audio, aiming to transform the timbre of one instrument into another while preserving the original melody and rhythmic structure. Building upon a pre-trained latent diffusion model, the authors propose a lightweight, training-free inference-time method featuring two key innovations: mutual information-guided dimension-wise noise injection in the latent space and an early reverse diffusion step clamping mechanism. The approach further integrates text/audio conditioning techniques such as CLAP for enhanced control. Experimental results demonstrate that the method achieves high-quality timbre conversion while effectively maintaining musical structure, thereby validating the efficacy of inference-time control strategies for style transfer in audio generation.
📝 Abstract
We study timbre transfer as an inference-time editing problem for music audio. Starting from a strong pre-trained latent diffusion model, we introduce a lightweight procedure that requires no additional training: (i) a dimension-wise noise injection that targets latent channels most informative of instrument identity, and (ii) an early-step clamping mechanism that re-imposes the input's melodic and rhythmic structure during reverse diffusion. The method operates directly on audio latents and is compatible with text/audio conditioning (e.g., CLAP). We discuss design choices,analyze trade-offs between timbral change and structural preservation, and show that simple inference-time controls can meaningfully steer pre-trained models for style-transfer use cases.