Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Conventional multi-instrument timbre transfer first separates the sources and then transforms each stem's timbre individually, which tends to introduce separation artifacts and timbral inconsistencies between voices. The authors propose MixtureTT, which they describe as, to the best of their knowledge, the first system for per-stem timbre transfer directly from a polyphonic mixture. Given the mixture and one timbre reference per target voice, MixtureTT uses a joint stem diffusion Transformer that models per-stem content dependencies and cross-stem harmonic relationships within a shared diffusion process, transferring all voices to their specified instruments together. According to the authors, this joint modeling removes the cascaded separation error of pipeline approaches, yields more coherent multi-stem output, and reduces inference cost by a factor equal to the number of stems. On the SATB choral dataset, MixtureTT outperforms single-instrument baselines on both objective and subjective metrics, and the joint setting consistently exceeds an equivalent single-stem ablation, which supports the case for cross-stem modeling.
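The difference between joint and per-stem modeling can be illustrated with a toy attention layer. The sketch below is not the MixtureTT architecture, whose details are not given here; it assumes, purely for illustration, that each stem is a short sequence of latent frames and that cross-stem modeling means letting attention span all stems at once. The names `attention`, `stem_block_mask`, and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(tokens, mask=None):
    # Single-head self-attention with identity projections (illustration only).
    dim = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(dim)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ tokens

def stem_block_mask(num_stems, frames):
    # True where the query token and the key token belong to the same stem.
    stem_id = np.repeat(np.arange(num_stems), frames)
    return stem_id[:, None] == stem_id[None, :]

rng = np.random.default_rng(0)
num_stems, frames, dim = 4, 6, 8  # assumed sizes; 4 stems as in SATB
latents = rng.normal(size=(num_stems, frames, dim))
flat = latents.reshape(num_stems * frames, dim)  # all stems in one sequence
mask = stem_block_mask(num_stems, frames)

joint = attention(flat)              # every token attends to every stem
independent = attention(flat, mask)  # each stem attends only to itself

# Perturb stem 0 only, then measure how much the output for stem 1 moves.
perturbed = flat.copy()
perturbed[:frames] += 1.0
stem1 = slice(frames, 2 * frames)
joint_shift = np.abs(attention(perturbed) - joint)[stem1].max()
indep_shift = np.abs(attention(perturbed, mask) - independent)[stem1].max()

print("joint attention, stem 1 change:", joint_shift)
print("per-stem attention, stem 1 change:", indep_shift)
```

Under joint attention a change in one voice propagates to the others, which is the kind of pathway a model would need to keep harmonically related stems coherent; with the block mask the stems are processed as if they were independent single-voice problems.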

📝 Abstract

Timbre transfer aims to modify the timbral identity of a musical recording while preserving the original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipelines that propagate source separation artifacts and produce incoherent timbres across stems. This paper proposes MixtureTT, to the best of our knowledge the first system for flexible per-stem timbre transfer directly from a polyphonic mixture. Given a mixture and a separate timbre reference for each target voice, MixtureTT jointly transfers all stems to the specified instruments through a shared diffusion process. By modeling both per-stem content dependencies and cross-stem harmonic relationships, the proposed joint stem diffusion transformer eliminates cascaded separation error, reduces inference cost by a factor equal to the number of stems, and yields more coherent multi-stem outputs. Despite operating under a strictly harder input condition, MixtureTT outperforms single-instrument baselines on both objective and subjective metrics in evaluations on the SATB choral dataset, demonstrating the need for dedicated multi-instrument timbre transfer over naive separate-then-transfer pipelines. The proposed joint setting also consistently exceeds an equivalent single-stem ablation, confirming that cross-stem modeling is essential for mixture-level timbre transfer.
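The stated inference saving can be read as a count of sampling runs. The sketch below is a back-of-the-envelope illustration under assumptions that are not in the abstract: both approaches use the same number of denoising steps, cost is measured in denoiser forward passes only, and the cost of the separation front end is ignored. The function name and the step count of 50 are hypothetical.

```python
def denoiser_calls(num_stems: int, num_steps: int, joint: bool) -> int:
    """Count denoiser forward passes needed to transfer every stem.

    A separate-then-transfer pipeline runs one full sampling loop per stem;
    a joint model runs a single sampling loop that covers all stems.
    """
    return num_steps if joint else num_stems * num_steps

num_stems = 4   # soprano, alto, tenor, bass
num_steps = 50  # assumed sampler length, not taken from the paper

pipeline_calls = denoiser_calls(num_stems, num_steps, joint=False)
joint_calls = denoiser_calls(num_stems, num_steps, joint=True)

print(pipeline_calls, joint_calls, pipeline_calls // joint_calls)
```

For four stems the pipeline needs four times as many forward passes, matching the "factor equal to the number of stems" claim. The count says nothing about the cost of each pass, which depends on how the joint model handles the additional stems.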

Problem

Research questions and friction points this paper is trying to address.

timbre transfer

polyphonic mixtures

multi-instrument

stem coherence

source separation artifacts

Innovation

Methods, ideas, or system contributions that make the work stand out.

timbre transfer

diffusion model

polyphonic mixture