Stemphonic: All-at-once Flexible Multi-stem Music Generation

📅 2026-02-10
🤖 AI Summary
This work addresses the longstanding trade-off between flexibility and inference efficiency in music stem generation: fixed-structure approaches lack compositional freedom, while sequential per-stem methods suffer from slow generation. We propose a unified diffusion/flow-matching framework that enables synchronous, single-pass generation of an arbitrary number of high-quality, time-aligned stems by sharing a common initial latent noise and conditioning each stem on its dedicated text prompt. Our method is the first to achieve efficient parallel generation with variable stem counts, supporting fine-grained control over stem activity and conditional synthesis, thereby significantly enhancing creative flexibility. Experiments demonstrate superior generation quality across multiple open-source stem datasets and a 25%–50% speedup in full-mix inference compared to existing approaches.
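
The mechanism is compact enough to sketch. Below is a minimal, hypothetical illustration of the one-pass inference idea, assuming a flow-matching denoiser sampled with plain Euler steps; `denoiser`, `encode_text`, and the latent shapes are placeholder names, not Stemphonic's actual API:

```python
import torch

@torch.no_grad()
def generate_stems(denoiser, encode_text, prompts, latent_shape, num_steps=50):
    """Generate len(prompts) synchronized stems in a single batched pass."""
    n = len(prompts)
    # One shared initial noise latent, replicated across the stem batch so
    # every stem starts from the same point and stays time-aligned.
    shared_noise = torch.randn(1, *latent_shape)
    x = shared_noise.expand(n, *latent_shape).clone()
    cond = encode_text(prompts)   # stem-specific text conditioning, one row per stem
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((n,), step * dt)
        v = denoiser(x, t, cond)  # predicted velocity for each stem
        x = x + v * dt            # Euler step from noise toward data
    return x                      # n time-aligned stem latents, ready to decode
```

Note that the stem count is just the batch dimension `n`, which is what makes an arbitrary number of stems possible without changing the architecture.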

📝 Abstract
Music stem generation, the task of producing musically synchronized and isolated instrument audio clips, offers the potential for greater user control and better alignment with musician workflows than conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further extend our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls, empowering users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating full-mix generation by 25–50%. Demos at: https://stemphonic-demo.vercel.app.
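
The training recipe stated above (each stem a batch element, synchronized stems grouped, one noise latent shared per group) maps naturally onto a conditional flow-matching objective. A hedged sketch under those assumptions, with `denoiser` and `encode_text` as hypothetical stand-ins and a linear noise-to-data path; sharing the timestep within a group is our assumption, chosen to mirror the lockstep denoising used at inference:

```python
import torch
import torch.nn.functional as F

def stem_group_loss(denoiser, encode_text, stems, prompts):
    """Flow-matching loss for one group of synchronized stems.

    stems:   (n, C, T) time-aligned stem latents from the same song
    prompts: list of n stem-specific text descriptions
    """
    n = stems.shape[0]
    # Shared noise across the synchronized group: the ingredient that lets a
    # shared initial latent produce aligned stems at inference time.
    noise = torch.randn(1, *stems.shape[1:]).expand_as(stems)
    t = torch.rand(1)                    # one timestep for the group (assumed)
    tb = t.view(1, 1, 1)                 # broadcast over (n, C, T)
    x_t = (1 - tb) * noise + tb * stems  # linear interpolation path
    target_v = stems - noise             # flow-matching velocity target
    pred_v = denoiser(x_t, t.expand(n), encode_text(prompts))
    return F.mse_loss(pred_v, target_v)
```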
Problem

Research questions and friction points this paper is trying to address.

music stem generation
multi-stem synthesis
flexible music generation
synchronized audio generation
efficient inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-stem generation
diffusion model
synchronized audio synthesis
one-pass inference
conditional music generation