Automatic Music Mixing using a Generative Model of Effect Embeddings

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automated mixing methods formulate mixing as a deterministic regression task, overlooking its inherent subjectivity and the multiplicity of valid solutions. This paper proposes MEGAMI, the first framework to model music mixing as conditional distribution generation: given unprocessed tracks, it generates samples from the distribution of professional mixes. A permutation-equivariant architecture makes the system track-agnostic and order-robust, accepting arbitrarily many unlabeled tracks in any order, and a shared effects processor is conditioned on generated per-track effect embeddings. A domain adaptation mechanism additionally allows joint training on dry and wet recordings. Experiments show that MEGAMI outperforms prior methods in distributional fidelity (quantified via statistical divergence metrics), approaches professional human mixes in subjective audio quality, and generalizes across diverse musical genres.

📝 Abstract
Music mixing involves combining individual tracks into a cohesive mixture, a task characterized by subjectivity where multiple valid solutions exist for the same input. Existing automatic mixing systems treat this task as a deterministic regression problem, thus ignoring this multiplicity of solutions. Here we introduce MEGAMI (Multitrack Embedding Generative Auto MIxing), a generative framework that models the conditional distribution of professional mixes given unprocessed tracks. MEGAMI uses a track-agnostic effects processor conditioned on per-track generated embeddings, handles arbitrary unlabeled tracks through a permutation-equivariant architecture, and enables training on both dry and wet recordings via domain adaptation. Our objective evaluation using distributional metrics shows consistent improvements over existing methods, while listening tests indicate performances approaching human-level quality across diverse musical genres.
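The permutation-equivariant track handling described in the abstract can be illustrated with a minimal DeepSets-style sketch. The shapes, weights, and mean pooling below are illustrative assumptions, not MEGAMI's actual architecture: every track passes through the same shared transform together with an order-invariant summary of all tracks, so permuting the input tracks simply permutes the per-track embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared weights (hypothetical shapes): the same function processes every
# track, making the model agnostic to track identity and count.
W_track = rng.standard_normal((8, 8))
W_context = rng.standard_normal((8, 8))

def mix_embeddings(tracks):
    """Permutation-equivariant track encoder (DeepSets-style sketch).

    tracks: (n_tracks, 8) array of per-track features.
    Each output embedding depends on its own track plus a symmetric
    (order-invariant) summary of all tracks.
    """
    context = tracks.mean(axis=0)  # symmetric pooling over tracks
    return np.tanh(tracks @ W_track + context @ W_context)

tracks = rng.standard_normal((4, 8))
perm = np.array([2, 0, 3, 1])

out = mix_embeddings(tracks)
out_perm = mix_embeddings(tracks[perm])

# Equivariance: permuting the inputs permutes the outputs the same way.
assert np.allclose(out[perm], out_perm)
```

Because the pooling step is invariant to track order and the per-track transform is shared, any number of unlabeled tracks can be fed in any order, matching the abstract's "arbitrary unlabeled tracks" claim.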
Problem

Research questions and friction points this paper is trying to address.

Music mixing is subjective: multiple valid mixes exist for the same input tracks
Existing automatic mixing systems treat mixing as deterministic regression, ignoring this multiplicity of solutions
Handling arbitrarily many unlabeled tracks supplied in any order
Jointly exploiting both dry and wet recordings during training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative framework models professional mix distributions
Permutation-equivariant architecture handles unlabeled track inputs
Domain adaptation enables joint training on both dry and wet recordings
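The domain-adaptation bullet is stated only at a high level. One common mechanism for training a shared model across two data domains (here, dry vs. wet recordings) is a gradient-reversal layer; the sketch below is an illustrative assumption, not necessarily the mechanism MEGAMI uses.

```python
import numpy as np

# Gradient-reversal layer (Ganin & Lempitsky, 2015): forward pass is the
# identity; backward pass negates the gradient, so a domain classifier
# trained on top pushes the shared encoder toward domain-invariant
# features. NOTE: illustrative assumption -- the paper only states that
# domain adaptation enables training on dry and wet recordings.

def grad_reverse_forward(x):
    # Features pass through unchanged.
    return x

def grad_reverse_backward(grad_output, lam=1.0):
    # The domain classifier's gradient is scaled and negated before
    # flowing back into the shared encoder.
    return -lam * grad_output

features = np.array([0.2, -1.5, 0.7])
grad_from_domain_clf = np.array([0.1, 0.1, -0.3])

fwd = grad_reverse_forward(features)
bwd = grad_reverse_backward(grad_from_domain_clf, lam=0.5)
```

With this trick, dry and wet recordings can share one encoder while the reversed gradient discourages the encoder from encoding which domain a track came from.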
Eloi Moliner
Acoustics Lab, DICE, Aalto University, Espoo, Finland
Marco A. Martínez-Ramírez
Sony AI
Junghyun Koo
Sony AI / Sony Research
Intelligent Music Production · Controllable Generative Models · Source Separation
Wei-Hsiang Liao
Sony AI
K. Cheuk
Sony AI
Joan Serrà
Sony AI
Representation Learning · Generative Models · Machine Listening · Music Information Retrieval
Vesa Valimaki
Acoustics Lab, DICE, Aalto University, Espoo, Finland
Yuki Mitsufuji
Distinguished Engineer, Sony
Machine Learning · Audio · Source Separation · Music Technology · Spatial Audio