MAGE: Modality-Agnostic Music Generation and Editing

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal music systems are brittle when prompts are ambiguous, missing, or temporally misaligned, and most are built for either generation or editing rather than both. This work proposes MAGE, a unified framework for music generation and mixture-grounded editing in a single continuous latent space. Its core is the Controlled Multimodal FluxFormer, a flow-based Transformer that learns synthesis and editing trajectories conditioned on any available subset of input modalities. Key components are Audio-Visual Nexus Alignment, which selects temporally consistent visual evidence for the audio timeline; cross-gated modulation, which suppresses unsupported audio components instead of injecting new ones; and a dynamic modality-masking curriculum, which improves robustness to missing modalities. On the MUSIC benchmark, MAGE handles both multimodal-guided generation and targeted editing within a single model, achieving competitive quality with a lightweight, flexible interface.

📝 Abstract

Multimodal music creation requires models that can both generate audio from high-level cues and edit existing mixtures in a targeted manner. Yet most multimodal music systems are built for a single task and a fixed prompting interface, making their conditioning brittle when guidance is ambiguous, temporally misaligned, or partially missing. Common additive fusion or feature concatenation further weakens cross-modal grounding, often causing prompt drift and spurious musical content during generation and editing. We propose MAGE, a modality-agnostic framework that unifies multimodal music generation and mixture-grounded editing within a single continuous latent formulation. At its core, MAGE uses a Controlled Multimodal FluxFormer, a flow-based Transformer that learns controllable latent trajectories for synthesis and editing under any available subset of conditions. To improve grounding, we introduce Audio-Visual Nexus Alignment to select temporally consistent visual evidence for the audio timeline, and a cross-gated modulation mechanism that applies multiplicative control from aligned visual and textual cues to the audio latents, suppressing unsupported components rather than injecting them. Finally, we train with a dynamic modality-masking curriculum that exposes the model to text-only, visual-only, joint multimodal, and mixture-guided settings, enabling robust inference under missing modalities without training separate models. Experiments on the MUSIC benchmark show that MAGE supports effective multimodal-guided music generation and targeted editing, achieving competitive quality while offering a lightweight and flexible interface tailored to practical music workflows.
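To make the contrast between additive fusion and multiplicative control concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the gate form (a product of per-modality sigmoids), the function names, and the toy vectors are all assumptions chosen for illustration. The point it shows is that a gate in (0, 1) can only attenuate an audio latent component, whereas additive fusion can introduce content the audio never contained.

```python
import math
from typing import List, Optional


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def additive_fusion(audio: List[float], cond: List[float]) -> List[float]:
    """Baseline: conditioning is added, so it can inject new content."""
    return [a + c for a, c in zip(audio, cond)]


def cross_gated_modulation(
    audio: List[float],
    visual_logits: Optional[List[float]] = None,
    text_logits: Optional[List[float]] = None,
) -> List[float]:
    """Hypothetical gate: product of sigmoids from the available cues.

    Each factor lies in (0, 1), so |output| <= |input| per component.
    A missing modality (None) contributes a neutral gate of 1.0.
    """
    out = []
    for i, a in enumerate(audio):
        gate = 1.0
        if visual_logits is not None:
            gate *= sigmoid(visual_logits[i])
        if text_logits is not None:
            gate *= sigmoid(text_logits[i])
        out.append(a * gate)
    return out


# Toy latents: component 0 is supported by both cues, 1 and 2 by only one.
audio = [1.0, -2.0, 0.5]
visual = [4.0, -4.0, 0.0]
text = [4.0, 4.0, -4.0]

gated = cross_gated_modulation(audio, visual, text)
added = additive_fusion([0.0, 0.0, 0.0], visual)  # silence plus a cue
```

In this toy setup, the component backed by both cues passes almost unchanged, the components lacking support from one cue are strongly attenuated, and a silent audio latent stays silent under gating while additive fusion turns it into nonzero content.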

Problem

Research questions and friction points this paper is trying to address.

multimodal music generation

mixture editing

modality alignment

prompt drift

missing modalities

Innovation

Methods, ideas, or system contributions that make the work stand out.

modality-agnostic

multimodal music generation

mixture-guided editing
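
The dynamic modality-masking curriculum named in the abstract can be sketched as a sampler that picks one of the four training settings per batch and hides the other inputs. Everything specific here is an assumption for illustration: the linear schedule from joint-only to uniform sampling, the choice of which inputs each setting keeps (in particular for the mixture-guided setting), and all function names.

```python
import random
from typing import Dict, Optional, Tuple

# Assumed mapping from training setting to the inputs it keeps.
MODES = {
    "text_only": {"text"},
    "visual_only": {"visual"},
    "joint": {"text", "visual"},
    "mixture_guided": {"text", "visual", "mixture"},
}
INPUTS = ("text", "visual", "mixture")


def mode_weights(progress: float) -> Dict[str, float]:
    """Assumed schedule: joint-only at progress 0, uniform at progress 1."""
    p = min(max(progress, 0.0), 1.0)
    uniform = 1.0 / len(MODES)
    return {
        m: (1.0 - p) * (1.0 if m == "joint" else 0.0) + p * uniform
        for m in MODES
    }


def sample_mask(progress: float, rng: random.Random) -> Tuple[str, Dict[str, bool]]:
    """Draw a training setting and return it with a keep/drop mask."""
    weights = mode_weights(progress)
    mode = rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
    return mode, {name: name in MODES[mode] for name in INPUTS}


def apply_mask(batch: Dict[str, object], mask: Dict[str, bool]) -> Dict[str, Optional[object]]:
    """Replace dropped inputs with None so the model sees them as missing."""
    return {k: (v if mask[k] else None) for k, v in batch.items()}


rng = random.Random(0)
batch = {"text": "solo cello", "visual": "frames", "mixture": "mix.wav"}
early_mode, early_mask = sample_mask(0.0, rng)   # start of training
masked = apply_mask(batch, early_mask)
```

Because a single model is trained on every setting, inference with a missing modality corresponds to a case already seen during training, which is the robustness argument the abstract makes.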