🤖 AI Summary
Existing symbolic music generation models treat note attributes as a unidirectional sequential dependency, yet empirical evidence reveals no strict temporal or hierarchical constraints among them: attributes are inherently unordered and concurrent. Method: We propose Amadeus, a hybrid framework that decouples sequence and attribute generation. It employs an autoregressive model to generate the note-sequence backbone, while a bidirectional discrete diffusion model concurrently models all attributes of each note. To enhance representation learning, we introduce the contrastive-learning-based Music Latent Space Discriminability Enhancement Strategy (MLSDES) and the attention-driven Conditional Information Enhancement Module (CIEM). Contribution/Results: Evaluated on AMD (Amadeus MIDI Dataset), the largest open-source MIDI dataset to date, Amadeus surpasses state-of-the-art methods in both unconditional and text-conditioned generation across multiple metrics, achieves at least 4× faster inference, and supports training-free, fine-grained attribute control.
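The sketch below is a minimal illustration, under our own naming and sizing assumptions rather than the paper's released code, of how such a decoupled decoder could be wired: a stand-in autoregressive backbone proposes a latent for the next note, a bidirectional denoiser refines all of that note's attribute tokens in parallel, and training-free control is obtained by pinning chosen attribute tokens at every denoising step. `NoteBackbone`, `AttributeDenoiser`, and all hyperparameters are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the paper does not specify these.
D_MODEL, N_ATTRS, VOCAB, MASK_ID, T_STEPS = 256, 6, 512, 0, 8

class NoteBackbone(nn.Module):
    """Stand-in autoregressive model that proposes a latent vector for the next note."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(D_MODEL, D_MODEL, batch_first=True)

    def forward(self, prev_notes):               # (B, L, D) -> (B, D)
        out, _ = self.rnn(prev_notes)
        return out[:, -1]

class AttributeDenoiser(nn.Module):
    """Stand-in bidirectional denoiser over the attribute tokens of one note."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D_MODEL)
        self.mix = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, attr_tokens, note_latent):  # (B, A), (B, D) -> (B, A, VOCAB)
        h = self.emb(attr_tokens) + note_latent.unsqueeze(1)
        return self.head(self.mix(h))

@torch.no_grad()
def generate_note(backbone, denoiser, prev_notes, pinned=None):
    """One decoding step: AR backbone -> parallel attribute denoising.

    `pinned` maps attribute index -> token id, illustrating training-free
    control by clamping chosen attributes at every denoising step."""
    latent = backbone(prev_notes)
    attrs = torch.full((prev_notes.size(0), N_ATTRS), MASK_ID, dtype=torch.long)
    for _ in range(T_STEPS):                      # iterative refinement of all attributes
        attrs = denoiser(attrs, latent).argmax(-1)
        for idx, tok in (pinned or {}).items():   # re-impose user-specified attributes
            attrs[:, idx] = tok
    return attrs

attrs = generate_note(NoteBackbone(), AttributeDenoiser(),
                      torch.randn(1, 4, D_MODEL), pinned={0: 60})
print(attrs.shape)  # torch.Size([1, 6])
```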
📝 Abstract
Existing state-of-the-art symbolic music generation models predominantly adopt autoregressive or hierarchical autoregressive architectures, modelling symbolic music as a sequence of attribute tokens with unidirectional temporal dependencies, under the assumption of a fixed, strict dependency structure among these attributes. However, we observe that using different attributes as the initial token in these models leads to comparable performance. This suggests that the attributes of a musical note are, in essence, a concurrent and unordered set rather than a temporally dependent sequence. Based on this insight, we introduce Amadeus, a novel symbolic music generation framework. Amadeus adopts a two-level architecture: an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. To enhance performance, we propose the Music Latent Space Discriminability Enhancement Strategy (MLSDES), which incorporates contrastive learning constraints that amplify the discriminability of intermediate music representations. The Conditional Information Enhancement Module (CIEM) simultaneously strengthens note latent vector representations via attention mechanisms, enabling more precise note decoding. We conduct extensive experiments on unconditional and text-conditioned generation tasks. Amadeus significantly outperforms SOTA models across multiple metrics while achieving at least a 4$\times$ speed-up. Furthermore, we demonstrate the feasibility of training-free, fine-grained note attribute control with our model. To explore the upper performance bound of the Amadeus architecture, we compile the largest open-source symbolic music dataset to date, AMD (Amadeus MIDI Dataset), supporting both pre-training and fine-tuning.
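As a rough illustration of the MLSDES idea, the snippet below applies a generic InfoNCE-style contrastive constraint to intermediate note representations so that different notes become more separable in latent space; the exact loss, views, and negatives used in the paper may differ, and `info_nce` with its arguments is our own placeholder.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss over two views of note representations.

    z1, z2: (B, D) intermediate latents for the same notes under two views
    (e.g. dropout-perturbed encoder passes); other notes in the batch act as
    negatives. This is a generic formulation, not the paper's exact MLSDES loss."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(z1.size(0))            # matching indices are the positives
    return F.cross_entropy(logits, targets)

# Toy usage: pull views of the same note together, push different notes apart.
z = torch.randn(8, 256)
loss = info_nce(z + 0.01 * torch.randn_like(z), z + 0.01 * torch.randn_like(z))
print(loss.item())
```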