Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing symbolic music generation models treat note attributes as a unidirectional sequential dependency, yet empirical evidence reveals no strict temporal or hierarchical constraints among them: attributes are inherently unordered and concurrent. Method: We propose Amadeus, a hybrid framework that decouples sequence and attribute generation. It employs an autoregressive model to generate the note-sequence backbone, while a bidirectional discrete diffusion model concurrently models all note attributes. To enhance representation learning, we introduce the Music Latent Space Discriminability Enhancement Strategy (MLSDES), a contrastive-learning constraint, and the Conditional Information Enhancement Module (CIEM), an attention-based module for more precise note decoding. Contribution/Results: Evaluated on AMD, the largest open-source MIDI dataset, Amadeus surpasses state-of-the-art methods in both unconditional and text-conditioned generation across multiple metrics, achieves at least 4× faster inference, and supports training-free, fine-grained attribute control.
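To make the "unordered, concurrent attributes" idea concrete, here is a toy sketch (hypothetical names, not the authors' implementation): a note starts with all attributes masked, and a bidirectional denoiser reveals them one at a time in an arbitrary order, each step conditioning on everything already revealed, rather than following a fixed pitch-then-duration-then-velocity chain.

```python
import random

# Toy sketch of bidirectional attribute denoising (illustrative only).
ATTRS = ["pitch", "duration", "velocity", "instrument"]
MASK = "<mask>"

def denoise_step(note, predict):
    """Unmask one attribute chosen among the still-masked positions,
    conditioning on ALL currently revealed attributes (bidirectional),
    instead of only on a fixed left-to-right prefix."""
    masked = [a for a in ATTRS if note[a] is MASK]
    if not masked:
        return note
    attr = random.choice(masked)   # reveal order is not fixed
    note[attr] = predict(attr, note)
    return note

def generate_note(predict):
    """Start fully masked; after len(ATTRS) steps every attribute is set."""
    note = {a: MASK for a in ATTRS}
    for _ in range(ATTRS.__len__()):
        denoise_step(note, predict)
    return note
```

A dummy predictor (e.g. `lambda attr, note: 60`) suffices to run the loop; in the paper's setting the predictor would be the discrete diffusion model conditioned on the autoregressive note latent.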

📝 Abstract
Existing state-of-the-art symbolic music generation models predominantly adopt autoregressive or hierarchical autoregressive architectures, modelling symbolic music as a sequence of attribute tokens with unidirectional temporal dependencies, under the assumption of a fixed, strict dependency structure among these attributes. However, we observe that using different attributes as the initial token in these models leads to comparable performance. This suggests that the attributes of a musical note are, in essence, a concurrent and unordered set, rather than a temporally dependent sequence. Based on this insight, we introduce Amadeus, a novel symbolic music generation framework. Amadeus adopts a two-level architecture: an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. To enhance performance, we propose the Music Latent Space Discriminability Enhancement Strategy (MLSDES), incorporating contrastive learning constraints that amplify discriminability of intermediate music representations. The Conditional Information Enhancement Module (CIEM) simultaneously strengthens note latent vector representation via attention mechanisms, enabling more precise note decoding. We conduct extensive experiments on unconditional and text-conditioned generation tasks. Amadeus significantly outperforms SOTA models across multiple metrics while achieving at least a 4× speed-up. Furthermore, we demonstrate training-free, fine-grained note attribute control feasibility using our model. To explore the upper performance bound of the Amadeus architecture, we compile the largest open-source symbolic music dataset to date, AMD (Amadeus MIDI Dataset), supporting both pre-training and fine-tuning.
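The abstract does not spell out the MLSDES loss, but "contrastive learning constraints that amplify discriminability" is commonly realized with an InfoNCE-style objective: pull a representation toward its positive pair and push it away from negatives. A minimal sketch, assuming cosine similarity and a temperature `tau` (both assumptions, not taken from the paper):

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss over plain Python vectors.
    Low loss when anchor is most similar to its positive pair."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    def cos(u, v):
        return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    # Logit 0 is the positive pair; the rest are negatives.
    logits = [cos(anchor, positive) / tau] + [cos(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[0] / sum(exps))
```

With an anchor identical to its positive and orthogonal negatives, the loss is near zero; swapping positive and negative drives it up, which is the discriminability pressure the strategy relies on.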
Problem

Research questions and friction points this paper is trying to address.

Modelling concurrent, unordered note attributes instead of assuming fixed sequential dependencies
Improving both the quality and the speed of symbolic music generation
Enabling training-free, fine-grained control over note attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional diffusion model for unordered note attributes
Contrastive learning enhances music representation discriminability
Attention mechanism strengthens conditional note decoding
Hongju Su
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
Ke Li
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
Lan Yang
Edwin & Florence Skinner Professor, Electrical & Systems Engineering, Washington Univ. in St Louis
resonator, laser, nonlinear optics, sensing, non-Hermitian physics
Honggang Zhang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
Yi-Zhe Song
SketchX Lab, CVSSP, University of Surrey
Computer Vision, Computer Graphics, Machine Learning, Artificial Intelligence