🤖 AI Summary
This work addresses the problem of generating multi-minute, multi-track, structurally controllable MIDI music from natural language descriptions, supporting arbitrary time signatures and keys as well as post-hoc editing. The proposed framework comprises three core components: (1) a GPT-based autoregressive Transformer that decodes prompts into JSON-structured composition parameters, mapping textual semantics to musical parameters; (2) a music-aware genetic algorithm with musically meaningful mutations and a dynamic fitness function based on the normal distribution, whose target values adapt to emotional parameters, to keep harmony, melody, and motifs structurally coherent and editable; and (3) a time-signature-agnostic, multi-order Markov model for percussion generation, which increases rhythmic diversity and metrical robustness. Experiments report improvements over baselines on objective metrics (including harmonic consistency, rhythmic complexity, and structural coherence) as well as in human evaluations. According to the authors, this is the first approach to enable high-fidelity, cross-style, cross-time-signature, and interactively editable text-to-MIDI generation.
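To make the time-signature-agnostic percussion idea concrete, here is a minimal sketch, not the authors' implementation. It walks a first-order Markov chain (the summary describes a multi-order model) over a grid whose length is derived from the time signature. The transition table `TRANSITIONS`, the helper names `steps_for` and `generate_pattern`, and the sixteenth-note grid are all assumptions made for illustration.

```python
import random

# Hypothetical transition probabilities between drum events (illustrative values only).
TRANSITIONS = {
    "kick": {"hat": 0.6, "snare": 0.3, "rest": 0.1},
    "snare": {"hat": 0.5, "kick": 0.4, "rest": 0.1},
    "hat": {"kick": 0.3, "snare": 0.3, "hat": 0.3, "rest": 0.1},
    "rest": {"kick": 0.5, "hat": 0.5},
}


def steps_for(numerator, denominator, subdivision=16):
    """Number of grid steps in one bar, e.g. 7/8 on a sixteenth-note grid gives 14."""
    return numerator * subdivision // denominator


def generate_pattern(transitions, n_steps, start="kick", seed=None):
    """Walk a first-order Markov chain and return one event per grid step."""
    rng = random.Random(seed)
    state = start
    pattern = [state]
    for _ in range(n_steps - 1):
        options = transitions[state]
        # Sample the next event in proportion to its transition probability.
        state = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        pattern.append(state)
    return pattern


bar_7_8 = generate_pattern(TRANSITIONS, steps_for(7, 8), seed=0)
bar_5_4 = generate_pattern(TRANSITIONS, steps_for(5, 4), seed=0)
```

Because the bar length is computed from the numerator and denominator rather than hard-coded, the same chain can fill a 7/8 or a 5/4 bar without any change to the transition table.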
📝 Abstract
This work introduces the M6(GPT)3 composer system, which generates complete, multi-minute MIDI compositions with complex structures, in any time signature, from natural-language descriptions. The system uses an autoregressive transformer language model to map natural language prompts to composition parameters in JSON format. The defined structure includes time signature, scales, chord progressions, and valence-arousal values, from which accompaniment, melody, bass, motif, and percussion tracks are created. We propose a genetic algorithm for the generation of melodic elements. The algorithm incorporates musically meaningful mutations and a fitness function based on the normal distribution and predefined musical feature values. These values evolve adaptively, influenced by emotional parameters and distinct playing styles. Percussion in any time signature is generated with probabilistic methods, including Markov chains. Through both human and objective evaluations, we demonstrate that our music generation approach outperforms baselines on specific, musically meaningful metrics, offering a viable alternative to purely neural network-based systems.
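As an illustration of a fitness function based on the normal distribution, the following sketch scores a candidate melody by how close its features lie to target values, with the targets shifted by an emotional parameter. It is a simplified assumption, not the system's actual code: the two features, the linear mapping in `adapt_targets`, the use of arousal alone, and all numeric constants are hypothetical.

```python
import math


def gaussian_score(value, target, sigma):
    """Unnormalised normal density: 1.0 at the target, decaying with distance."""
    return math.exp(-0.5 * ((value - target) / sigma) ** 2)


def melody_features(pitches):
    """Two simple features of a melody given as MIDI pitch numbers (needs >= 2 notes)."""
    intervals = [abs(b - a) for a, b in zip(pitches, pitches[1:])]
    return {
        "mean_interval": sum(intervals) / len(intervals),
        "pitch_range": max(pitches) - min(pitches),
    }


def adapt_targets(arousal):
    """Hypothetical mapping: higher arousal (0..1) favours larger leaps and a wider range.

    Returns {feature: (target mean, standard deviation)}.
    """
    return {
        "mean_interval": (2.0 + 3.0 * arousal, 1.5),
        "pitch_range": (7.0 + 10.0 * arousal, 4.0),
    }


def fitness(pitches, arousal):
    """Average Gaussian score over all features; the result lies in (0, 1]."""
    feats = melody_features(pitches)
    targets = adapt_targets(arousal)
    scores = [gaussian_score(feats[name], mu, sigma)
              for name, (mu, sigma) in targets.items()]
    return sum(scores) / len(scores)


smooth = [60, 62, 64, 66, 67]   # mostly stepwise motion
jumpy = [60, 72, 60, 72]        # repeated octave leaps
calm_smooth = fitness(smooth, arousal=0.0)
calm_jumpy = fitness(jumpy, arousal=0.0)
```

In a genetic algorithm, a score of this kind would rank candidates after each round of mutation. Changing the emotional parameter moves the targets, so the same melody can be fit under one emotional setting and unfit under another.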