🤖 AI Summary
Symbolic music generation struggles to achieve long-term structural coherence and controllable motive development. To address this, we propose a three-stage controllable generation framework: phrase generation, phrase refinement, and motive-driven dynamic phrase selection—enabling progressive evolution from an initial motive to a structured melody. Methodologically, we design a corruption-refinement self-supervised training strategy, develop a hierarchical Transformer-based architecture, and introduce a motive smoothness metric to quantitatively assess long-range consistency. The framework offers semi-interpretable representations and supports interactive editing. Experiments on short-sequence datasets demonstrate significant improvements over state-of-the-art Transformer models. Generated melodies exhibit enhanced motive coherence, improved long-range structural plausibility, and greater user controllability.
📝 Abstract
Transformer models have made great strides in generating symbolically represented music with local coherence. However, controlling the development of motifs in a structured way within a global form remains an open research area. One reason for this challenge is the note-by-note autoregressive generation of such models, which leaves them unable to correct themselves after deviating from the motif. In addition, their structural performance on datasets with shorter durations has not been studied in the literature. In this study, we propose Yin-Yang, a framework consisting of a phrase generator, a phrase refiner, and a phrase selector model for developing motifs into melodies with long-term structure and controllability. The phrase refiner is trained with a novel corruption-refinement strategy, which allows it to produce melodic and rhythmic variations of an original motif at generation time, thereby rectifying deviations of the phrase generator. We also introduce a new objective evaluation metric that quantifies how smoothly the motif manifests itself within the piece. Evaluation results show that our model achieves better performance than state-of-the-art Transformer models while having the advantage of being controllable and making the generated musical structure semi-interpretable, paving the way for musical analysis. Our code and demo page can be found at https://github.com/keshavbhandari/yinyang.
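The corruption-refinement training idea can be illustrated with a toy data-preparation step. This is a minimal sketch under our own assumptions, not the paper's actual implementation: the `corrupt_phrase` function, the `(pitch, duration)` event encoding, and the specific perturbation choices are all hypothetical stand-ins for whatever corruption operations and tokenization the authors use. The refiner would then be trained to map each corrupted phrase back to its original, learning to produce controlled variations at generation time.

```python
import random

def corrupt_phrase(phrase, pitch_prob=0.3, rhythm_prob=0.2, seed=None):
    """Randomly perturb a phrase of (MIDI pitch, duration) events.

    A self-supervised (input, target) pair is formed by pairing the
    corrupted phrase with the original, so the refiner learns to
    restore (or smoothly vary) the motif. This is an illustrative
    corruption scheme, not the paper's exact one.
    """
    rng = random.Random(seed)
    corrupted = []
    for pitch, duration in phrase:
        if rng.random() < pitch_prob:
            # Shift pitch by up to two semitones, clamped to MIDI range.
            pitch = max(0, min(127, pitch + rng.choice([-2, -1, 1, 2])))
        if rng.random() < rhythm_prob:
            # Halve or double the duration to vary the rhythm.
            duration = duration * rng.choice([0.5, 2.0])
        corrupted.append((pitch, duration))
    return corrupted

# Hypothetical motif: C4, D4, E4, G4 with simple durations (in beats).
motif = [(60, 1.0), (62, 0.5), (64, 0.5), (67, 2.0)]
training_pair = (corrupt_phrase(motif, seed=0), motif)  # (input, target)
```

With zero corruption probabilities the function is the identity, which makes it easy to sanity-check; in practice the corruption rates would be tuned so the refiner sees variations of realistic difficulty.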