🤖 AI Summary
Symbolic piano music generation lacks scalable training signals that capture subjective musical quality. Method: This work fine-tunes a symbolic music generation model end to end with reinforcement learning (RL) guided by audio-domain aesthetic scores, specifically Meta Audiobox Aesthetics scores computed on audio renderings of the generated MIDI, optimized via Group Relative Policy Optimization (GRPO). Contribution/Results: The fine-tuned model shows measurable shifts in multiple low-level musical features and higher mean ratings in a preliminary subjective listening study (N=14). However, over-optimization sharply reduces generation diversity, revealing a trade-off between aesthetic quality and diversity in aesthetics-guided fine-tuning. The study establishes a cross-modal paradigm for symbolic music generation driven by aesthetic feedback on rendered audio, and offers empirical insight into the design and limitations of human-aligned RL objectives in generative music systems.
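To make the cross-modal reward loop concrete, here is a minimal Python sketch: render a generated MIDI file to audio, score the audio with an aesthetics predictor, and aggregate the scores into a scalar reward. The helper names are hypothetical stand-ins, not the authors' code; a real pipeline might use FluidSynth for rendering and Meta's audiobox-aesthetics package for scoring, and averaging the aesthetic axes is an assumption.

```python
# Minimal sketch of the cross-modal reward loop (not the authors' implementation).
# Helper names are hypothetical; a real pipeline might use FluidSynth for
# synthesis and the audiobox-aesthetics package for scoring.
import statistics
from typing import Sequence

def render_to_wav(midi_path: str, wav_path: str) -> str:
    """Synthesize a MIDI file to audio. Stub: e.g., invoke FluidSynth here."""
    raise NotImplementedError

def aesthetic_axes(wav_path: str) -> Sequence[float]:
    """Score rendered audio with an aesthetics predictor (e.g., Audiobox
    Aesthetics, which rates several quality axes). Stub."""
    raise NotImplementedError

def reward(midi_path: str) -> float:
    """Scalar RL reward for one generated piece: render, score, aggregate.
    Averaging the axes is an assumption; the paper may combine them differently."""
    wav = render_to_wav(midi_path, midi_path.replace(".mid", ".wav"))
    return statistics.fmean(aesthetic_axes(wav))
```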
📝 Abstract
Recent work has proposed training machine learning models to predict aesthetic ratings for music audio. Our work explores whether such models can be used to finetune a symbolic music generation system with reinforcement learning, and what effect this has on the system outputs. To test this, we use group relative policy optimization to finetune a piano MIDI model with Meta Audiobox Aesthetics ratings of audio-rendered outputs as the reward. We find that this optimization has effects on multiple low-level features of the generated outputs, and improves the average subjective ratings in a preliminary listening study with 14 participants. We also find that over-optimization dramatically reduces diversity of model outputs.
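For concreteness, the snippet below sketches the group-relative advantage at the core of GRPO under its usual formulation: rewards are standardized within each group of completions sampled for the same prompt. The reward values are toy numbers standing in for aesthetic scores of rendered outputs; this is a minimal illustration, not the paper's implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize rewards within each group of samples from the same prompt:
    A_i = (r_i - mean(group)) / std(group). These advantages weight the
    clipped policy-gradient update in GRPO."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: aesthetic rewards for 2 prompts x 4 sampled continuations each.
r = torch.tensor([[6.2, 5.8, 7.1, 6.5],
                  [4.9, 5.3, 5.0, 5.6]])
print(grpo_advantages(r))  # each row has ~zero mean and unit variance
```

Using the group mean as the baseline lets GRPO dispense with a learned value model, which is convenient when each reward requires an expensive step such as rendering and scoring audio.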