SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton

📅 2026-04-28
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the imbalance between structural control and multi-track orchestration complexity in symphonic music generation by proposing a three-dimensional hierarchical generative framework. By decoupling the bar, track, and event axes, the method supports fine-grained, controllable long-sequence generation. Key contributions include conditioning on a beat-quantized multi-voice harmony skeleton, a three-dimensional cascading decoder architecture, reinforcement learning with Group Relative Policy Optimization (GRPO) driven by a cross-modal audio-perceptual reward, and a dissonance-averse sampling algorithm applied at inference. Objective evaluations show that both reinforcement learning and dissonance-averse sampling improve harmonic cleanliness while maintaining melodic expression, and subjective evaluations show that SymphonyGen outperforms baselines in musicality and preference.
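To make the efficiency argument for a three-axis decomposition concrete, here is a rough, illustrative count of pairwise attention interactions. It assumes (this is not stated in the source) that a flat model attends over every event in one sequence, while a cascaded model attends across bars, then across tracks within each bar, then across events within each bar-track cell. The function names and the example sizes are hypothetical.

```python
def flat_pairs(bars, tracks, events):
    """Pairwise interactions if every event sits in one long 1D sequence."""
    n = bars * tracks * events
    return n * n


def cascaded_pairs(bars, tracks, events):
    """Pairwise interactions under an ASSUMED bar -> track -> event cascade."""
    bar_level = bars * bars                       # attention across bars
    track_level = bars * tracks * tracks          # across tracks, per bar
    event_level = bars * tracks * events * events  # across events, per cell
    return bar_level + track_level + event_level


# Hypothetical piece: 64 bars, 16 tracks, 32 events per bar-track cell.
flat = flat_pairs(64, 16, 32)
cascaded = cascaded_pairs(64, 16, 32)
print(flat, cascaded, flat // cascaded)
```

Under these assumptions the cascaded count is roughly three orders of magnitude smaller, which is the kind of scaling gap the summary alludes to; the real architecture may factor the axes differently.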
📝 Abstract
Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a "complexity-control imbalance", in which scaling bottlenecks limit long-term granular steerability. We present SymphonyGen, a 3D hierarchical framework for contemporary cinematic orchestration. SymphonyGen employs a cascading decoder architecture that decomposes the Bar, Track, and Event axes, improving computational efficiency and scalability over conventional 1D or 2D models. We introduce "short-score" conditioning via a beat-quantized multi-voice harmony skeleton, enabling outline control while preserving textural diversity. The model is further refined using Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward, aligning symbolic output with modern acoustic expectations. Additionally, we implement a dissonance-averse sampling algorithm to suppress unintended tonal clashes during inference. Objective evaluations show that both reinforcement learning and dissonance-averse sampling effectively enhance harmonic cleanliness while maintaining melodic expression. Subjective evaluations demonstrate that SymphonyGen outperforms baselines in musicality and preference for orchestral music generation. Demo page: https://symphonygen.github.io/
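The core of GRPO, as commonly described, is that each sampled output is scored relative to the other outputs drawn for the same prompt, so no separate value network is needed. A minimal sketch of that group-relative advantage follows. The reward values are made-up stand-ins for the audio-perceptual reward, and the use of the population standard deviation and the `eps` term are assumptions of this sketch, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four generations sampled from the
# same harmony-skeleton prompt.
scores = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(scores)
print(advantages)
```

Generations that score above their group's mean receive a positive advantage and are reinforced; those below the mean are discouraged.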
Problem

Research questions and friction points this paper is trying to address.

symphonic music generation
complexity-control imbalance
orchestration
hierarchical structure
harmonic control
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D hierarchical generation
harmony skeleton conditioning
Group Relative Policy Optimization (GRPO)
dissonance-averse sampling
symbolic orchestral music generation
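The source does not spell out how dissonance-averse sampling works. One plausible reading is a logit penalty on candidate pitches that clash with notes already sounding. The sketch below illustrates that idea only: the choice of clashing intervals (minor second and major seventh as pitch-class distances), the penalty size, and all function names are assumptions.

```python
import math
import random

# ASSUMED set of clashing pitch-class intervals: minor second, major seventh.
DISSONANT_INTERVALS = {1, 11}


def penalize_dissonance(logits, sounding_pitches, penalty=4.0):
    """Lower the logit of any candidate pitch that clashes with a sounding pitch."""
    adjusted = dict(logits)
    for pitch in adjusted:
        if any((pitch - s) % 12 in DISSONANT_INTERVALS for s in sounding_pitches):
            adjusted[pitch] -= penalty
    return adjusted


def sample_pitch(logits, rng):
    """Softmax-sample one pitch from a {pitch: logit} mapping."""
    pitches = list(logits)
    peak = max(logits.values())  # subtract the max for numerical stability
    weights = [math.exp(logits[p] - peak) for p in pitches]
    return rng.choices(pitches, weights=weights, k=1)[0]


# Hypothetical candidates (MIDI numbers): C4, C#4, E4, G4, all equally likely.
logits = {60: 1.0, 61: 1.0, 64: 1.0, 67: 1.0}
# Other tracks are holding C3 and C5, so C#4 forms a minor second against them.
adjusted = penalize_dissonance(logits, sounding_pitches=[48, 72])
chosen = sample_pitch(adjusted, random.Random(0))
print(adjusted, chosen)
```

A soft penalty, rather than a hard mask, keeps deliberate dissonance possible when the model is strongly confident, which fits the stated goal of suppressing only unintended clashes.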
Xuzheng He
Department of AI Music and Music Information Technology, Central Conservatory of Music
Nan Nan
Frontier Institute of Science and Technology, and Interdisciplinary Research Center of Frontier Science and Technology, Xi’an Jiaotong University
Zhilin Wang
University of Science and Technology of China
Language Model, Reinforcement Learning, AI4Music
Ziyue Kang
Frontier Institute of Science and Technology, and Interdisciplinary Research Center of Frontier Science and Technology, Xi’an Jiaotong University
Zhuoru Mo
Shenzhen University
Ao Li
Frontier Institute of Science and Technology, and Interdisciplinary Research Center of Frontier Science and Technology, Xi’an Jiaotong University
Yu Pan
Department of AI Music and Music Information Technology, Central Conservatory of Music
Xiaobing Li
University of Wisconsin-Madison; SUNY College of Optometry
Saccade, Attention, Decision Making
Feng Yu
University of Exeter
Efficient AI, Continual Learning, Federated Learning, Foundation Model
Xiaohong Guan
Department of AI Music and Music Information Technology, Central Conservatory of Music