SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing

2025-09-30
Citations: 0
Influential: 0
AI Summary
Current symbolic music Transformer models face a trade-off between inference speed and generation quality in real-time improvisation and human-AI co-creation: acceleration techniques such as embedding pooling degrade musicality, while Byte-Pair Encoding (BPE), though effective on single-track piano data, performs markedly worse in multi-track settings. This paper presents the first systematic investigation of BPE's generalizability to multi-track symbolic music. We propose Attribute-Specialized Key-Value Head Sharing (AS-KVHS), a mechanism adapted to music's structured symbolic representation that speeds up autoregressive generation within the Transformer architecture by sharing key-value projections across attribute-specific attention heads. In objective evaluations, the method achieves about 30% inference speedup with only a negligible (about 0.4%) quality drop, and subjective listening tests show slight improvements. This eases the quality-efficiency trade-off in real-time composition. We also release SAGE-Music as an open-source benchmark.

๐Ÿ“ Abstract
Low-latency symbolic music generation is essential for real-time improvisation and human-AI co-creation. Existing transformer-based models, however, face a trade-off between inference speed and musical quality. Traditional acceleration techniques such as embedding pooling significantly degrade quality, while recently proposed Byte Pair Encoding (BPE) methods, though effective on single-track piano data, suffer large performance drops in multi-track settings, as revealed by our analysis. We propose Attribute-Specialized Key-Value Head Sharing (AS-KVHS), adapted to music's structured symbolic representation, achieving about 30% inference speedup with only a negligible (about 0.4%) quality drop in objective evaluations and slight improvements in subjective listening tests. Our main contributions are (1) the first systematic study of BPE's generalizability in multi-track symbolic music, and (2) the introduction of AS-KVHS for low-latency symbolic music generation. Beyond these, we also release SAGE-Music, an open-source benchmark that matches or surpasses state-of-the-art models in generation quality.
Problem

Research questions and friction points this paper is trying to address.

Addressing speed-quality trade-off in transformer music generation
Solving performance degradation in multi-track BPE methods
Enabling low-latency symbolic music for real-time applications
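
For readers unfamiliar with BPE in this setting, the following is a minimal sketch of a single BPE merge step on a symbolic music token sequence. It is not the tokenizer studied in the paper: the token names (`Pitch_60`, `Dur_4`) and the `+` joining convention are hypothetical and chosen only for illustration. Merging frequent adjacent pairs shortens the sequence, so an autoregressive model needs fewer decoding steps.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + "+" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical single-track sequence: alternating pitch and duration tokens.
tokens = ["Pitch_60", "Dur_4", "Pitch_64", "Dur_4", "Pitch_60", "Dur_4"]
pair = most_frequent_pair(tokens)
shorter = merge_pair(tokens, pair)
```

Here the pair (`Pitch_60`, `Dur_4`) occurs twice, so it is merged and the sequence shrinks from six tokens to four. In a multi-track stream, tokens from different instruments are interleaved, which changes which pairs are adjacent; the abstract reports that BPE methods suffer large performance drops in that setting.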
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attribute-Specialized Key-Value Head Sharing technique
Adapted to structured symbolic music representation
Achieves 30% speedup with negligible quality drop
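
The general idea of key-value head sharing can be sketched as follows. This is an illustrative toy in plain Python, not the authors' AS-KVHS implementation: the attribute groups (`pitch`, `duration`), the head-to-group mapping, and all function names are assumptions made for this example. Several query heads read from one shared key-value cache per group, so fewer keys and values need to be stored and loaded at each decoding step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention over cached positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def shared_kv_attention(queries, kv_cache, head_to_group):
    """Each query head attends over the KV cache of its (hypothetical) group."""
    return [attend(q, *kv_cache[g]) for q, g in zip(queries, head_to_group)]


def kv_cache_floats(num_kv_heads, head_dim, seq_len):
    """Floats stored per layer: keys plus values for every cached position."""
    return 2 * num_kv_heads * head_dim * seq_len


# Hypothetical mapping: four query heads share two attribute-level KV heads.
HEAD_TO_GROUP = ["pitch", "pitch", "duration", "duration"]
kv_cache = {
    "pitch": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),
    "duration": ([[1.0, 1.0], [0.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]),
}
queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5]]
outputs = shared_kv_attention(queries, kv_cache, HEAD_TO_GROUP)

full = kv_cache_floats(num_kv_heads=4, head_dim=2, seq_len=2)
shared = kv_cache_floats(num_kv_heads=2, head_dim=2, seq_len=2)
```

In this toy configuration the shared cache holds half as many floats as one with a separate key-value head per query head. That ratio is a property of the example only and should not be read as the source of the roughly 30% speedup reported in the abstract.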
Jiaye Tan
University of Michigan, Ann Arbor, MI, USA
Haonan Luo
University of Michigan, Ann Arbor, MI, USA
Linfeng Song
University of Pennsylvania, Philadelphia, PA, USA
Shuaiqi Chen
University of Waterloo, Waterloo, ON, Canada
Yishan Lyu
University of Michigan, Ann Arbor, MI, USA
Zian Zhong
University of Michigan, Ann Arbor, MI, USA
Roujia Wang
University of Michigan, Ann Arbor, MI, USA
Daniel Jiang
Carnegie Mellon University
Haoran Zhang
University of Michigan, Ann Arbor, MI, USA
Jiaming Bai
University of Chinese Academy of Social Sciences, Beijing, China
Haoran Cheng
Zhejiang University
Deep Learning, Computer Vision
Q. Vera Liao
University of Michigan, Ann Arbor, MI, USA
Hao-Wen Dong
University of Michigan
Music Generation, Music Technology, Audio Synthesis, Video Editing, Multimodal AI