AI Summary
In symbolic music generation, fine-grained tokenization gives high-fidelity modeling at the cost of computational efficiency, whereas compact (compound) tokenization speeds up decoding but hinders the modeling of intra-token dependencies. To address this trade-off, we propose a delay-based token scheduling mechanism (DP) that autoregressively unfolds compound tokens across decoding steps, enabling explicit modeling of internal token structure without additional parameters. Because it only changes the order in which sub-tokens are scheduled, DP integrates into existing representation frameworks. Experiments on symbolic orchestral MIDI datasets show that the method improves musical structural coherence and generation quality over standard compound tokenization and narrows the performance gap with fine-grained tokenization. DP achieves this while preserving decoding efficiency, thus reconciling high-fidelity modeling with scalable inference.
Abstract
Symbolic music generation faces a fundamental trade-off between efficiency and quality. Fine-grained tokenizations achieve strong coherence but incur long sequences and high complexity, while compact tokenizations improve efficiency at the expense of intra-token dependencies. To address this, we adapt a delay-based scheduling mechanism (DP) that expands compound-like tokens across decoding steps, enabling autoregressive modeling of intra-token dependencies while preserving efficiency. Notably, DP is a lightweight strategy that introduces no additional parameters and can be seamlessly integrated into existing representations. Experiments on symbolic orchestral MIDI datasets show that our method improves all metrics over standard compound tokenizations and narrows the gap to fine-grained tokenizations.
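To make the idea of expanding compound tokens across decoding steps concrete, the following is a minimal sketch of a generic delay schedule, not the authors' implementation. It assumes each compound token is a tuple of K fields (here pitch, duration, velocity, chosen only for illustration) and shifts field j by j steps, so that field j of a note is emitted after fields 0..j-1 of the same note. The names `PAD`, `apply_delay`, and `undo_delay` are hypothetical, and the paper's actual schedule may differ.

```python
PAD = -1  # hypothetical padding id for slots that hold no field yet


def apply_delay(tokens, pad=PAD):
    """Shift field j of every compound token by j decoding steps.

    tokens: list of T compound tokens, each a tuple of K field ids.
    Returns a list of T + K - 1 rows, each a tuple of K ids.
    """
    if not tokens:
        return []
    num_fields = len(tokens[0])
    num_steps = len(tokens) + num_fields - 1
    grid = [[pad] * num_fields for _ in range(num_steps)]
    for i, token in enumerate(tokens):
        for j, value in enumerate(token):
            # Field j of token i is scheduled at step i + j, so it can
            # condition on fields 0..j-1 of the same token, which were
            # emitted at earlier steps.
            grid[i + j][j] = value
    return [tuple(row) for row in grid]


def undo_delay(delayed):
    """Invert apply_delay, recovering the original compound tokens."""
    if not delayed:
        return []
    num_fields = len(delayed[0])
    num_tokens = len(delayed) - num_fields + 1
    return [
        tuple(delayed[i + j][j] for j in range(num_fields))
        for i in range(num_tokens)
    ]


# Two notes with assumed fields (pitch, duration, velocity).
notes = [(60, 4, 80), (64, 2, 90)]
delayed = apply_delay(notes)
restored = undo_delay(delayed)
```

Under this schedule the sequence grows from T to T + K - 1 steps, rather than to T × K as it would if every field became its own token, which is one way to read the claim that intra-token dependencies are modeled while efficiency is preserved.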