🤖 AI Summary
Current end-to-end automatic music generation systems lack support for iterative human–machine interaction, hindering computer-assisted composition. This paper proposes a personalized, multi-track, long-context symbolic music infilling method designed for human–AI co-creation on edge devices, enabling low-latency and high-consistency real-time collaboration. Our approach introduces three key contributions: (1) MIDI-RWKV, the first RWKV-7 linear-architecture-based model tailored for efficient modeling of multi-track MIDI sequences; (2) a state-level initialization fine-tuning strategy that achieves personalized adaptation with minimal training samples; and (3) a lightweight edge deployment framework. Quantitative and qualitative evaluations demonstrate significant improvements over baselines in musical coherence, expressiveness, and responsiveness. The model, training code, and inference toolkit are publicly released to ensure reproducibility and facilitate personalized music completion applications.
📝 Abstract
Existing work in automatic music generation has primarily focused on end-to-end systems that produce complete compositions or continuations. However, because musical composition is typically an iterative process, such systems make it difficult to engage in the back-and-forth between human and machine that is essential to computer-assisted creativity. In this study, we address the task of personalizable, multi-track, long-context, and controllable symbolic music infilling to enhance the process of computer-assisted composition. We present MIDI-RWKV, a novel model based on the RWKV-7 linear architecture, to enable efficient and coherent musical co-creation on edge devices. We also demonstrate that MIDI-RWKV admits an effective method of fine-tuning its initial state for personalization in the very-low-sample regime. We evaluate MIDI-RWKV and its state tuning on several quantitative and qualitative metrics, and release model weights and code at https://github.com/christianazinn/MIDI-RWKV.
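The core idea behind state tuning is that a recurrent model's initial hidden state is itself a small trainable parameter: freezing all weights and optimizing only that state adapts the model with very few samples. The sketch below illustrates this on a toy linear recurrence; it is a hypothetical, minimal illustration (not the MIDI-RWKV or RWKV-7 implementation), and all names, dimensions, and the tiny "personalization" dataset are invented for the example:

```python
import numpy as np

# Toy illustration of state-level fine-tuning: the "model" weights
# A (recurrence), B (input projection), and C (readout) are frozen;
# only the initial hidden state h0 is optimized on a tiny dataset.
rng = np.random.default_rng(0)
d = 4                               # hidden state size (illustrative)
A = 0.3 * rng.normal(size=(d, d))   # frozen recurrence matrix
B = rng.normal(size=(d, 1))         # frozen input projection
C = rng.normal(size=(1, d))         # frozen readout

def forward(h0, xs):
    """Run the linear recurrence h <- A h + B x over a sequence."""
    h = h0
    for x in xs:
        h = A @ h + B * x
    return float(C @ h)

def loss(h0, data):
    return sum((forward(h0, xs) - y) ** 2 for xs, y in data)

# Tiny "personalization" set: (input sequence, target output) pairs.
data = [([0.1, 0.2], 1.0), ([0.3], 0.5)]

h0 = np.zeros((d, 1))
loss_before = loss(h0, data)

lr = 0.05
for _ in range(500):
    grad = np.zeros_like(h0)
    for xs, y in data:
        err = forward(h0, xs) - y
        # The recurrence is linear, so d(output)/d(h0) = C @ A^len(xs).
        grad += 2 * err * (C @ np.linalg.matrix_power(A, len(xs))).T
    h0 -= lr * grad / len(data)     # gradient step on h0 only

loss_after = loss(h0, data)
```

Because gradients flow only into `h0`, the number of trained parameters equals the hidden state size, which is what makes this kind of adaptation feasible in the very-low-sample regime; in a real RWKV-style model the same principle applies per-layer to the recurrent state, with the transformer-equivalent weights left untouched.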