🤖 AI Summary
Traditional MIDI-to-audio piano synthesis models exhibit limited generalisation, struggling to adapt to diverse MIDI sources, musical styles, and recording environments. To address this, we propose the first end-to-end neural codec language model for piano synthesis, built upon the VALLE framework, which jointly encodes discretised MIDI and audio tokens and incorporates a reference-audio conditioning mechanism to explicitly model performance style and enable strong cross-style and cross-device generalisation. Trained on a large-scale piano performance dataset, our model achieves over a 75% reduction in Fréchet Audio Distance (FAD) on both the ATEPP and Maestro benchmarks, and significantly outperforms the baseline in a subjective listening test (202 votes vs. 58). Our key contributions are: (i) the first application of neural codec language modelling to end-to-end MIDI-to-audio synthesis; and (ii) a unified architecture delivering both high-fidelity audio reconstruction and robust generalisation across heterogeneous inputs and conditions.
📝 Abstract
Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model's generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Fréchet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.
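To make the conditioning scheme described above concrete, here is a minimal sketch of how a codec language model can be prompted with a reference performance. This is an illustrative assumption, not the authors' code: the function names, special tokens, and segment ordering are hypothetical, but the idea matches the abstract — the reference MIDI tokens, reference audio tokens, and target MIDI tokens are flattened into one discrete-token sequence, after which the decoder autoregressively predicts the target audio tokens.

```python
# Hypothetical sketch (not the MIDI-VALLE implementation): build the
# discrete-token prompt that conditions audio-token generation on a
# reference performance and the target performance MIDI.
from typing import List

BOS, SEP = 0, 1  # assumed special tokens delimiting the segments


def build_prompt(ref_midi: List[int],
                 ref_audio: List[int],
                 target_midi: List[int]) -> List[int]:
    """Flatten the conditioning context into one token sequence.

    The model would then continue this sequence, emitting the audio
    tokens for the target performance.
    """
    return [BOS] + ref_midi + [SEP] + ref_audio + [SEP] + target_midi + [SEP]


# Toy integers standing in for MIDI-event and audio-codec codebook indices.
prompt = build_prompt(ref_midi=[12, 34],
                      ref_audio=[70, 71, 72],
                      target_midi=[13, 35])
print(prompt)  # [0, 12, 34, 1, 70, 71, 72, 1, 13, 35, 1]
```

Encoding both modalities as tokens in a single sequence is what lets one transformer model them jointly, rather than handling piano rolls and waveforms in separate pipelines.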