MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional MIDI-to-audio piano synthesis models generalize poorly, struggling to adapt to diverse MIDI sources, musical styles, and recording environments. To address this, we propose the first end-to-end neural codec language model for piano synthesis, built on the VALLE framework, which jointly encodes MIDI and audio as discrete tokens and conditions on a reference audio performance to explicitly model performance style, enabling strong cross-style and cross-device generalization. Trained on a large-scale piano performance dataset, the model reduces Fréchet Audio Distance (FAD) by over 75% on the ATEPP and Maestro benchmarks and significantly outperforms the baseline in subjective listening tests (202 votes to 58). Key contributions: (i) the first application of neural codec language modeling to end-to-end MIDI-to-audio synthesis; and (ii) a unified architecture delivering both high-fidelity audio reconstruction and robust generalization across heterogeneous inputs and conditions.

📝 Abstract
Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model's generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Fréchet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.
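To picture what "encoding MIDI as discrete tokens" (rather than piano rolls) can look like, here is a minimal sketch of an event-based MIDI tokenizer. The vocabulary layout, bin counts, and event format below are illustrative assumptions, not the paper's actual tokenization scheme:

```python
# Hypothetical MIDI event tokenizer: flattens (pitch, velocity, time-shift)
# events into integers drawn from disjoint ID ranges. MIDI-VALLE's real
# tokenizer may differ; this only illustrates "MIDI as discrete tokens".

PITCHES = 128          # MIDI pitch range 0-127
VELOCITY_BINS = 32     # quantized velocity buckets
TIME_BINS = 100        # quantized time-shift buckets (10 ms steps, assumed)

def tokenize_event(pitch: int, velocity: int, dt_ms: float) -> list[int]:
    """Map one note-on event to three tokens from disjoint ID ranges."""
    vel_bin = min(velocity * VELOCITY_BINS // 128, VELOCITY_BINS - 1)
    time_bin = min(int(dt_ms // 10), TIME_BINS - 1)
    return [
        pitch,                               # pitch tokens:    0 .. 127
        PITCHES + vel_bin,                   # velocity tokens: 128 .. 159
        PITCHES + VELOCITY_BINS + time_bin,  # time tokens:     160 .. 259
    ]

def tokenize_performance(events):
    """Flatten a list of (pitch, velocity, dt_ms) events into one sequence."""
    tokens = []
    for pitch, velocity, dt_ms in events:
        tokens.extend(tokenize_event(pitch, velocity, dt_ms))
    return tokens

# Example: C4 then E4, the second note 120 ms after the first
print(tokenize_performance([(60, 80, 0.0), (64, 96, 120.0)]))
# → [60, 148, 160, 64, 152, 172]
```

Because every event becomes a short run of integers from a fixed vocabulary, the MIDI side can be fed to a language model exactly like text or audio codec tokens.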
Problem

Research questions and friction points this paper is trying to address.

Generating expressive piano audio from music scores
Improving MIDI-to-audio synthesis generalization
Enhancing performance modeling with MIDI and audio tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural codec language model for MIDI-to-audio
Conditions on reference audio and MIDI
Encodes MIDI and audio as discrete tokens
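The conditioning idea in the points above can be pictured as VALLE-style prompt concatenation: reference MIDI tokens and reference audio tokens prefix the target MIDI, and the language model continues the sequence with target audio tokens, carrying over the reference recording's acoustics and style. A minimal sketch — the separator token and all token values are placeholders, not the paper's architecture:

```python
# Sketch of VALLE-style prompt construction for MIDI-to-audio synthesis.
# Token IDs below are arbitrary placeholders for illustration only.

SEP = 1000  # hypothetical separator token between segments

def build_prompt(ref_midi: list[int], ref_audio: list[int],
                 target_midi: list[int]) -> list[int]:
    """Concatenate conditioning segments into one token sequence.
    A codec language model would autoregressively continue this prompt
    with target audio tokens, which a neural codec decoder then
    converts back to a waveform."""
    return ref_midi + [SEP] + ref_audio + [SEP] + target_midi + [SEP]

prompt = build_prompt(
    ref_midi=[60, 148, 160],        # tokens of the reference performance MIDI
    ref_audio=[901, 902, 903, 904], # codec tokens of the reference recording
    target_midi=[64, 152, 172],     # tokens of the MIDI to be synthesised
)
print(len(prompt))  # → 13
```

Swapping the reference segment for a recording from a different piano or room is what lets a model of this shape adapt its output acoustics without retraining.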