🤖 AI Summary
This work addresses the challenging task of generating expressive, acoustically realistic piano performances from symbolic sheet music. We propose the first end-to-end system that integrates a Transformer-based Expressive Performance Rendering (EPR) module with a fine-tuned neural MIDI synthesizer and, crucially, introduces environment-aware audio synthesis (e.g., concert-hall reverberation). Trained on subsets of the ATEPP dataset, the system is evaluated with both objective metrics and subjective listening tests, achieving state-of-the-art expressive accuracy, audio fidelity, and spatial ambience reproduction, and significantly outperforming baselines in the subjective evaluation. Our key contributions are: (1) the first unified framework jointly optimizing expressive performance modeling and physical acoustic realism; (2) a novel environment-aware neural MIDI-to-audio synthesis paradigm; and (3) an open-source, fully reproducible pipeline for expressive piano synthesis.
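The summary above describes a two-stage design: a Transformer renders expressive per-note parameters from the score, and a neural synthesizer conditioned on an environment identity renders audio. The sketch below illustrates how such a pipeline could be wired up in PyTorch. All module names, dimensions, and the choice of expressive parameters (velocity, onset deviation, duration scaling) are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ExpressivePianoPipeline(nn.Module):
    """Hypothetical sketch: score tokens -> expressive params -> audio frames."""

    def __init__(self, vocab_size=512, d_model=256, n_envs=4, n_audio_feats=128):
        super().__init__()
        # Stage 1: Transformer-based EPR. Reads score tokens and predicts
        # per-note expressive attributes (here: velocity, onset deviation,
        # duration scaling -- an assumed parameterisation).
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.epr = nn.TransformerEncoder(layer, num_layers=4)
        self.expr_head = nn.Linear(d_model, 3)

        # Stage 2: neural MIDI synthesizer conditioned on a learned
        # environment embedding (e.g., 0 = concert hall, 1 = studio).
        self.env_embed = nn.Embedding(n_envs, d_model)
        self.synth = nn.GRU(d_model + 3, d_model, batch_first=True)
        self.audio_head = nn.Linear(d_model, n_audio_feats)

    def forward(self, score_tokens, env_id):
        h = self.epr(self.token_embed(score_tokens))   # (B, T, d_model)
        expr = self.expr_head(h)                        # (B, T, 3)
        env = self.env_embed(env_id).unsqueeze(1)       # (B, 1, d_model)
        env = env.expand(-1, h.size(1), -1)             # broadcast over time
        frames, _ = self.synth(torch.cat([env, expr], dim=-1))
        return self.audio_head(frames)                  # (B, T, n_audio_feats)

model = ExpressivePianoPipeline()
tokens = torch.randint(0, 512, (1, 64))                 # a short score excerpt
spec = model(tokens, torch.tensor([0]))                 # env 0: "concert hall"
```

In a real system the synthesizer would typically emit spectrogram frames or raw waveform samples and be trained jointly or fine-tuned against recorded performances; the point here is only the flow of information from score to expression to environment-conditioned audio.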
📝 Abstract
This paper presents an integrated system that transforms symbolic music scores into expressive piano performance audio. By combining a Transformer-based Expressive Performance Rendering (EPR) model with a fine-tuned neural MIDI synthesiser, our approach generates expressive audio performances directly from score input. To the best of our knowledge, this is the first system to offer a streamlined method for converting score MIDI files that lack expressive control information into rich, expressive piano performances. We conducted experiments on subsets of the ATEPP dataset, evaluating the system with both objective metrics and subjective listening tests. Our system not only reconstructs human-like expressiveness accurately, but also captures the acoustic ambience of environments such as concert halls and recording studios, achieving musical expressiveness while maintaining good audio quality in its outputs.
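For intuition on what "capturing the acoustic ambience" means in signal-processing terms: the classical, non-learned way to give a dry recording a room's character is to convolve it with that room's impulse response. The proposed system instead learns ambience jointly inside the neural synthesiser, so the sketch below is background intuition, not the paper's method; the toy impulse response and signal lengths are assumptions for illustration.

```python
import numpy as np

def apply_room(dry: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a dry mono signal with a room impulse response via FFT."""
    n = len(dry) + len(rir) - 1                  # full convolution length
    wet = np.fft.irfft(np.fft.rfft(dry, n) * np.fft.rfft(rir, n), n)
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet       # normalise to avoid clipping

# Example: a toy exponentially decaying "hall" impulse response.
sr = 16000
rir = np.random.randn(sr) * np.exp(-np.linspace(0, 8, sr))
dry = np.random.randn(2 * sr)                    # stand-in for dry piano audio
wet = apply_room(dry, rir)
```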