🤖 AI Summary
Piano sustain pedal detection has traditionally been treated as binary on/off classification, which cannot capture the continuous pedal depth that shapes expressive performance. This paper introduces a Transformer-based framework for continuous pedal depth estimation that matches state-of-the-art binary detection while producing musically meaningful, fine-grained depth predictions. The model is trained on synthetic audio rendered under multiple room acoustic conditions and evaluated with a leave-one-room-out protocol, in which the test room is unseen during training. Experiments show high accuracy in continuous depth estimation. However, neither the proposed model nor the two baselines is robust to unseen rooms, and statistical analysis shows that reverberation introduces an overestimation bias, which makes acoustic generalization a key challenge for real-world deployment.
📝 Abstract
Piano sustain pedal detection has previously been approached as a binary on/off classification task, limiting its application in real-world piano performance scenarios where pedal depth significantly influences musical expression. This paper presents a novel approach for high-resolution estimation that predicts continuous pedal depth values. We introduce a Transformer-based architecture that not only matches state-of-the-art performance on the traditional binary classification task but also achieves high accuracy in continuous pedal depth estimation. Furthermore, by estimating continuous values, our model provides musically meaningful predictions for sustain pedal usage, whereas baseline models, limited to binary detection, struggle to capture such nuanced expression. Additionally, this paper investigates the influence of room acoustics on sustain pedal estimation using a synthetic dataset that includes varied acoustic conditions. We train our model with different combinations of room settings and test it in an unseen environment using a "leave-one-out" approach. Our findings show that neither the two baseline models nor ours is robust to unseen room conditions. Statistical analysis further confirms that reverberation influences model predictions and introduces an overestimation bias.
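The "leave-one-out" evaluation over rooms can be illustrated with a minimal sketch. This is not the authors' code: the function name `leave_one_room_out`, the room names, and the clip identifiers are all hypothetical, and the abstract does not say how many rooms or clips the dataset contains. The sketch only shows the splitting logic, in which each room is held out once for testing while the remaining rooms form the training set.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_room_out(
    clips_by_room: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_room, train_clips, test_clips), one split per room."""
    for held_out in sorted(clips_by_room):
        # Training set: every clip recorded in any room except the held-out one.
        train = [
            clip
            for room in sorted(clips_by_room)
            if room != held_out
            for clip in clips_by_room[room]
        ]
        # Test set: only clips from the room the model has never seen.
        test = list(clips_by_room[held_out])
        yield held_out, train, test


# Hypothetical rooms and clip IDs, for illustration only.
clips_by_room = {
    "room_a": ["a1", "a2"],
    "room_b": ["b1"],
    "room_c": ["c1", "c2"],
}
splits = list(leave_one_room_out(clips_by_room))
```

Because no clip from the held-out room appears in training, a drop in accuracy on the test split can be attributed to the change in acoustic conditions rather than to unseen musical content alone, provided the same performances are rendered in every room (an assumption the abstract does not confirm).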