AI Summary
This work addresses the challenge of inaccurate tempo estimation in MIDI transcriptions of solo instrumental audio, which hinders high-fidelity sheet music generation. We propose beat-tracking algorithms tailored to three solo instrumental settings: drum, guitar, and classical piano. Our method introduces the first temporal convolutional network (TCN) architecture specifically designed for solo instrumental audio, augmented by pre-trained transfer learning and a multi-strategy post-processing module incorporating confidence- and periodicity-based constraints. Experiments demonstrate that the customized TCN achieves 99.7% Acc1 on the guitar dataset and improves Acc1 to 50.9% on classical piano, doubling the baseline performance. Post-processing further boosts average Acc1 on challenging samples by 12.3%. By significantly enhancing rhythmic structure modeling for single-instrument sources, our approach provides critical tempo and beat information essential for high-quality automatic musical score generation.
Abstract
Recently, automatic music transcription has made it possible to convert musical audio into accurate MIDI. However, the resulting MIDI lacks musical notation such as tempo, which hinders its conversion into sheet music. In this paper, we investigate state-of-the-art tempo estimation techniques and evaluate their performance on solo instrumental music. These include temporal convolutional network (TCN) and recurrent neural network (RNN) models pretrained on large amounts of mixed vocal and instrumental music, as well as TCN models trained specifically on solo instrumental performances. In evaluations on drum, guitar, and classical piano datasets, our TCN models with the new training scheme achieved the best performance. For guitar tempo estimation, our newly trained TCN model raises Acc1 by 38.6 percentage points over the pretrained TCN model, which reaches an Acc1 of 61.1%. Although our trained TCN model is twice as accurate as the pretrained TCN model in estimating classical piano tempo, its Acc1 is only 50.9%. To improve the performance of the deep learning models, we investigate combining them with various post-processing methods. These post-processing techniques effectively enhance the performance of deep learning models when they struggle to estimate the tempo of specific instruments.
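The abstract reports results as Acc1 without defining it. The sketch below assumes the definition commonly used in the tempo estimation literature: the fraction of tracks whose estimated tempo lies within 4% of the reference tempo, with no credit for double- or half-tempo estimates. The function name `acc1`, the `tolerance` parameter, and the example tempi are illustrative and do not come from the paper.

```python
def acc1(estimated_bpm, reference_bpm, tolerance=0.04):
    """Fraction of tracks whose tempo estimate is within `tolerance`
    (relative) of the reference tempo.

    Assumption: this follows the usual Acc1 convention (4% tolerance,
    no credit for octave errors such as double or half tempo).
    """
    if len(estimated_bpm) != len(reference_bpm):
        raise ValueError("estimated and reference lists must have equal length")
    if not reference_bpm:
        raise ValueError("at least one track is required")

    hits = sum(
        abs(est - ref) <= tolerance * ref
        for est, ref in zip(estimated_bpm, reference_bpm)
    )
    return hits / len(reference_bpm)


# Made-up tempi for four tracks (not data from the paper).
reference = [120.0, 90.0, 140.0, 60.0]
estimated = [121.0, 180.0, 139.0, 63.0]

# Track 2 is a double-tempo error and track 4 is off by 5%,
# so only two of the four estimates count as correct.
score = acc1(estimated, reference)
print(f"Acc1 = {score:.1%}")
```

Under this definition, the guitar result in the abstract is an absolute gain: 61.1% plus 38.6 percentage points gives the 99.7% Acc1 quoted in the summary.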