🤖 AI Summary
Existing online piano transcription methods exhibit latencies of 128–320 ms, far exceeding the sub-30 ms threshold required for real-time musical interaction. This work presents the first systematic adaptation of a state-of-the-art online transcription model to ultra-low-latency regimes. We propose a causally constrained, lightweight architecture that eliminates all non-causal operations: it employs a shared-parameter causal convolutional backbone, efficient real-time preprocessing, and compact label encoding, while explicitly optimizing the inference latency–accuracy trade-off. Evaluated on the MAESTRO dataset, our system achieves end-to-end latency below 30 ms, at the cost of a drop in transcription accuracy due to strictly causal processing. We further quantify the intrinsic trade-off between preprocessing latency and transcription quality. To foster reproducibility, we release a fully open-source, benchmarked implementation as a practical foundation for low-latency, interactive music applications.
📝 Abstract
Advances in neural network design and the availability of large-scale labeled datasets have driven major improvements in piano transcription. Existing approaches target either offline applications, with no restrictions on computational demands, or online transcription, with delays of 128–320 ms. However, most real-time musical applications require latencies below 30 ms. In this work, we investigate whether and how the current state-of-the-art online transcription model can be adapted for real-time piano transcription. Specifically, we eliminate all non-causal processing and reduce the computational load by sharing computations across core model components and varying the model size. Additionally, we explore different pre- and postprocessing strategies and related label encoding schemes, and discuss their suitability for real-time transcription. Evaluating these adaptations on the MAESTRO dataset, we find a drop in transcription accuracy due to strictly causal processing, as well as a trade-off between preprocessing latency and prediction accuracy. We release our system as a baseline to support researchers in designing models for minimum-latency real-time transcription.
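For readers unfamiliar with the causal constraint the abstract refers to, the core idea can be sketched as a 1-D convolution that pads only on the left, so each output frame depends solely on the current and past input frames. This is a hypothetical toy illustration of the principle, not the paper's actual architecture; `causal_conv1d` and its zero-padding scheme are assumptions made for the example:

```python
# Minimal sketch of a causal 1-D convolution: output[t] uses only
# inputs x[t - k + 1 .. t], so no future audio frames are required.
# Illustrative only; the paper's model is a deep causal CNN backbone.

def causal_conv1d(x, kernel):
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # left-pad only: strictly causal
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]

# Each output frame t is available as soon as input frame t arrives,
# so the convolution itself adds no look-ahead latency.
signal = [1.0, 2.0, 3.0, 4.0]
print(causal_conv1d(signal, [0.5, 0.5]))  # → [0.5, 1.5, 2.5, 3.5]
```

A non-causal ("same"-padded) convolution would instead pad on both sides, making output frame t depend on frame t+1 and forcing the system to wait for future audio, which is exactly what the adaptation eliminates.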