🤖 AI Summary
Standard Transformers suffer from the quadratic complexity of self-attention, which prevents modeling long-range dependencies across entire piano pieces in automatic transcription. To address this, the paper proposes an efficient sparse Transformer architecture. The method introduces (1) sliding-window self-attention in both encoder and decoder to reduce sequence-modeling complexity; (2) a hybrid global-local cross-attention mechanism conditioned on MIDI token types, integrating note-category priors with local contextual information; and (3) embedding-level pooling to further reduce computational cost. Evaluated on the MAESTRO dataset, the approach achieves a 2.3× inference speedup and a 41% reduction in GPU memory consumption, while keeping transcription accuracy (F1-score) within ±0.3% of the full-attention baseline. To the authors' knowledge, this is the first method enabling end-to-end, whole-piece transcription without sacrificing accuracy.
📝 Abstract
This paper investigates automatic piano transcription based on computationally efficient yet high-performing variants of the Transformer that can capture long-term dependencies over a whole musical piece. Recently, Transformer-based sequence-to-sequence models have demonstrated excellent performance in piano transcription. These models, however, cannot process a whole piece at once due to the quadratic complexity of the self-attention mechanism, so in practice music signals are typically processed in a sliding-window manner. To overcome this limitation, we propose an efficient architecture with sparse attention mechanisms. Specifically, we introduce sliding-window self-attention mechanisms for both the encoder and decoder, and a hybrid global-local cross-attention mechanism that attends to different spans according to the MIDI token types. We also use a hierarchical pooling strategy between the encoder and decoder to further reduce the computational load. Our experiments on the MAESTRO dataset showed that the proposed model achieved a significant reduction in computational cost and memory usage, accelerating inference, while maintaining transcription performance comparable to the full-attention baseline. This allows training with longer audio contexts on the same hardware, demonstrating the viability of sparse attention for building efficient and high-performance piano transcription systems. The code is available at https://github.com/WX-Wei/efficient-seq2seq-piano-trans.
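The two sparse-attention patterns described above can be illustrated with boolean attention masks. The sketch below is not the authors' implementation; the window size, token-type flags, and query-to-frame alignment used here are illustrative assumptions. A sliding-window self-attention mask restricts each position to a local band, while a hybrid cross-attention mask lets some decoder queries (e.g., certain MIDI token types) attend to all encoder frames and others only to a local span:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Banded mask: position i may attend only to positions j with |i - j| <= window.
    Sketch of sliding-window self-attention, not the paper's exact code."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def hybrid_cross_mask(is_global, frame_of_query, n_frames: int, window: int) -> np.ndarray:
    """Hybrid global-local cross-attention mask (illustrative).
    Queries flagged global attend to every encoder frame; the rest attend to a
    local band around an (assumed) aligned frame position."""
    n_queries = len(is_global)
    mask = np.zeros((n_queries, n_frames), dtype=bool)
    for i in range(n_queries):
        if is_global[i]:
            mask[i, :] = True  # e.g., tokens carrying piece-level context
        else:
            c = frame_of_query[i]  # hypothetical query-to-frame alignment
            mask[i, max(0, c - window):min(n_frames, c + window + 1)] = True
    return mask

self_mask = sliding_window_mask(seq_len=6, window=1)
cross_mask = hybrid_cross_mask(is_global=[True, False],
                               frame_of_query=[0, 5],
                               n_frames=10, window=2)
```

In a real model these masks would be applied by setting disallowed attention logits to -inf before the softmax; the banded mask reduces self-attention cost from O(n²) to O(n·w).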