🤖 AI Summary
Standard Transformers suffer from the quadratic complexity of self-attention, which prevents modeling long-range dependencies across entire piano pieces in automatic transcription. To address this, the paper proposes an efficient sparse Transformer architecture. The method introduces (1) sliding-window self-attention in both encoder and decoder to reduce sequence-modeling complexity; (2) a hybrid global-local cross-attention mechanism conditioned on MIDI token types, integrating note-category priors with local contextual information; and (3) embedding-level pooling to further reduce computational cost. Evaluated on the MAESTRO dataset, the approach achieves a 2.3× inference speedup and a 41% reduction in GPU memory consumption, while keeping transcription accuracy (F1-score) within ±0.3% of the full-attention baseline. To the authors' knowledge, this is the first method enabling end-to-end, whole-piece transcription without sacrificing accuracy.
📝 Abstract
This paper investigates automatic piano transcription based on computationally efficient yet high-performing variants of the Transformer that can capture long-term dependencies over a whole musical piece. Recently, Transformer-based sequence-to-sequence models have demonstrated excellent performance in piano transcription. These models, however, cannot process a whole piece at once due to the quadratic complexity of the self-attention mechanism, so in practice music signals are typically processed in a sliding-window manner. To overcome this limitation, we propose an efficient architecture with sparse attention mechanisms. Specifically, we introduce sliding-window self-attention mechanisms for both the encoder and decoder, and a hybrid global-local cross-attention mechanism that attends to different spans according to the MIDI token types. We also use a hierarchical pooling strategy between the encoder and decoder to further reduce the computational load. Our experiments on the MAESTRO dataset showed that the proposed model achieved a significant reduction in computational cost and memory usage, accelerating inference, while maintaining transcription performance comparable to the full-attention baseline. This allows training with longer audio contexts on the same hardware, demonstrating the viability of sparse attention for building efficient and high-performance piano transcription systems. The code is available at https://github.com/WX-Wei/efficient-seq2seq-piano-trans.
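The two sparse-attention patterns described above can be illustrated with boolean attention masks. The sketch below is not the authors' implementation; the window size, token-type flags, and query-to-frame alignment used here are illustrative assumptions. A sliding-window self-attention mask restricts each position to a local band, while a hybrid cross-attention mask lets some decoder queries (e.g., certain MIDI token types) attend to all encoder frames and others only to a local span:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Banded mask: position i may attend only to positions j with |i - j| <= window.
    Sketch of sliding-window self-attention, not the paper's exact code."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def hybrid_cross_mask(is_global, frame_of_query, n_frames: int, window: int) -> np.ndarray:
    """Hybrid global-local cross-attention mask (illustrative).
    Queries flagged global attend to every encoder frame; the rest attend to a
    local band around an (assumed) aligned frame position."""
    n_queries = len(is_global)
    mask = np.zeros((n_queries, n_frames), dtype=bool)
    for i in range(n_queries):
        if is_global[i]:
            mask[i, :] = True  # e.g., tokens carrying piece-level context
        else:
            c = frame_of_query[i]  # hypothetical query-to-frame alignment
            mask[i, max(0, c - window):min(n_frames, c + window + 1)] = True
    return mask

self_mask = sliding_window_mask(seq_len=6, window=1)
cross_mask = hybrid_cross_mask(is_global=[True, False],
                               frame_of_query=[0, 5],
                               n_frames=10, window=2)
```

In a real model these masks would be applied by setting disallowed attention logits to -inf before the softmax; the banded mask reduces self-attention cost from O(n²) to O(n·w).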