Streaming Piano Transcription Based on Consistent Onset and Offset Decoding With Sustain Pedal Detection

📅 2025-03-03
🏛️ International Society for Music Information Retrieval Conference
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in real-time audio-to-MIDI piano transcription: inconsistent modeling of note onsets and offsets, and the lack of explicit modeling of the sustain pedal. We propose a low-latency streaming transcription framework. Methodologically, we design a dual-decoder architecture: a causal Transformer-based onset decoder dedicated to precise onset detection, and a lightweight offset-pedal decoder that jointly predicts note offsets and pedal state to ensure temporal consistency. A convolutional encoder extracts local time-frequency features and incorporates a frame-level pedal classification module. Evaluated on the MAESTRO dataset, our method achieves accuracy comparable to or exceeding state-of-the-art offline approaches, while improving inference speed by 3.2× and reducing memory footprint by 41%. To our knowledge, this is the first end-to-end real-time transcription system capable of simultaneously modeling onsets, offsets, and sustain pedal state with high accuracy and low latency.

📝 Abstract
This paper describes a streaming audio-to-MIDI piano transcription approach that sequentially translates a music signal into a sequence of note onset and offset events. The sequence-to-sequence nature of this task may call for a computationally intensive Transformer model for better performance; such models have recently been used for offline transcription benchmarks and could be extended to streaming transcription with causal attention mechanisms. We assume that the performance limitation of this naive approach lies in the decoder. Although the time-frequency features useful for onset detection differ considerably from those for offset detection, a single decoder is trained to output a mixed sequence of onset and offset events with no guarantee of correspondence between the onset and offset events of the same note. To overcome this limitation, we propose a streaming encoder-decoder model that uses a convolutional encoder to aggregate local acoustic features, followed by an autoregressive Transformer decoder that detects a variable number of onset events and a second decoder that detects offset events for the active pitches, validating the sustain pedal state at each time frame. Experiments on the MAESTRO dataset showed that the proposed streaming method performed comparably to, or even better than, state-of-the-art offline methods while significantly reducing the computational cost.
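The per-frame sustain-pedal validation described in the abstract can be illustrated with a minimal event-assembly loop. This is a sketch of the standard damper-pedal rule (a key release while the pedal is down does not end the note), not the paper's actual decoder; the function name and frame format are hypothetical.

```python
def assemble_notes(frames):
    """Turn per-frame streaming predictions into note events.

    `frames` is an iterable of (onsets, offsets, pedal_down) tuples:
    `onsets` / `offsets` are sets of MIDI pitches predicted at that
    frame, `pedal_down` is a bool. Returns a list of
    (pitch, onset_frame, offset_frame) triples. Illustrative sketch
    of sustain-pedal validation, not the paper's implementation.
    """
    active = {}     # pitch -> onset frame of a currently sounding note
    sustained = {}  # pitch -> onset frame; key released, pedal holds it
    notes = []
    for t, (onsets, offsets, pedal_down) in enumerate(frames):
        if not pedal_down:
            # Pedal released: commit every note it was holding.
            for pitch, t_on in sustained.items():
                notes.append((pitch, t_on, t))
            sustained.clear()
        for pitch in offsets:
            if pitch in active:
                t_on = active.pop(pitch)
                if pedal_down:
                    sustained[pitch] = t_on  # defer the offset
                else:
                    notes.append((pitch, t_on, t))
        for pitch in onsets:
            if pitch in sustained:
                # Re-striking a sustained pitch ends the held note.
                notes.append((pitch, sustained.pop(pitch), t))
            active[pitch] = t
    return notes
```

With the pedal down, a predicted offset at frame 1 is deferred until the pedal release at frame 3; without the pedal, the offset is committed immediately.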
Problem

Research questions and friction points this paper is trying to address.

Streaming audio-to-MIDI piano transcription with sequential note event detection.
Overcoming decoder limitations in onset and offset event correspondence.
Reducing computational cost while matching state-of-the-art offline performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming encoder-decoder model for piano transcription.
Separate decoders for onset and offset event detection.
Sustain pedal validation integrated at each time frame.
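The streaming property of the onset decoder hinges on causal attention: the output at frame t may depend only on frames up to and including t. A minimal, framework-agnostic sketch of such a mask (any deep-learning library provides an equivalent):

```python
def causal_mask(n):
    """Return an n x n boolean mask where mask[i][j] is True iff
    position i may attend to position j (i.e. j <= i). Applying this
    mask in self-attention makes the decoder's output at frame t a
    function of past and current frames only, which is what permits
    low-latency streaming inference."""
    return [[j <= i for j in range(n)] for i in range(n)]
```

For example, row 0 of `causal_mask(3)` allows attention only to position 0, while row 2 allows attention to all three positions.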