🤖 AI Summary
Existing automatic music transcription methods for violin lack joint modeling of pitch and playing technique, rely heavily on manual annotations, and generalize poorly. Method: This paper proposes a lightweight end-to-end multi-task model that simultaneously detects note onsets/offsets, estimates pitch, and classifies six canonical playing techniques (e.g., vibrato, bow change, harmonics) within a unified framework. To address the scarcity of real-world labeled data, the authors introduce MOSA-VPT—a high-fidelity synthetic dataset—and design a physics-informed data augmentation strategy to generate audio with precise technique annotations. Contribution/Results: The model is optimized via joint multi-task training and achieves state-of-the-art performance on real recordings: 89.3% F1-score for technique classification—significantly outperforming prior approaches—while requiring no manual annotation.
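The multi-task setup described above (shared representation feeding separate onset, offset, pitch, and technique outputs, trained with a joint loss) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the feature dimension, head shapes, loss weighting, and all names (`MultiTaskHead`, `joint_loss`) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Hypothetical sketch: per-frame features from a shared encoder
    feed four task-specific linear heads (onset, offset, pitch, technique)."""
    def __init__(self, feat_dim=256, n_pitches=88, n_techniques=6):
        super().__init__()
        self.onset = nn.Linear(feat_dim, 1)        # frame-wise onset logit
        self.offset = nn.Linear(feat_dim, 1)       # frame-wise offset logit
        self.pitch = nn.Linear(feat_dim, n_pitches)       # pitch class per frame
        self.technique = nn.Linear(feat_dim, n_techniques)  # 6 playing techniques

    def forward(self, feats):  # feats: (batch, time, feat_dim)
        return {
            "onset": self.onset(feats).squeeze(-1),
            "offset": self.offset(feats).squeeze(-1),
            "pitch": self.pitch(feats),
            "technique": self.technique(feats),
        }

def joint_loss(out, tgt):
    """Unweighted sum of per-task losses; real systems typically
    weight or schedule the individual terms."""
    bce = nn.functional.binary_cross_entropy_with_logits
    ce = nn.functional.cross_entropy
    return (bce(out["onset"], tgt["onset"])
            + bce(out["offset"], tgt["offset"])
            + ce(out["pitch"].flatten(0, 1), tgt["pitch"].flatten())
            + ce(out["technique"].flatten(0, 1), tgt["technique"].flatten()))
```

Joint training of this kind lets the technique head share acoustic features learned for pitch and timing, which is the usual motivation for unifying the tasks rather than training separate classifiers.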
📝 Abstract
While automatic music transcription is well-established in music information retrieval, most models are limited to transcribing pitch and timing information from audio, and thus omit crucial expressive and instrument-specific nuances. One example is playing technique on the violin, which affords the instrument its distinct palette of timbres for maximal emotional impact. Here, we propose **VioPTT** (Violin Playing Technique-aware Transcription), a lightweight, end-to-end model that directly transcribes violin playing technique in addition to pitch onset and offset. Furthermore, we release **MOSA-VPT**, a novel, high-quality synthetic violin playing technique dataset to circumvent the need for manually labeled annotations. Leveraging this dataset, our model demonstrates strong generalization to real-world note-level violin technique recordings in addition to achieving state-of-the-art transcription performance. To our knowledge, VioPTT is the first to jointly combine violin transcription and playing technique prediction within a unified framework.