A Dataset for the Recognition of Historical and Handwritten Music Scores in Western Notation

📅 2026-05-18

🤖 AI Summary
Optical music recognition (OMR) systems have lacked training data that reflects realistic conditions, which limits their performance on authentic historical and handwritten scores. To address this, the authors introduce MusiCorpus, a dataset of 1,309 pages of historical sheet music in Western notation, primarily handwritten and drawn from memory institutions, with MusicXML transcriptions and symbol annotations. It is the largest dataset of handwritten music to date and the first to contain a realistic and representative sample of the musical document collections held by memory institutions. It is suitable for training and evaluating both end-to-end and object detection-based OMR systems and for comparing their performance, with the aim of making digitised musical heritage machine-readable.
📝 Abstract
A large amount of musical heritage has been digitised by memory institutions: libraries, museums, and archives. Nevertheless, the field of Optical Music Recognition (OMR) has struggled with making this music machine-readable, despite advances in deep learning, mostly because no datasets for training systems in realistic conditions were available. The MusiCorpus dataset aims to remedy this situation by providing 1,309 pages of historical sheet music, primarily handwritten, with MusicXML transcriptions and symbol annotations. It is the largest dataset of handwritten music to date and the first dataset containing a realistic and representative sample of musical document collections from memory institutions, suitable for training and evaluating both end-to-end and object detection-based OMR systems and comparing their performance.
Problem

Research questions and friction points this paper is trying to address.

Optical Music Recognition
historical music scores
handwritten music
dataset
machine-readable
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optical Music Recognition
handwritten music scores
MusiCorpus dataset
MusicXML
historical sheet music
Pau Torras
Computer Vision Center, Barcelona, Spain
Jiří Mayer
Institute of Formal and Applied Linguistics, Charles University, Prague, Czechia
Carles Badal
Department of Arts and Musicology, Universitat Autònoma de Barcelona, Spain
Martina Dvořáková
Moravian Library, Brno, Czechia
Markéta Herzanová Vlková
Moravian Library, Brno, Czechia
Gerard Asbert
Computer Vision Center, Barcelona, Spain
Vojtěch Dvořák
Institute of Formal and Applied Linguistics, Charles University, Prague, Czechia
Samuel Šomorjai
Moravian Library, Brno, Czechia
Jan Hajič jr.
Institute of Formal and Applied Linguistics, Charles University in Prague
Natural Language Processing, Artificial Intelligence, Structural Bioinformatics
Alicia Fornés
Computer Vision Center, Barcelona, Spain