🤖 AI Summary
Optical Music Recognition (OMR) faces two major challenges: the scarcity of real annotated data and the non-uniqueness of output encodings. This work proposes an end-to-end zero-shot OMR system that integrates high-fidelity synthetic data generation, **kern encoding normalization to enforce a unique representation, and a syntax-constrained decoding mechanism. Built on a compact Transformer architecture, the 59-million-parameter model trains on a single GPU in six hours and significantly outperforms billion-parameter baselines. It achieves the best OMR-NED score, 18.46%, on a benchmark of synthetically rendered scores and reduces the error rate on historical Polish scans to 63.97% OMR-NED. This study is the first to combine data synthesis, encoding normalization, and grammar-aware decoding, addressing both the data scarcity and the encoding ambiguity bottlenecks in OMR.
📝 Abstract
Optical Music Recognition (OMR), the task of transcribing sheet music into a structured textual representation, is currently bottlenecked by a lack of large-scale, annotated datasets of real scans. This forces models to rely on either few-shot transfer or synthetic training pipelines that remain overly simplistic. A secondary challenge is encoding non-uniqueness: in the popular Humdrum **kern format for transcribing music, multiple different text encodings can render into the same visual sheet music. This one-to-many mapping makes the learning task harder and introduces high uncertainty during decoding. We propose Transcoda, an OMR system built on (i) an advanced synthetic data generation pipeline, (ii) a normalization of the **kern encoding to enforce a unique normal form, and (iii) grammar-based decoding to ensure the syntactic correctness of the output. This approach allows us to train a compact 59M-parameter model in just 6 hours on a single GPU that outperforms billion-parameter baselines. Transcoda achieves the best score among state-of-the-art baselines on a newly curated benchmark of synthetically rendered scores at 18.46% OMR-NED (compared to 43.91% for the next-best system, Legato) and reduces the error rate on historical Polish scans to 63.97% OMR-NED (down from 80.16% for SMT++).
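To make the idea of a unique normal form concrete, the following is a minimal toy sketch, not Transcoda's actual normalization. It assumes a heavily simplified note token made only of a duration (digits with optional trailing dots), pitch letters, and accidentals, and it assumes these parts may appear in any order. The functions `normalize_token` and `normalize_line` and the chosen canonical order (duration, pitch, accidental) are hypothetical and serve only to show how several equivalent spellings can collapse to one string.

```python
import re

# Toy assumption: a note token consists of a duration (digits plus optional
# dots), pitch letters a-g/A-G, and accidentals (# or -), in any order.
# This is an illustration only, not the normalization used in the paper.
_PART = re.compile(r"(?P<dur>\d+\.*)|(?P<pitch>[a-gA-G]+)|(?P<acc>[#\-]+)")


def normalize_token(token: str) -> str:
    """Reorder the parts of a simplified note token into one fixed order.

    Characters outside the toy alphabet above are silently dropped.
    """
    dur = pitch = acc = ""
    for match in _PART.finditer(token):
        if match.group("dur"):
            dur += match.group("dur")
        elif match.group("pitch"):
            pitch += match.group("pitch")
        else:
            acc += match.group("acc")
    # Hypothetical canonical order: duration, then pitch, then accidental.
    return dur + pitch + acc


def normalize_line(line: str) -> str:
    """Normalize every tab-separated token on one line."""
    return "\t".join(normalize_token(tok) for tok in line.split("\t"))


if __name__ == "__main__":
    # Two spellings of the same toy note map to a single normal form.
    print(normalize_token("c#4"), normalize_token("4c#"))
    print(normalize_line("c#4\tee-8."))
```

With a mapping like this applied to the training targets, every rendered score has exactly one target string, which is the property the abstract attributes to the normalized **kern encoding.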