Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Optical Music Recognition (OMR) faces two major challenges: the scarcity of real annotated data and the non-uniqueness of output encodings. This work proposes Transcoda, an end-to-end zero-shot OMR system that combines high-fidelity synthetic data generation, a normalization of the **kern encoding that enforces a unique normal form, and grammar-based decoding that keeps the output syntactically correct. The model is a compact Transformer with 59 million parameters, is trained on a single GPU in six hours, and outperforms billion-parameter baselines. On a newly curated benchmark of synthetically rendered scores it reaches 18.46% OMR-NED, the best result among the compared systems (the next-best, Legato, scores 43.91%), and on historical Polish scans it lowers the error rate to 63.97% OMR-NED (down from 80.16% for SMT++). Together, data synthesis, encoding normalization, and grammar-aware decoding target the data scarcity and encoding ambiguity bottlenecks in OMR.

📝 Abstract

Optical Music Recognition (OMR), the task of transcribing sheet music into a structured textual representation, is currently bottlenecked by a lack of large-scale, annotated datasets of real scans. This forces models to rely on either few-shot transfer or synthetic training pipelines that remain overly simplistic. A secondary challenge is encoding non-uniqueness: in the popular Humdrum **kern format for transcribing music, multiple different text encodings can render into the same visual sheet music. This one-to-many mapping creates a harder learning task and introduces high uncertainty during decoding. We propose Transcoda, an OMR system built on (i) an advanced synthetic data generation pipeline, (ii) a normalization of the **kern encoding to enforce a unique normal form, and (iii) grammar-based decoding to ensure the syntactic correctness of the output. This approach allows us to train a compact 59M-parameter model in just 6 hours on a single GPU that outperforms billion-parameter baselines. Transcoda achieves the best score among state-of-the-art baselines on a newly curated benchmark of synthetically rendered scores at 18.46% OMR-NED (compared to 43.91% for the next-best system, Legato) and reduces the error rate on historical Polish scans to 63.97% OMR-NED (down from 80.16% for SMT++).
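
To make the non-uniqueness problem concrete, the sketch below shows how two encodings of the same chord can be penalized by an edit-distance metric until both are mapped to one normal form. It is an illustration under stated assumptions, not the paper's method: it assumes that the order of notes inside a chord token is one source of ambiguity, it uses an arbitrary lexicographic order as the normal form, and it uses a plain token-level normalized edit distance as a simplified stand-in for OMR-NED. The function names are hypothetical.

```python
from typing import List


def normalize_token(token: str) -> str:
    """Put the space-separated notes of a chord token into a fixed order.

    Assumption: lexicographic order is an arbitrary canonical choice made
    for this illustration, not the normal form defined by Transcoda.
    """
    return " ".join(sorted(token.split(" ")))


def normalize_tokens(tokens: List[str]) -> List[str]:
    """Apply the toy normalization to every token in a sequence."""
    return [normalize_token(t) for t in tokens]


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two token sequences (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(pred: List[str], ref: List[str]) -> float:
    """Edit distance divided by the longer length (simplified stand-in)."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))


# Two token sequences that differ only in the note order of the first chord.
raw_a = ["4c 4e 4g", "4d 4f 4a"]
raw_b = ["4g 4e 4c", "4d 4f 4a"]

before = normalized_edit_distance(raw_a, raw_b)
after = normalized_edit_distance(normalize_tokens(raw_a),
                                 normalize_tokens(raw_b))
print(before, after)  # prints "0.5 0.0"
```

Without normalization, one of the two tokens counts as an error even though the two sequences differ only in how the same chord is written. After normalization the distance is zero, which is the motivation for training and evaluating against a single normal form.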

Problem

Research questions and friction points this paper is trying to address.

Optical Music Recognition

data scarcity

non-uniqueness

encoding ambiguity

annotated datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optical Music Recognition

synthetic data generation

kern normalization