Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal music translation (e.g., sheet-music images ↔ symbolic scores ↔ audio) has long relied on task-specific, unidirectional models. This paper introduces the first unified, bidirectional cross-modal translation framework, enabling end-to-end sequence-to-sequence conversion among sheet-music images, MusicXML/MIDI, and performance audio. Methodologically, the authors curate a large-scale paired audio–image dataset (1,300+ hours of YouTube videos), design a unified cross-modal tokenization scheme, and adopt a Transformer encoder–decoder architecture for multi-task joint training. Experiments demonstrate state-of-the-art performance: optical music recognition achieves a symbol error rate of 13.67%, and the framework significantly outperforms single-task baselines across automatic music transcription, optical music recognition, and image-conditioned audio generation. Notably, it enables, for the first time, high-fidelity audio synthesis conditioned directly on sheet-music images.

📝 Abstract
Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translations between these modalities are established as core tasks of music information retrieval, such as automatic music transcription (audio-to-MIDI) and optical music recognition (score image to symbolic score). However, most past work on multimodal translation trains specialized models on individual translation tasks. In this paper, we propose a unified approach, where we train a general-purpose model on many translation tasks simultaneously. Two key factors make this unified approach viable: a new large-scale dataset and the tokenization of each modality. Firstly, we propose a new dataset consisting of more than 1,300 hours of paired audio–score-image data collected from YouTube videos, an order of magnitude larger than any existing music modality translation dataset. Secondly, our unified tokenization framework discretizes score images, audio, MIDI, and MusicXML into sequences of tokens, enabling a single encoder-decoder Transformer to tackle multiple cross-modal translations as one coherent sequence-to-sequence task. Experimental results confirm that our unified multitask model improves upon single-task baselines in several key areas, notably reducing the symbol error rate for optical music recognition from 24.58% to a state-of-the-art 13.67%, while similarly substantial improvements are observed across the other translation tasks. Notably, our approach achieves the first successful score-image-conditioned audio generation, marking a significant breakthrough in cross-modal music generation.
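The core idea in the abstract — discretizing every modality into tokens from a shared vocabulary so one encoder-decoder Transformer can serve all translation directions — can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the task-prefix tokens (`<omr>`, `<amt>`, `<img2audio>`) and the namespaced token names are assumptions for demonstration.

```python
# Hypothetical sketch of a task-prefixed, shared-vocabulary tokenization,
# in the spirit of the paper's unified framework (all token names are illustrative).

SPECIAL = ["<pad>", "<bos>", "<eos>"]
TASK_TOKENS = ["<omr>", "<amt>", "<img2audio>"]  # assumed task/direction prefixes


class UnifiedVocab:
    """One vocabulary shared across modalities; each modality's symbols are
    namespaced (img:, xml:, midi:, aud:) so a single Transformer can consume
    or emit any of them."""

    def __init__(self):
        self.token_to_id = {}
        for tok in SPECIAL + TASK_TOKENS:
            self.add(tok)

    def add(self, token):
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.token_to_id)
        return self.token_to_id[token]

    def encode(self, tokens):
        return [self.add(t) for t in tokens]


def make_example(vocab, task, source_tokens, target_tokens):
    """Build one seq2seq training pair; the task prefix tells the model
    which translation direction to perform."""
    src = vocab.encode([task] + source_tokens + ["<eos>"])
    tgt = vocab.encode(["<bos>"] + target_tokens + ["<eos>"])
    return src, tgt


vocab = UnifiedVocab()
# OMR direction: discretized image tokens -> MusicXML-like symbol tokens
src, tgt = make_example(
    vocab, "<omr>",
    ["img:patch_012", "img:patch_045"],
    ["xml:note_C4", "xml:dur_quarter"],
)
```

Because every direction is "just" a token sequence with a different prefix, multi-task training reduces to mixing examples from all tasks into one batch stream, which is what lets the shared model transfer across tasks.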
Problem

Research questions and friction points this paper is trying to address.

Unified translation across score images, symbolic music, and audio modalities
Large-scale dataset and tokenization enable general-purpose multimodal model
Improving accuracy and enabling novel cross-modal music generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model for cross-modal music translation
Large-scale paired audio-score image dataset
Tokenization framework for multiple music modalities
Jongmin Jung
Department of Artificial Intelligence, Sogang University, Seoul, South Korea
Dongmin Kim
Department of Artificial Intelligence, Sogang University, Seoul, South Korea
Sihun Lee
Sogang University
Machine Learning, Music Information Retrieval, Computational Musicology, Artificial Intelligence
Seola Cho
Sogang Future Lab, Sogang University, Seoul, South Korea
Hyungjoon Soh
Department of Physics Education, Seoul National University, Seoul, South Korea
Irmak Bukey
PhD Student, Carnegie Mellon University
Machine Learning for Music, Audio Signal Processing, Music Information Retrieval
Chris Donahue
Assistant Professor, CMU CSD; Research Scientist, Google DeepMind (part time)
Music AI, Audio ML, Music Information Retrieval, Computer Music
Dasaem Jeong
Sogang University
Music Information Retrieval, Expressive Performance Modeling, Machine Learning