🤖 AI Summary
This study addresses a limitation of existing music score image retrieval, which relies predominantly on metadata and lacks effective content-driven methods. The work investigates which visual characteristics of a score are most relevant for retrieval and introduces a systematic method for constructing query datasets from any annotated music score corpus. It compares three retrieval paradigms: transcription-based approaches built on optical music recognition (OMR), a transcription-free Transformer model that recognizes queries directly from score images, and a text-prompted large language model. Experiments across four music score corpora (varying in size, image quality, and typesetting) show that OMR-based methods achieve higher in-domain retrieval performance, whereas transcription-free models are more robust to domain variability.
📝 Abstract
The digitization of musical scores plays a crucial role in their preservation and accessibility, yet information retrieval still depends mainly on metadata searches, such as by title or composer. Content-based search in music score images remains underexplored compared to text documents, despite its potential value for musicians, musicologists, and educators. This work contributes to the field by first studying which characteristics of a score are most relevant for search and by defining a systematic method to build query datasets from any annotated corpus. We also consider diverse methods for content-based search on music score images, ranging from transcription-based approaches relying on Optical Music Recognition (OMR), to a transcription-free Transformer model trained to recognize queries directly from score images, and a text-prompted Large Language Model. Our experiments evaluate these models on four corpora exhibiting diverse characteristics in terms of dataset size, image quality, and typesetting mechanisms. Overall, each method excels under different conditions: OMR-based pipelines achieve higher in-domain retrieval, whereas transcription-free models handle domain variability more effectively.