ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

📅 2026-04-22

🤖 AI Summary
Existing research in multimodal music score processing suffers from fragmented representations and evaluation biases, which hinder accurate assessment of how deeply models understand musical logic. This work proposes ONOTE, the first comprehensive, full-modality benchmark tailored for expert-level music intelligence. By combining standardized pitch projection alignment with a deterministic evaluation protocol, ONOTE enables objective cross-notation assessment across diverse score systems (e.g., staff notation, Jianpu) while mitigating the subjectivity and hallucination risks inherent in LLM-as-a-judge approaches. Empirical evaluations reveal a significant disconnect between perceptual accuracy and music-theoretic comprehension in prevailing multimodal models, establishing ONOTE as a reliable benchmark for diagnosing AI reasoning capabilities in complex, rule-constrained musical contexts.

📝 Abstract
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI because it requires rigorous, multi-dimensional alignment across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to bridge the gap between superficial pattern recognition and the underlying musical logic. This landscape is further complicated by a severe notation bias toward Western staff notation and by the inherent unreliability of "LLM-as-a-judge" metrics, which often mask structural reasoning failures with systemic hallucinations. To establish a more rigorous standard, we introduce ONOTE, a multi-format benchmark that uses a deterministic pipeline, grounded in canonical pitch projection, to eliminate subjective scoring biases across diverse notation systems. Our evaluation of leading omnimodal models exposes a fundamental disconnect between perceptual accuracy and music-theoretic comprehension, providing a necessary framework for diagnosing reasoning vulnerabilities in complex, rule-constrained domains.
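
The abstract does not spell out how canonical pitch projection works, so the following is only an illustrative sketch of the general idea, not the ONOTE pipeline. It assumes that the canonical space is MIDI note numbers, that staff pitches arrive as scientific pitch names (e.g. `C4`), and that Jianpu degrees are written in a hypothetical text encoding where a trailing `'` raises and a trailing `,` lowers the octave. The function names and the exact-match metric are likewise assumptions made for illustration.

```python
# Illustrative sketch only: the encodings and the metric are assumptions,
# not the method described in the ONOTE paper.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]  # semitones above the tonic for degrees 1..7


def staff_to_midi(name: str) -> int:
    """Convert a scientific pitch name such as 'C4' or 'Bb3' to a MIDI number."""
    letter, rest = name[0].upper(), name[1:]
    accidental = 0
    while rest and rest[0] in "#b":
        accidental += 1 if rest[0] == "#" else -1
        rest = rest[1:]
    octave = int(rest)
    return 12 * (octave + 1) + NOTE_OFFSETS[letter] + accidental


def jianpu_to_midi(token: str, tonic: str = "C4") -> int:
    """Convert a Jianpu degree ('1'..'7') to MIDI, relative to a major-key tonic.

    Assumed encoding: each trailing ' raises an octave, each trailing , lowers one.
    """
    degree = int(token[0])
    if not 1 <= degree <= 7:
        raise ValueError(f"not a Jianpu scale degree: {token!r}")
    octave_shift = token.count("'") - token.count(",")
    return staff_to_midi(tonic) + MAJOR_SCALE[degree - 1] + 12 * octave_shift


def pitch_accuracy(predicted: list[int], reference: list[int]) -> float:
    """Deterministic exact-match score; extra or missing notes count as errors."""
    if not predicted and not reference:
        return 1.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / max(len(predicted), len(reference))


# The same arpeggio written in both systems projects to identical pitches.
staff = [staff_to_midi(n) for n in ["C4", "E4", "G4", "C5"]]
jianpu = [jianpu_to_midi(t) for t in ["1", "3", "5", "1'"]]
score = pitch_accuracy(jianpu, staff)
```

Because both notations are reduced to the same integer sequence before comparison, the score depends only on the notes themselves and needs no model-based judge.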
Problem

Research questions and friction points this paper is trying to address.

Omnimodal Notation Processing
music intelligence
notation bias
structural reasoning
multimodal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Omnimodal Notation Processing
ONOTE benchmark
canonical pitch projection
music-theoretic comprehension
notation bias