ABC-Eval: Benchmarking Large Language Models on Symbolic Music Understanding and Instruction Following

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) lack systematic evaluation of their capabilities in understanding text-based symbolic music (particularly ABC notation) and following music-related instructions. Method: We introduce ABC-Eval, the first open-source benchmark for symbolic music understanding, comprising 1,086 annotated samples across 10 structured sub-tasks spanning syntax parsing, rhythmic analysis, key inference, and sequential reasoning. It establishes the first multidimensional evaluation framework tailored to semantic understanding of symbolic music. Evaluation combines human verification with automated metrics across seven state-of-the-art LLMs. Contribution/Results: Our empirical study reveals substantial limitations in LLMs' modeling of musical semantics; notably, each baseline performs consistently across sub-tasks, supporting ABC-Eval's reliability. This work fills a critical gap in LLM evaluation for music understanding and provides a reproducible, extensible assessment infrastructure for future research in AI-driven music intelligence.
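To make the "basic musical syntax comprehension" end of the benchmark concrete, here is a minimal sketch of the kind of header-field question an ABC-notation task might pose. The field letters (X, T, M, L, K) are standard ABC syntax; the sample tune and the parser itself are illustrative, not taken from the paper.

```python
# Hypothetical illustration of ABC syntax: header fields precede the tune body,
# one per line, in the form "<letter>:<value>". The tune below is invented.
ABC_SAMPLE = """X:1
T:Example Reel
M:4/4
L:1/8
K:Gmaj
D2|GABc dBGB|"""

def parse_abc_header(abc: str) -> dict:
    """Collect header fields (lines of the form '<letter>:<value>')."""
    fields = {}
    for line in abc.splitlines():
        if len(line) > 1 and line[1] == ":" and line[0].isalpha():
            fields[line[0]] = line[2:].strip()
    return fields

header = parse_abc_header(ABC_SAMPLE)
print(header["K"])  # key-signature field -> "Gmaj"
print(header["M"])  # meter field -> "4/4"
```

A key-inference sub-task would presumably go beyond reading the K: field (e.g., inferring the key from the notes when the header is withheld), which is where the sequence-level reasoning challenge lies.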

📝 Abstract
As large language models continue to develop, the feasibility and significance of text-based symbolic music tasks have become increasingly prominent. While symbolic music has been widely used in generation tasks, LLM capabilities in understanding and reasoning about symbolic music remain largely underexplored. To address this gap, we propose ABC-Eval, the first open-source benchmark dedicated to evaluating understanding and instruction-following capabilities on text-based ABC notation scores. It comprises 1,086 test samples spanning 10 sub-tasks, covering scenarios from basic musical syntax comprehension to complex sequence-level reasoning. Such a diverse scope poses substantial challenges to models' ability to handle symbolic music tasks. We evaluated seven state-of-the-art LLMs on ABC-Eval, and the results reveal notable limitations in existing models' symbolic music processing capabilities. Furthermore, the consistent performance of individual baselines across different sub-tasks supports the reliability of our benchmark.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs on symbolic music understanding
Evaluating instruction-following in ABC notation tasks
Assessing music syntax comprehension and complex reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source benchmark for symbolic music understanding
Evaluates instruction-following capabilities in ABC notation
Tests 1,086 samples across 10 diverse musical tasks
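Scoring a benchmark of this shape typically reduces to per-sub-task accuracy over (gold, prediction) pairs. The sketch below shows one plausible way to compute that; the paper's actual metrics, data format, and sub-task names are not specified on this page, so everything here is illustrative.

```python
# Hedged sketch: exact-match accuracy per sub-task. Sub-task names and
# answers below are invented, not drawn from ABC-Eval's released data.
from collections import defaultdict

def per_subtask_accuracy(samples):
    """samples: iterable of (subtask, gold_answer, model_answer) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for subtask, gold, pred in samples:
        totals[subtask] += 1
        hits[subtask] += int(pred.strip().lower() == gold.strip().lower())
    return {t: hits[t] / totals[t] for t in totals}

results = per_subtask_accuracy([
    ("key_inference", "G major", "G major"),
    ("key_inference", "D major", "B minor"),
    ("rhythm_analysis", "6/8", "6/8"),
])
print(results)  # {'key_inference': 0.5, 'rhythm_analysis': 1.0}
```

Reporting accuracy per sub-task rather than one aggregate number is what makes the paper's consistency claim checkable: a baseline's scores can be compared across the 10 sub-task columns.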