🤖 AI Summary
This study systematically evaluates the discriminative capabilities of large language models (LLMs) on music information understanding tasks—specifically beat tracking, chord extraction, and key estimation—with a focus on identifying annotation errors.
Method: We propose a concept-enhanced evaluation framework grounded in symbolic music representations and systematic prompt engineering, enabling a quantitative analysis of how consistently LLMs reason with the musical concepts embedded in their prompts.
Contribution/Results: Experiments with GPT achieve error detection accuracies of 65.20%, 64.80%, and 59.72% on beat tracking, chord extraction, and key estimation, respectively, all significantly surpassing random baselines. We observe a positive correlation between the amount of concept information in prompts and detection accuracy. Our findings position LLMs as lightweight, training-free discriminators for music understanding, offering a novel paradigm for music information retrieval and annotation quality assessment.
📝 Abstract
Recent progress in text-based Large Language Models (LLMs) and their extended ability to process multi-modal sensory data have led us to explore their applicability to music information retrieval (MIR) challenges. In this paper, we use a systematic prompt engineering approach for LLMs to solve MIR problems. We convert the music data to symbolic inputs and evaluate LLMs' ability to detect annotation errors in three key MIR tasks: beat tracking, chord extraction, and key estimation. A concept augmentation method is proposed to evaluate the consistency of LLMs' music reasoning with the music concepts provided in the prompts. Our experiments tested the MIR capabilities of Generative Pre-trained Transformers (GPT). Results show that GPT achieves error detection accuracies of 65.20%, 64.80%, and 59.72% in beat tracking, chord extraction, and key estimation, respectively, all exceeding the random baseline. Moreover, we observe a positive correlation between GPT's error detection accuracy and the amount of concept information provided. These findings based on symbolic music input provide a solid foundation for future LLM-based MIR research.
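The pipeline described above (symbolic input, concept-augmented prompt, yes/no error verdict) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the prompt wording, the concept hints, and the generic text-in/text-out `query_llm` callable are all assumptions for the sake of the example.

```python
# Illustrative sketch of prompt-based annotation-error detection.
# The concept hints and prompt phrasing below are hypothetical, not the
# paper's actual prompts.
CONCEPT_HINTS = {
    "beat": "Beats are roughly equally spaced in time; a sudden doubled "
            "or halved interval suggests an annotation error.",
    "chord": "Chord labels combine a root and a quality, e.g. 'C:maj'; "
             "labels should be consistent with the sounding pitches.",
    "key": "A key is defined by a tonic and a mode and should agree with "
           "the prevailing scale degrees of the piece.",
}

def build_prompt(task: str, annotation: str, with_concepts: bool = True) -> str:
    """Assemble a (concept-augmented) prompt asking for an error verdict."""
    lines = [f"Task: {task} annotation verification."]
    if with_concepts and task in CONCEPT_HINTS:
        # Concept augmentation: more concept text in the prompt is the
        # knob the study correlates with detection accuracy.
        lines.append(f"Concept: {CONCEPT_HINTS[task]}")
    lines.append(f"Annotation (symbolic): {annotation}")
    lines.append("Question: does this annotation contain an error? "
                 "Answer 'yes' or 'no'.")
    return "\n".join(lines)

def detect_error(query_llm, task: str, annotation: str) -> bool:
    """Return True if the model flags an error.

    `query_llm` is any text -> text callable (e.g. a wrapper around an
    LLM API); a stub is used below so the sketch runs standalone.
    """
    reply = query_llm(build_prompt(task, annotation))
    return reply.strip().lower().startswith("yes")

# Usage with a stub "model"; a real run would call an LLM here.
stub = lambda prompt: "yes" if "3.00" in prompt else "no"
beats = "0.50, 1.00, 1.50, 3.00"  # the jump to 3.00 breaks the spacing
print(detect_error(stub, "beat", beats))
```

Keeping the model behind a plain callable makes it easy to swap GPT variants or toggle the concept hints on and off when measuring how accuracy varies with the amount of concept information.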