Voices of Civilizations: A Multilingual QA Benchmark for Global Music Understanding

📅 2026-02-28
🤖 AI Summary
This work addresses the lack of effective methods for evaluating how well audio large language models understand the multilingual cultural dimensions of full-length musical pieces, such as regional context, language, emotion, and thematic content. To this end, we introduce Voices of Civilizations, the first multilingual question-answering benchmark dedicated to global music culture, covering 38 languages and 380 tracks. A four-stage automated pipeline, with manual verification after each stage, produces 1,190 multiple-choice questions and establishes the first cross-lingual QA evaluation framework focused explicitly on music cultural comprehension. Our approach leverages large language models for automatic cultural context generation, attribute extraction, and question formulation, complemented by multilingual audio processing and cross-modal alignment techniques. Experiments reveal that current models struggle to interpret nuanced cultural aspects of music without rich textual context and exhibit systematic biases across diverse musical traditions. The dataset is publicly released on Hugging Face.

📝 Abstract
We introduce Voices of Civilizations, the first multilingual QA benchmark for evaluating audio LLMs' cultural comprehension on full-length music recordings. Covering 380 tracks across 38 languages, our automated pipeline yields 1,190 multiple-choice questions through four stages, each followed by manual verification: 1) compiling a representative music list; 2) generating cultural-background documents for each sample in the music list via LLMs; 3) extracting key attributes from those documents; and 4) constructing multiple-choice questions probing language, regional associations, mood, and thematic content. We evaluate models under four conditions and report per-language accuracy. Our findings demonstrate that even state-of-the-art audio LLMs struggle to capture subtle cultural nuances without rich textual context and exhibit systematic biases in interpreting music from different cultural traditions. The dataset is publicly available on Hugging Face to foster culturally inclusive music understanding research.
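
The per-language accuracy mentioned above could be computed along the following lines. This is only a minimal sketch, not the authors' evaluation code: the record keys `language`, `answer`, and `prediction` are hypothetical, and the toy records are invented for illustration rather than taken from the dataset.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return {language: accuracy} for multiple-choice predictions.

    Each record is a dict with the hypothetical keys 'language',
    'answer' (gold option label) and 'prediction' (model's option label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        lang = record["language"]
        total[lang] += 1
        if record["prediction"] == record["answer"]:
            correct[lang] += 1
    # Every language in `total` has at least one question, so no division by zero.
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy records, invented purely for illustration (not from the benchmark).
records = [
    {"language": "sw", "answer": "A", "prediction": "A"},
    {"language": "sw", "answer": "C", "prediction": "B"},
    {"language": "ja", "answer": "D", "prediction": "D"},
]
scores = per_language_accuracy(records)
```

Reporting one score per language, rather than a single pooled accuracy, is what lets systematic differences across cultural traditions show up; the same function could be run once per evaluation condition.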
Problem

Research questions and friction points this paper is trying to address.

cultural comprehension
audio LLMs
multilingual QA
music understanding
cultural bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual QA benchmark
audio LLMs
cultural comprehension
music understanding
automated question generation