🤖 AI Summary
This work addresses the lack of effective methods for evaluating how well audio large language models understand the multilingual cultural dimensions of full-length musical pieces, such as regional context, language, emotion, and thematic content. To this end, we introduce Voices of Civilizations, the first multilingual question-answering benchmark dedicated to global music culture, covering 38 languages and 380 tracks. A four-stage automated pipeline, with manual verification after each stage, yields 1,190 multiple-choice questions and a cross-lingual QA evaluation framework focused explicitly on music cultural comprehension. Our approach leverages large language models for automatic cultural context generation, attribute extraction, and question formulation, complemented by multilingual audio processing and cross-modal alignment techniques. Experiments reveal that current models struggle to accurately interpret nuanced cultural aspects of music without rich textual context and exhibit systematic biases across diverse musical traditions. The dataset is publicly released on Hugging Face.
📝 Abstract
We introduce Voices of Civilizations, the first multilingual QA benchmark for evaluating audio LLMs' cultural comprehension on full-length music recordings. Covering 380 tracks across 38 languages, our automated pipeline yields 1,190 multiple-choice questions through four stages, each followed by manual verification: 1) compiling a representative music list; 2) generating cultural-background documents for each sample in the music list via LLMs; 3) extracting key attributes from those documents; and 4) constructing multiple-choice questions probing language, regional associations, mood, and thematic content. We evaluate models under four conditions and report per-language accuracy. Our findings demonstrate that even state-of-the-art audio LLMs struggle to capture subtle cultural nuances without rich textual context and exhibit systematic biases in interpreting music from different cultural traditions. The dataset is publicly available on Hugging Face to foster culturally inclusive music understanding research.
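The per-language accuracy mentioned above can be illustrated with a minimal Python sketch. The record layout (the `language`, `answer`, and `prediction` fields), the function name, and the toy data are all assumptions made for illustration; they are not taken from the released dataset or the authors' evaluation code.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return {language: accuracy} for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      'language'   - language tag of the track
      'answer'     - gold option label, e.g. 'A'
      'prediction' - option label chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        lang = record["language"]
        total[lang] += 1
        if record["prediction"] == record["answer"]:
            correct[lang] += 1
    # Every language in `total` has at least one record, so no division by zero.
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy records, invented purely for illustration.
toy_records = [
    {"language": "sw", "answer": "B", "prediction": "B"},
    {"language": "sw", "answer": "C", "prediction": "A"},
    {"language": "hi", "answer": "D", "prediction": "D"},
]

accuracy = per_language_accuracy(toy_records)
for lang, acc in sorted(accuracy.items()):
    print(f"{lang}: {acc:.2f}")
```

Reporting accuracy per language rather than as a single pooled number is what makes systematic gaps between musical traditions visible; a pooled score would let well-covered languages mask weak ones.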