TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current audio-language model evaluations predominantly focus on generic speech or globally ubiquitous sounds, neglecting culturally specific non-semantic soundscapes and leaving significant blind spots in localized sound understanding. To address this, we introduce TAU, the first benchmark dedicated to geographically grounded non-semantic sounds. Built from everyday acoustic environments in Taiwan, TAU comprises 702 audio clips and 1,794 culturally grounded multiple-choice questions, collaboratively authored by local domain experts and large language models; answering them requires cultural contextual reasoning rather than speech transcription. Experimental results show that state-of-the-art models, including Gemini 2.5 and Qwen2-Audio, achieve substantially lower accuracy than local human annotators, empirically confirming a cultural perception gap. This work fills a critical void in culturally sensitive audio understanding evaluation and advances a more pluralistic, equitable multimodal assessment paradigm.

📝 Abstract
Large audio-language models are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds, overlooking culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that communities instantly recognize but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline combining curated sources, human editing, and LLM-assisted question generation, producing 702 clips and 1,794 multiple-choice items that cannot be solved by transcripts alone. Experiments show that state-of-the-art LALMs, including Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates the need for localized benchmarks to reveal cultural blind spots, guide more equitable multimodal evaluation, and ensure models serve communities beyond the global mainstream.
Problem

Research questions and friction points this paper is trying to address.

Evaluating audio models on culturally distinctive non-semantic sounds
Assessing model generalization to localized community-recognized audio cues
Revealing cultural blind spots in multimodal AI through localized benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipeline combines curated sources and human editing
Uses LLM-assisted multiple-choice question generation
Benchmark tests cultural sound understanding beyond semantics
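Since TAU is a multiple-choice benchmark, model performance reduces to accuracy over gold-labeled items. A minimal sketch of such scoring (the item fields and IDs below are illustrative, not TAU's actual schema):

```python
# Hypothetical sketch of accuracy scoring for a multiple-choice audio benchmark.
# Field names ("answer", "prediction") are assumptions, not TAU's real format.

def accuracy(items):
    """Fraction of items where the model's chosen option matches the gold answer."""
    if not items:
        return 0.0
    correct = sum(1 for it in items if it["prediction"] == it["answer"])
    return correct / len(items)

items = [
    {"id": "tau-0001", "answer": "B", "prediction": "B"},
    {"id": "tau-0002", "answer": "C", "prediction": "A"},
    {"id": "tau-0003", "answer": "D", "prediction": "D"},
]
print(f"accuracy = {accuracy(items):.3f}")  # accuracy = 0.667
```

Comparing this number against local human annotator accuracy on the same items is what exposes the cultural perception gap the paper reports.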
Yi-Cheng Lin
National Taiwan University
Speech Processing, Machine Learning, Fairness

Yu-Hua Chen
National Taiwan University
Music information retrieval

Jia-Kai Dong
National Taiwan University

Yueh-Hsuan Huang
National Taiwan University

Szu-Chi Chen
National Taiwan University

Yu-Chen Chen
National Taiwan University

Chih-Yao Chen
National Taiwan University

Yu-Jung Lin
National Taiwan University

Yu-Ling Chen
National Taiwan University

Zih-Yu Chen
National Taiwan University

I-Ning Tsai
National Taiwan University

Hsiu-Hsuan Wang
National Taiwan University

Ho-Lam Chung
National Taiwan University

Ke-Han Lu
National Taiwan University
Natural Language Processing, Speech Recognition

Hung-yi Lee
National Taiwan University
deep learning, spoken language understanding, speech processing