🤖 AI Summary
Current audio-language model evaluations predominantly focus on generic speech or globally ubiquitous sounds, neglecting culturally specific non-semantic soundscapes and leaving significant blind spots in localized sound understanding. To address this, we introduce TAU (Taiwan Audio Understanding), the first benchmark dedicated to geographically grounded non-semantic sounds. Built from everyday acoustic environments in Taiwan, TAU comprises 702 audio clips and 1,794 culturally grounded multiple-choice questions, collaboratively authored by local domain experts and large language models; answering them requires cultural contextual reasoning rather than speech transcription. Experimental results show that state-of-the-art models, including Gemini 2.5 and Qwen2-Audio, achieve substantially lower accuracy than local human annotators, confirming a cultural perception gap. This work fills a critical void in culturally sensitive audio understanding evaluation and advances a more pluralistic, equitable multimodal assessment paradigm.
📝 Abstract
Large audio-language models are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds, overlooking culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that communities instantly recognize but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline combining curated sources, human editing, and LLM-assisted question generation, producing 702 clips and 1,794 multiple-choice items that cannot be solved from transcripts alone. Experiments show that state-of-the-art LALMs, including Gemini 2.5 and Qwen2-Audio, perform far below local human annotators. TAU demonstrates the need for localized benchmarks that reveal cultural blind spots, guide more equitable multimodal evaluation, and ensure models serve communities beyond the global mainstream.