AI Summary
This work investigates whether purely text-based large language models (LLMs) inherently encode transferable auditory knowledge and how such knowledge influences audio-language model performance, a question that has lacked systematic exploration. To address this, the authors introduce AKB-2000, the first comprehensive benchmark for evaluating auditory knowledge in both breadth and depth. They systematically analyze the auditory knowledge reservoirs and transfer capabilities of mainstream LLM families through three paradigms: direct probing, cascaded audio-description reasoning, and fine-tuned audio-grounded evaluation. The results reveal substantial variation among LLMs in their implicit auditory knowledge and demonstrate that performance on text-only evaluations strongly predicts downstream audio-task effectiveness. These findings highlight the latent role of text pretraining in supporting audio understanding and offer theoretical grounding for future multimodal model design.
Abstract
Large language models (LLMs) have been widely used as the knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training, and how this knowledge affects downstream performance, remains unclear. We study this gap by comparing different LLMs under three settings, two text-only and one audio-grounded: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions produced by an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into an LALM with an audio encoder. Our findings reveal that auditory knowledge varies substantially across LLM families, and that text-only results correlate strongly with audio performance. Our work provides empirical grounding for a more comprehensive understanding of LLMs in audio research.