PALM-Bench: A Comprehensive Benchmark for Personalized Audio-Language Models

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited capability of existing large audio language models to model personal context, which hinders their performance in personalized question-answering tasks. We formalize, for the first time, the task of Personalized Audio Language Modeling (PALM) and introduce PALM-Bench, the first comprehensive benchmark for evaluating personalized concept recognition and reasoning across multi-speaker, multi-task scenarios. Through systematic experiments on open-source large audio language models—combining training-free prompting with supervised fine-tuning—we reveal significant limitations in current approaches regarding personalized knowledge modeling and cross-task generalization. Our findings establish a clear direction for future research and provide a standardized evaluation framework to advance the development of truly personalized audio language models.

📝 Abstract
Large Audio-Language Models (LALMs) have demonstrated strong performance in audio understanding and generation. Yet our extensive benchmarking reveals that their behavior is largely generic (e.g., summarizing spoken content) and fails to adequately support personalized question answering (e.g., summarizing what my best friend says). In contrast, humans condition their interpretation and decision-making on each individual's personal context. To bridge this gap, we formalize the task of Personalized LALMs (PALM) for recognizing personal concepts and reasoning within personal context. Moreover, we create the first benchmark (PALM-Bench) to foster methodological advances in PALM and enable structured evaluation on several tasks across multi-speaker scenarios. Our extensive experiments on representative open-source LALMs show that existing training-free prompting and supervised fine-tuning strategies, while yielding improvements, remain limited in modeling personalized knowledge and transferring it robustly across tasks. Data and code will be released.
Problem

Research questions and friction points this paper is trying to address.

Personalized Audio-Language Models
Personalized Question Answering
Audio-Language Understanding
Personal Context Reasoning
Multi-speaker Scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Personalized Audio-Language Models
PALM-Bench
personal context reasoning
multi-speaker scenarios
audio-language benchmarking
Yuwen Wang
University of Science and Technology Beijing, China

Xinyuan Qian
Associate Professor, University of Science and Technology Beijing, China
speech processing · multimedia · human-robot interaction

Tian-Hao Zhang
PhD, University of Science & Technology Beijing
Speech LLM · ASR · TTS

Jiaran Gao
University of Science and Technology Beijing, China

Yuchen Pan
University of Science and Technology Beijing, China

Xin Wang
Li Auto, China

Zhou Pan
Li Auto, China

Chen Wei
Li Auto, China

Yiming Wang
Researcher, Deep Visual Learning (DVL) Unit, Fondazione Bruno Kessler
scene understanding · vision & language models · embodied AI