🤖 AI Summary
This work addresses the limited ability of existing large audio language models to model personal context, which hinders their performance on personalized question-answering tasks. We formalize, for the first time, the task of Personalized Audio Language Modeling (PALM) and introduce PALM-Bench, the first comprehensive benchmark for evaluating personalized concept recognition and reasoning across multi-speaker, multi-task scenarios. Through systematic experiments on open-source large audio language models, covering both training-free prompting and supervised fine-tuning, we reveal significant limitations in current approaches to personalized knowledge modeling and cross-task generalization. Our findings establish a clear direction for future research and provide a standardized evaluation framework to advance the development of truly personalized audio language models.
📝 Abstract
Large Audio-Language Models (LALMs) have demonstrated strong performance in audio understanding and generation. Yet our extensive benchmarking reveals that their behavior is largely generic (e.g., summarizing spoken content) and fails to adequately support personalized question answering (e.g., summarizing what my best friend says). In contrast, humans condition their interpretation and decision-making on each individual's personal context. To bridge this gap, we formalize the task of Personalized LALMs (PALM) for recognizing personal concepts and reasoning within personal context. Moreover, we create the first benchmark (PALM-Bench) to foster methodological advances in PALM and enable structured evaluation on several tasks across multi-speaker scenarios. Our extensive experiments on representative open-source LALMs show that existing training-free prompting and supervised fine-tuning strategies, while yielding improvements, remain limited in modeling personalized knowledge and transferring it robustly across tasks. Data and code will be released.