🤖 AI Summary
This work addresses hallucination in auditory large language models (ALLMs) during audio understanding. Two gaps motivate it: existing evaluation protocols are coarse, and existing mitigation strategies rely on costly fine-tuning. The authors propose Noise-Aware In-Context Learning (NAICL), a plug-and-play approach that constructs a noise prior library and retrieves noise examples relevant to the input audio to serve as contextual priors. These priors guide the model toward more conservative generation when acoustic evidence is insufficient. The study also introduces a hallucination benchmark for audio captioning, built on the Clotho-1K multi-event dataset, that supports fine-grained evaluation across four types of auditory hallucination. Without any model fine-tuning, NAICL reduces the overall hallucination rate on audio captioning from 26.53% to 16.98%.
📝 Abstract
Auditory large language models (ALLMs) have demonstrated strong general capabilities in audio understanding and reasoning tasks. However, their reliability is still undermined by hallucination. Existing hallucination evaluation methods are formulated as binary classification tasks, which cannot characterize the more complex hallucination patterns that arise in generative tasks. Moreover, current hallucination mitigation strategies rely on fine-tuning, resulting in high computational costs. To address these limitations, we propose a plug-and-play Noise-Aware In-Context Learning (NAICL) method. Specifically, we construct a noise prior library, retrieve noise examples relevant to the input audio, and incorporate them as contextual priors, thereby guiding the model to reduce speculative associations when acoustic evidence is insufficient and to adopt a more conservative generation strategy. In addition, we establish a hallucination benchmark for audio captioning tasks, which includes the construction of the Clotho-1K multi-event benchmark dataset, the definition of four types of auditory hallucinations, and the introduction of metrics such as hallucination type distribution to support fine-grained analysis. Experimental results show that all evaluated ALLMs exhibit the same hallucination behaviors. Moreover, the proposed NAICL method reduces the overall hallucination rate from 26.53% to 16.98%.
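The retrieve-then-prompt idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `NoiseExample` structure, the use of cosine similarity over fixed-length audio embeddings, the value of `k`, and the prompt wording are all hypothetical, and the toy three-dimensional vectors stand in for embeddings from a real audio encoder.

```python
import math
from dataclasses import dataclass


@dataclass
class NoiseExample:
    """One entry in a hypothetical noise prior library."""
    label: str              # short name for the noise type
    embedding: list[float]  # audio embedding (the encoder is assumed, not specified)
    caption: str            # conservative caption describing the noise clip


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_noise_priors(query, library, k=2):
    """Return the k library entries most similar to the query embedding."""
    ranked = sorted(library, key=lambda ex: cosine(query, ex.embedding), reverse=True)
    return ranked[:k]


def build_prompt(priors, instruction="Describe the audio clip."):
    """Place the retrieved noise examples before the task instruction as context."""
    lines = ["Reference noise examples:"]
    for i, ex in enumerate(priors, 1):
        lines.append(f"{i}. [{ex.label}] {ex.caption}")
    lines.append("Only mention sound events that are clearly audible.")
    lines.append(instruction)
    return "\n".join(lines)


# Toy library and query; real embeddings would come from an audio encoder.
library = [
    NoiseExample("wind", [1.0, 0.0, 0.0], "Steady wind noise."),
    NoiseExample("hum", [0.0, 1.0, 0.0], "A low electrical hum."),
    NoiseExample("static", [0.0, 0.0, 1.0], "Broadband static."),
]
query = [0.9, 0.1, 0.0]
priors = retrieve_noise_priors(query, library, k=2)
prompt = build_prompt(priors)
```

Because the priors are supplied only through the prompt, a scheme of this kind needs no change to model weights, which is consistent with the plug-and-play framing above. How the paper actually encodes audio, scores relevance, and formats the in-context examples may differ from this sketch.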