Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Audio-Language Models (LALMs) frequently exhibit object hallucinations in audio question answering—generating factually inconsistent answers unsupported by the input audio. To address this, we propose Audio-Aware Decoding (AAD), a lightweight, fine-tuning-free inference-time method that dynamically reweights token probabilities based on audio relevance. AAD computes and contrasts token logits with and without audio input, then calibrates the audio-conditioned token generation distribution via contrastive logit adjustment. This represents the first application of contrastive decoding to audio-language joint reasoning, enabling real-time, audio-aware decoding control. On object hallucination benchmarks, AAD improves F1 scores by 0.046–0.428; on general audio QA tasks (e.g., Clotho-AQA), it boosts accuracy by 5.4%–10.3%. The method effectively mitigates hallucinations while preserving inference efficiency and model integrity.
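The contrastive logit adjustment described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `audio_aware_decode_step` and the contrast weight `alpha` are hypothetical, and it assumes the model can produce per-token logits for the same prompt with and without the audio context.

```python
import math

def audio_aware_decode_step(logits_with_audio, logits_without_audio, alpha=1.0):
    """Sketch of contrastive logit adjustment (assumed form):
    promote tokens whose logit rises when the audio is present."""
    # Contrast term: how much each token's logit gains from the audio.
    adjusted = [la + alpha * (la - lt)
                for la, lt in zip(logits_with_audio, logits_without_audio)]
    # Numerically stable softmax over the adjusted logits.
    m = max(adjusted)
    exps = [math.exp(a - m) for a in adjusted]
    z = sum(exps)
    return [e / z for e in exps]
```

With `alpha=0` this reduces to ordinary decoding on the audio-conditioned logits; larger `alpha` more strongly favors tokens whose probability increases only when the audio is present, which is the mechanism AAD uses to suppress hallucinated content.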

📝 Abstract
Large Audio-Language Models (LALMs) can take audio and text as the inputs and answer questions about the audio. While prior LALMs have shown strong performance on standard benchmarks, there has been alarming evidence that LALMs can hallucinate what is presented in the audio. To mitigate the hallucination of LALMs, we introduce Audio-Aware Decoding (AAD), a lightweight inference-time strategy that uses contrastive decoding to compare the token prediction logits with and without the audio context. By contrastive decoding, AAD promotes the tokens whose probability increases when the audio is present. We conduct our experiment on object hallucination datasets with three LALMs and show that AAD improves the F1 score by 0.046 to 0.428. We also show that AAD can improve the accuracy on general audio QA datasets like Clotho-AQA by 5.4% to 10.3%. We conduct thorough ablation studies to understand the effectiveness of each component in AAD.
Problem

Research questions and friction points this paper is trying to address.

Reducing object hallucination in Large Audio-Language Models
Improving audio-aware decoding for accurate token prediction
Enhancing performance on audio QA datasets via contrastive decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses contrastive decoding for audio-aware token prediction
Compares logits with and without audio context
Improves F1 score by 0.046–0.428 and audio QA accuracy by 5.4%–10.3%
Tzu-wen Hsu
Department of Computer Science, Purdue University, United States
Ke-Han Lu
National Taiwan University
Natural Language Processing, Speech Recognition
Cheng-Han Chiang
Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
Hung-yi Lee
National Taiwan University
deep learning, spoken language understanding, speech processing