🤖 AI Summary
This work addresses "audio hallucination" in audio-visual language models, where a model over-relies on visual cues and generates descriptions inconsistent with the actual audio, by proposing the Audio-Contrastive Preference Optimization (ACPO) framework. ACPO introduces, for the first time, a dual-axis contrastive preference learning mechanism. At the output level, it penalizes generations that present visual information as if it were audio evidence. At the input level, it swaps the audio track and penalizes outputs that stay the same regardless of the genuine audio content. Together, the two objectives explicitly suppress audio hallucination: they break the model's dependence on visual shortcuts and significantly improve audio grounding accuracy, without compromising multimodal comprehension or generation capabilities.
📝 Abstract
While Audio-Visual Language Models (AVLMs) have achieved remarkable progress in recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generations that are invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising general multimodal capabilities.
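The abstract does not give the loss itself, so the following is only a minimal sketch of how a dual-axis preference objective of this kind might be written, assuming a DPO-style log-sigmoid formulation with a frozen reference model. The function names, the `beta` and `input_weight` parameters, and the simple weighted sum of the two terms are all illustrative assumptions, not the authors' implementation. Each argument is a `(policy_logp, reference_logp)` pair of summed sequence log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_term(chosen, rejected, beta):
    """DPO-style term: -log sigmoid(beta * (chosen margin - rejected margin)).

    `chosen` and `rejected` are (policy_logp, reference_logp) pairs.
    """
    chosen_margin = chosen[0] - chosen[1]
    rejected_margin = rejected[0] - rejected[1]
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


def dual_axis_loss(grounded_true_audio, visual_as_audio_true_audio,
                   grounded_swapped_audio, beta=0.1, input_weight=1.0):
    """Hypothetical combination of the two contrastive axes.

    Output axis: same (video, true audio) input; prefer the audio-grounded
    response over a response that states visual content as audio fact.
    Input axis: same audio-grounded response; prefer it under the true
    audio over the same response under a swapped audio track.
    """
    output_term = preference_term(
        grounded_true_audio, visual_as_audio_true_audio, beta)
    input_term = preference_term(
        grounded_true_audio, grounded_swapped_audio, beta)
    return output_term + input_weight * input_term
```

Under these assumptions, a policy identical to the reference gives each term the value log 2, and the loss drops as the policy assigns relatively more probability to the grounded response under the true audio than to either rejected alternative. The input-axis term is what makes audio-insensitive generation costly: if swapping the audio leaves the response's log-probability unchanged, that term cannot fall below log 2.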