Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses "audio hallucination" in audio-visual language models, where a model over-relies on visual cues and generates descriptions that are inconsistent with the actual audio. It proposes the Audio-Contrastive Preference Optimization (ACPO) framework, which introduces, for the first time, a dual-axis contrastive preference learning mechanism. At the output level, ACPO penalizes generations that erroneously treat visual information as audio evidence. At the input level, it uses an audio-swapping strategy to penalize outputs that are insensitive to the genuine audio content. Together, the two objectives explicitly suppress audio hallucination and disrupt the model's dependence on visual shortcuts, improving audio grounding accuracy without compromising multimodal comprehension or generation capabilities.
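The audio-swapping strategy can be illustrated with a small sketch. It only shows one plausible way to build the swapped inputs: each video keeps its frames but receives the audio track of a different clip. The `Clip` container, the `swap_audio` helper, and the cyclic-shift scheme are hypothetical and are not taken from the paper, which may select swapped audio differently.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical container: identifiers for one video and its true audio."""
    video_id: str
    audio_id: str


def swap_audio(clips, seed=0):
    """Pair every video with the audio track of a different clip.

    Assumption for illustration: a random cyclic shift over a shuffled
    order, which guarantees that no clip keeps its own audio track.
    Requires at least two clips.
    """
    if len(clips) < 2:
        raise ValueError("need at least two clips to swap audio")
    rng = random.Random(seed)
    order = list(range(len(clips)))
    rng.shuffle(order)
    # A shift in [1, n-1] never maps a position onto itself.
    shift = rng.randrange(1, len(clips))
    swapped = [None] * len(clips)
    for pos, idx in enumerate(order):
        donor = order[(pos + shift) % len(clips)]
        swapped[idx] = Clip(clips[idx].video_id, clips[donor].audio_id)
    return swapped


clips = [Clip(f"v{i}", f"a{i}") for i in range(4)]
swapped = swap_audio(clips)
```

Each entry of `swapped` is the mismatched counterpart of the clip at the same index in `clips`. A response that stays equally likely under both inputs is insensitive to the audio, which is the behavior the input-level objective is described as penalizing.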

📝 Abstract

While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.
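The abstract does not give the loss formula, so the following is only a sketch of how a dual-axis preference objective could look if both axes used a standard DPO-style log-sigmoid term. The function names, the `beta` and `weight` parameters, and the way the two terms are summed are assumptions for illustration, not the authors' stated method.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_term(policy_pref, policy_rej, ref_pref, ref_rej, beta=0.1):
    """Standard DPO loss for one preference pair.

    All four arguments are sequence log-probabilities: the preferred and
    rejected item under the trained policy and under a frozen reference.
    """
    margin = (policy_pref - ref_pref) - (policy_rej - ref_rej)
    return -log_sigmoid(beta * margin)


def dual_axis_loss(output_pair, input_pair, beta=0.1, weight=1.0):
    """Hypothetical combination of the two contrastive axes.

    output_pair: log-probs of an audio-faithful response (preferred) versus
        a visually driven response (rejected), same video and true audio.
    input_pair: log-probs of the same faithful response given the true
        audio (preferred) versus a swapped audio track (rejected).
    Each pair is (policy_pref, policy_rej, ref_pref, ref_rej).
    """
    output_loss = dpo_term(*output_pair, beta=beta)
    input_loss = dpo_term(*input_pair, beta=beta)
    return output_loss + weight * input_loss


# Made-up log-probabilities, purely to show the call shape.
loss = dual_axis_loss(
    output_pair=(-12.0, -15.0, -13.0, -14.0),
    input_pair=(-12.0, -12.5, -13.0, -12.8),
)
```

Under this formulation, the input-level term only decreases when the faithful response becomes more likely with the true audio than with the swapped track, so a model that ignores the audio cannot reduce it.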

Problem

Research questions and friction points this paper is trying to address.

audio-visual language models

cross-modal hallucination

audio hallucination

visual dominance

audio grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Visual Language Models

Audio Hallucination

Preference Optimization