Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

📅 2026-05-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the susceptibility of existing audio-visual large language models to cross-modal interference during intermediate reasoning, where information from one modality misguides the interpretation of the other and induces hallucinations. To mitigate this issue, the authors propose a "separate-then-fuse" framework that first performs modality-specific chain-of-thought reasoning independently on the audio and visual inputs, then fuses the resulting evidence to generate an answer. They also build a data-driven annotation pipeline that produces instance-level modality-preference labels, and use those labels as an auxiliary reward in reinforcement learning. The method yields an average relative gain of 5.16% on general audio-visual question answering (AVQA) benchmarks and 11.17% on a cross-modal hallucination benchmark, improving both accuracy and robustness.
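The preference-label idea can be illustrated with a minimal sketch. The summary does not specify the labeling rule, the label set, or how the auxiliary reward is weighted, so everything below (the names `preference_label` and `total_reward`, the three labels, and `aux_weight`) is a hypothetical assumption for illustration, not the authors' pipeline.

```python
from typing import Literal

# Hypothetical label set; the source only says labels capture
# instance-level modality preference.
Preference = Literal["audio", "visual", "balanced"]


def preference_label(audio_only_correct: bool, visual_only_correct: bool) -> Preference:
    """Assumed rule: compare answers obtained under single-modality inputs.

    If exactly one modality suffices on its own, prefer that modality;
    otherwise treat the instance as having no single preferred modality.
    """
    if audio_only_correct and not visual_only_correct:
        return "audio"
    if visual_only_correct and not audio_only_correct:
        return "visual"
    return "balanced"


def total_reward(
    answer_correct: bool,
    predicted_pref: Preference,
    label: Preference,
    aux_weight: float = 0.5,  # assumed weight, not taken from the paper
) -> float:
    """Assumed reward shape: answer accuracy plus a weighted preference match."""
    accuracy = 1.0 if answer_correct else 0.0
    auxiliary = 1.0 if predicted_pref == label else 0.0
    return accuracy + aux_weight * auxiliary
```

In this sketch, a rollout that answers correctly and also leans on the labeled modality scores higher than one that answers correctly while relying on the other modality, which is one plausible way an auxiliary reward could encourage instance-dependent preferences.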
📝 Abstract
Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and then integrating their evidence to answer. We construct modality-preference labels via a data pipeline that evaluates each instance under different modality input settings, and we use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.
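One way to picture "isolation during separated reasoning, full access at fusion" is as a segment-aware causal attention mask. The abstract does not say how the mechanism is implemented, so the sketch below is only an assumption: the segment names, the `ALLOWED` visibility table, and `build_mask` are hypothetical and chosen for illustration.

```python
# Hypothetical visibility table: for each segment a query token belongs to,
# the set of segments whose earlier tokens it may attend to.
ALLOWED = {
    "question": {"question"},
    "audio_in": {"question", "audio_in"},
    "visual_in": {"question", "visual_in"},
    # Separated reasoning stage: each trace sees only its own modality.
    "audio_cot": {"question", "audio_in", "audio_cot"},
    "visual_cot": {"question", "visual_in", "visual_cot"},
    # Fusion stage: full access to both inputs and both reasoning traces.
    "fusion": {"question", "audio_in", "visual_in", "audio_cot", "visual_cot", "fusion"},
}


def build_mask(segments: list[str]) -> list[list[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Combines the usual causal constraint (j <= i) with the assumed
    segment-level visibility rules in ALLOWED.
    """
    n = len(segments)
    return [
        [j <= i and segments[j] in ALLOWED[segments[i]] for j in range(n)]
        for i in range(n)
    ]


# One token per segment, in generation order, to keep the example small.
tokens = ["question", "audio_in", "visual_in", "audio_cot", "visual_cot", "fusion"]
mask = build_mask(tokens)
```

Under these assumed rules, the audio reasoning token cannot attend to the visual input (`mask[3][2]` is `False`), the visual reasoning token cannot attend to the audio trace (`mask[4][3]` is `False`), and the fusion token can attend to every earlier position.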
Problem

Research questions and friction points this paper is trying to address.

cross-modal interference
audio-visual reasoning
multimodal hallucination
modality-specific reasoning
audio-visual question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal interference
modality-specific chain-of-thought
Separate First Fuse Later
audio-visual reasoning
reinforcement learning with modality preference