MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

πŸ“… 2026-03-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the susceptibility of omni-modal large language models (omni LLMs) to cross-modal hallucinations in audio-visual understanding, which arise primarily from spurious inter-modal associations and dominant language priors. To mitigate this, the authors propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), which introduces modality decoupling into the preference optimization framework for the first time. MoD-DPO employs a modality-aware regularization term to explicitly disentangle the influence of relevant and irrelevant modalities, and incorporates a language-prior debiasing penalty to enhance the model's sensitivity to genuine multimodal signals. Evaluated on multiple audio-visual hallucination benchmarks, the method significantly improves perceptual accuracy and robustness against hallucinations, outperforming existing preference optimization approaches at comparable training cost.

πŸ“ Abstract
Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
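The abstract describes three components layered on a standard DPO objective: an invariance term for corruptions of irrelevant modalities, a sensitivity term for perturbations of relevant modalities, and a language-prior debiasing penalty. The sketch below illustrates how such an objective could be composed. The base DPO loss follows the standard formulation; the three regularizer shapes, the margin value, and the `lam_*` weights are assumptions for illustration, since the paper's exact formulations are not given here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on sequence log-probs of the chosen (w) and
    rejected (l) responses under the policy and a frozen reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def mod_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 logp_w_irrel_corrupt,   # chosen log-prob, irrelevant modality corrupted
                 logp_w_rel_corrupt,     # chosen log-prob, relevant modality corrupted
                 logp_w_text_only,       # chosen log-prob, text-only input
                 beta=0.1, lam_inv=0.1, lam_sens=0.1, lam_debias=0.1):
    """Hypothetical MoD-DPO-style objective (shapes assumed):
    (1) invariance: chosen-response likelihood should not shift when an
        irrelevant modality is corrupted;
    (2) sensitivity: likelihood should drop by a margin when the relevant
        modality is corrupted (hinge, margin 1.0 assumed);
    (3) debias: penalize the chosen response remaining likely from text alone."""
    base = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    invariance = abs(logp_w - logp_w_irrel_corrupt)
    sensitivity = max(0.0, logp_w_rel_corrupt - logp_w + 1.0)
    debias = max(0.0, logp_w_text_only - logp_l)
    return base + lam_inv * invariance + lam_sens * sensitivity + lam_debias * debias
```

All three penalties are non-negative, so the composite loss never falls below the base DPO term; in practice the corrupted-input log-probs would come from extra forward passes on modality-masked versions of the same example.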
Problem

Research questions and friction points this paper is trying to address.

cross-modal hallucinations
omni LLMs
modality grounding
language priors
multimodal foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality Decoupling
Cross-modal Hallucination
Preference Optimization
Multimodal Alignment
Language Prior Debiasing
πŸ”Ž Similar Papers
No similar papers found.