🤖 AI Summary
Omni-modal large language models (OLLMs) suffer from two types of hallucination: (1) text-prior dominance, causing neglect of audiovisual cues; and (2) separate modeling of the audio and visual modalities, impairing understanding of cross-modal implicit associations (e.g., auditory cues embedded in video). This paper introduces OmniDPO, a preference-optimization framework targeting multimodal hallucination in OLLMs. It uses a dual preference-alignment mechanism that constructs both text-level and multimodal preference sample pairs, explicitly modeling joint audiovisual semantics and strengthening attention to non-textual modalities. The method integrates preference learning, cross-modal alignment, and audiovisual co-representation. Evaluated on two state-of-the-art OLLMs, it significantly reduces hallucination rates, improves multimodal grounding accuracy, and enhances robustness in cross-modal reasoning, establishing a path toward more trustworthy multimodal perception.
📝 Abstract
Recently, omni-modal large language models (OLLMs) have sparked a new wave of research, achieving impressive results in tasks such as audio-video understanding and real-time environment perception. However, hallucination issues persist. As in the bimodal setting, priors from the text modality tend to dominate, leading OLLMs to rely heavily on textual cues while neglecting visual and audio information. In addition, fully multimodal scenarios introduce new challenges. Most existing models align the visual or auditory modality with text independently during training, ignoring the intrinsic correlations between a video and its corresponding audio. This oversight produces hallucinations when reasoning requires interpreting hidden audio cues embedded in video content. To address these challenges, we propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in OLLMs. Specifically, OmniDPO incorporates two strategies: (1) constructing text-preference sample pairs to enhance the model's understanding of audio-video interactions; and (2) constructing multimodal-preference sample pairs to strengthen the model's attention to visual and auditory information. By tackling both challenges, OmniDPO effectively improves multimodal grounding and reduces hallucination. Experiments conducted on two OLLMs demonstrate that OmniDPO not only effectively mitigates multimodal hallucinations but also significantly enhances the models' reasoning capabilities across modalities. All code and datasets will be released upon paper acceptance.
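OmniDPO builds on preference alignment in the DPO style: each preference pair (a chosen and a rejected response) is scored by the policy's log-probability margin over a frozen reference model. The sketch below illustrates the standard DPO objective for a single pair; the function and variable names are our own illustrative choices, not taken from the paper, and the paper's actual loss over text- and multimodal-preference pairs may differ.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_*     : policy log-probabilities of the chosen / rejected response
    ref_logp_* : frozen reference-model log-probabilities of the same responses
    beta       : scale controlling how far the policy may drift from the reference
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin: loss shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A pair where the policy already prefers the chosen response...
low = dpo_loss(-2.0, -5.0, -3.0, -4.0)
# ...incurs a lower loss than one where it prefers the rejected response.
high = dpo_loss(-5.0, -2.0, -3.0, -4.0)
print(low < high)  # True
```

In OmniDPO's setting, the rejected responses would come from the two strategies above, e.g. answers conditioned on text priors alone or on a single modality, so that minimizing this loss pushes the model toward audio-visually grounded answers.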