Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses compositional hallucinations in video multimodal large language models: errors that arise from complex interactions among multiple spatiotemporal factors and for which effective mitigation mechanisms are still lacking. The study gives the first systematic definition of this hallucination class and constructs the OmniVCHall benchmark, enabling comprehensive evaluation of both isolated and compositional hallucinations. To mitigate them, the authors propose TriCD, a contrastive decoding framework that integrates dynamic perturbation control, saliency-guided attention, and a three-path calibration mechanism to suppress complex hallucinatory outputs. Experiments show that TriCD improves average accuracy by over 10% on two mainstream model backbones, significantly enhancing reasoning reliability in complex video understanding scenarios.
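
For intuition, below is a minimal sketch of what a triple-pathway contrastive decoding step could look like. It is not the paper's implementation: the choice of three paths (original video, perturbed negative variant, video-free language prior), the `model(tokens, video=...)` interface, and the coefficients `alpha` and `beta` are all assumptions made for illustration.

```python
# A minimal sketch of a triple-pathway contrastive decoding step, NOT the
# paper's implementation. Assumptions: the three paths are (1) the original
# video, (2) a perturbed negative variant, and (3) a video-free language-prior
# pass; the `model(tokens, video=...)` interface and the coefficients
# `alpha`/`beta` are hypothetical.
def tricd_decode_step(model, tokens, video, neg_video, alpha=1.0, beta=0.5):
    """Return next-token logits calibrated across three decoding paths."""
    logits_pos = model(tokens, video=video)      # grounded path
    logits_neg = model(tokens, video=neg_video)  # perturbed/negative path
    logits_txt = model(tokens, video=None)       # language-prior path

    # Contrast the grounded path against the negative and text-only paths so
    # that tokens supported only by spurious cues or language priors are
    # down-weighted while visually grounded tokens are boosted.
    return (1 + alpha + beta) * logits_pos - alpha * logits_neg - beta * logits_txt
```

This extends the standard contrastive-decoding form (1+α)·logits_pos − α·logits_neg with a second contrast term; the actual TriCD calibration rule may differ.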

📝 Abstract
Current research on video hallucination mitigation primarily focuses on isolated error types, leaving compositional hallucinations, which arise from incorrect reasoning over multiple interacting spatial and temporal factors, largely underexplored. We introduce OmniVCHall, a benchmark designed to systematically evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). OmniVCHall spans diverse video domains, introduces a novel camera-based hallucination type, and defines a fine-grained taxonomy, together with adversarial answer options (e.g., "All are correct" and "None of the above") to prevent shortcut reasoning. Evaluations of 39 representative VLLMs reveal that even advanced models (e.g., Qwen3-VL and GPT-5) exhibit substantial performance degradation. We propose TriCD, a contrastive decoding framework with a triple-pathway calibration mechanism. An adaptive perturbation controller dynamically selects distracting operations to construct negative video variants, while a saliency-guided enhancement module adaptively reinforces grounded token-wise visual evidence. These components are optimized via reinforcement learning to encourage precise decision-making under compositional hallucination settings. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%. The data and code can be found at https://github.com/BMRETURN/OmniVCHall.
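
As a rough illustration of how negative video variants might be constructed, the sketch below samples one distracting operation per video. The operation set (temporal shuffle, frame drop, spatial occlusion) and the policy interface are hypothetical stand-ins: the paper learns its adaptive perturbation controller with reinforcement learning rather than hand-picking operations.

```python
# Hypothetical sketch of building a negative video variant by sampling one
# distracting operation. The three operations and the policy interface are
# illustrative stand-ins, not the paper's operation set.
import torch

def temporal_shuffle(frames: torch.Tensor) -> torch.Tensor:
    # frames: (T, C, H, W); permute frame order to break temporal cues
    return frames[torch.randperm(frames.size(0))]

def frame_drop(frames: torch.Tensor, keep: float = 0.5) -> torch.Tensor:
    # keep a random subset of frames to weaken event coverage
    t = frames.size(0)
    idx = torch.randperm(t)[: max(1, int(t * keep))].sort().values
    return frames[idx]

def spatial_occlusion(frames: torch.Tensor, ratio: float = 0.4) -> torch.Tensor:
    # zero out one random patch in every frame to disrupt object evidence
    _, _, h_full, w_full = frames.shape
    h, w = int(h_full * ratio), int(w_full * ratio)
    top = int(torch.randint(0, h_full - h + 1, (1,)))
    left = int(torch.randint(0, w_full - w + 1, (1,)))
    out = frames.clone()
    out[:, :, top : top + h, left : left + w] = 0.0
    return out

def perturb(frames: torch.Tensor, policy_logits: torch.Tensor) -> torch.Tensor:
    """Sample one operation; `policy_logits` (a 3-vector here) stands in for
    the output of the RL-trained adaptive perturbation controller."""
    ops = (temporal_shuffle, frame_drop, spatial_occlusion)
    choice = torch.distributions.Categorical(logits=policy_logits).sample()
    return ops[int(choice)](frames)
```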
Problem

Research questions and friction points this paper is trying to address.

compositional hallucination
video multimodal large language models
hallucination mitigation
spatial-temporal reasoning
benchmark evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional hallucination
video multimodal LLMs
contrastive decoding
adaptive perturbation
saliency-guided enhancement
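
For the last item above, here is a hand-rolled approximation of saliency-guided enhancement, assuming saliency can be read off as the attention mass each visual token receives from the text tokens; TriCD's learned module is not reproduced here, and `gamma` is an illustrative hyperparameter.

```python
# Approximation of saliency-guided enhancement under the assumption that
# saliency equals aggregated text-to-visual attention; `gamma` is hypothetical.
import torch

def enhance_visual_tokens(visual_feats: torch.Tensor,
                          attn_weights: torch.Tensor,
                          gamma: float = 0.3) -> torch.Tensor:
    """
    visual_feats: (N_vis, D) visual token embeddings
    attn_weights: (N_txt, N_vis) text-to-visual attention from the backbone
    Reinforces visual tokens in proportion to their normalized saliency.
    """
    saliency = attn_weights.mean(dim=0)            # (N_vis,) per-token mass
    saliency = saliency / (saliency.max() + 1e-6)  # scale to [0, 1]
    return visual_feats * (1.0 + gamma * saliency).unsqueeze(-1)
```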