VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of high-confidence hallucinations in video vision-language models (Video-VLMs) during question answering, which existing uncertainty metrics struggle to detect reliably. To this end, the authors propose VideoHEDGE, a framework that generates multiple answer hypotheses through spatiotemporal and photometric perturbations combined with high-temperature sampling. It constructs an answer distribution via semantic clustering, using either embedding similarity or natural language inference (NLI), and scores response reliability with semantic entropy as well as a novel metric, Vision-Amplified Semantic Entropy (VASE). The study is the first to extend entropy-based uncertainty estimation to temporally grounded video inputs, and it demonstrates that embedding-based clustering can effectively replace costly NLI methods. Evaluated on the SoccerChat benchmark, VASE achieves state-of-the-art ROC-AUC across several 7B-scale Video-VLMs, significantly outperforming baselines under strong perturbations, and the authors release the hedge-bench library to support reproducible evaluation.
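The clustering step described above can be sketched in a few lines. This is a minimal illustration only: the `embed` inputs, the greedy first-fit rule, and the 0.8 cosine threshold are assumptions for the sketch, not the paper's actual embedding model or clustering settings.

```python
import math

def cosine(u, v):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_answers(embeddings, threshold=0.8):
    """Greedy single-pass semantic clustering: an answer joins the first
    cluster whose representative (first member) is similar enough,
    otherwise it starts a new cluster. Returns lists of answer indices."""
    clusters = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(embeddings[cluster[0]], emb) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy embeddings standing in for sampled answers: two near-paraphrases
# and one semantically distinct answer.
embs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(cluster_answers(embs))  # → [[0, 1], [2]]
```

In the actual framework the inputs would be embeddings of answers sampled from clean and perturbed clips; the resulting cluster sizes give the probability masses over semantic hypotheses.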

📝 Abstract
Hallucinations in video-capable vision-language models (Video-VLMs) remain frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness. We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. Given a video-question pair, VideoHEDGE draws a baseline answer and multiple high-temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)-based or embedding-based methods. Cluster-level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE). We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM-as-a-judge to obtain binary hallucination labels. Across three 7B Video-VLMs (Qwen2-VL, Qwen2.5-VL, and a SoccerChat-finetuned model), VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. We further show that embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and that domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE#videohedge.
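The Semantic Entropy score mentioned in the abstract can be illustrated directly from cluster-level probability masses. The sketch below assumes masses are the empirical frequencies of sampled answers per cluster, which may differ from the paper's exact weighting; it does not attempt VASE or RadFlag, whose formulas are not given here.

```python
import math

def semantic_entropy(cluster_sizes):
    """Shannon entropy (nats) over the distribution of semantic clusters.
    cluster_sizes: number of sampled answers falling in each cluster."""
    total = sum(cluster_sizes)
    masses = [s / total for s in cluster_sizes]
    return -sum(p * math.log(p) for p in masses if p > 0)

# All 8 samples land in one cluster: the model is semantically consistent,
# so entropy is zero (low hallucination risk under this score).
consistent = semantic_entropy([8])

# Samples split evenly across 4 clusters: maximal disagreement,
# entropy equals log(4) ≈ 1.386 nats (high hallucination risk).
dispersed = semantic_entropy([2, 2, 2, 2])
print(consistent, dispersed)
```

A higher score means the sampled answers disagree semantically, which the paper uses as a signal that the baseline answer may be hallucinated.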
Problem

Research questions and friction points this paper is trying to address.

hallucination detection
video vision-language models
uncertainty estimation
semantic reliability
video question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

VideoHEDGE
semantic clustering
spatiotemporal perturbations
hallucination detection
entropy-based reliability