Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models

πŸ“… 2025-08-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing LVLM evaluation benchmarks predominantly assess overall performance, overlooking contextual positional bias, a critical yet systematically under-explored issue in video-language modeling. Method: We introduce Video-LevelGauge, the first dedicated benchmark for quantifying positional bias in video-language models. It employs controlled context configurations (with flexible control over context length, probe position, and context type), a manually curated and diverse video dataset, and standardized probing tasks spanning multiple-choice and open-ended question formats, all validated for their effectiveness in exposing positional bias. Bias is then characterized with an analysis method that combines statistical measures with morphological pattern recognition. Contribution/Results: Evaluating 27 mainstream LVLMs reveals that many leading open-source models exhibit significant head or proximity bias, whereas commercial models such as Gemini 2.5 Pro demonstrate robust positional invariance. This work establishes a reproducible methodological framework and an empirical foundation for bias attribution, model diagnostics, and robustness enhancement in video-language understanding.
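The controlled-context idea above can be pictured as a simple probing loop: the same probe clip is slid across slots in an otherwise fixed context of distractor clips, and accuracy is recorded per slot. The following is a minimal Python sketch of that idea; `answer_fn`, the slot layout, and the scoring rule are hypothetical stand-ins, not the paper's actual protocol or API.

```python
# Minimal sketch of positional probing, assuming a generic LVLM wrapper
# answer_fn(frames, question) -> answer string. All names are hypothetical.
from typing import Callable, List, Sequence

def probe_accuracy_by_position(
    answer_fn: Callable[[list, str], str],
    probe_clip: list,             # frames the question actually refers to
    distractors: Sequence[list],  # needs at least num_slots - 1 padding clips
    question: str,
    gold: str,
    num_slots: int = 5,
) -> List[float]:
    scores = []
    for slot in range(num_slots):
        pool = iter(distractors)
        context: list = []
        for s in range(num_slots):
            # Place the probe at the current slot; fill the rest with padding.
            context.extend(probe_clip if s == slot else next(pool))
        pred = answer_fn(context, question)
        scores.append(float(pred.strip().lower() == gold.strip().lower()))
    return scores  # e.g. [1.0, 1.0, 0.0, 0.0, 0.0] hints at head bias
```

Averaged over many items, these per-slot scores yield the per-position accuracy curve that any bias analysis would operate on.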

πŸ“ Abstract
Large video language models (LVLMs) have made notable progress in video understanding, spurring the development of corresponding evaluation benchmarks. However, existing benchmarks generally assess overall performance across entire video sequences, overlooking nuanced behaviors such as contextual positional bias, a critical yet under-explored aspect of LVLM performance. We present Video-LevelGauge, a dedicated benchmark designed to systematically assess positional bias in LVLMs. We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types to simulate diverse real-world scenarios. In addition, we introduce a comprehensive analysis method that combines statistical measures with morphological pattern recognition to characterize bias. Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions, validated for their effectiveness in exposing positional bias. Based on these, we evaluate 27 state-of-the-art LVLMs, including both commercial and open-source models. Our findings reveal significant positional biases in many leading open-source models, typically exhibiting head or neighbor-content preferences. In contrast, commercial models such as Gemini 2.5 Pro show impressive, consistent performance across entire video sequences. Further analyses on context length, context variation, and model scale provide actionable insights for mitigating bias and guiding model enhancement.
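At its simplest, the abstract's pairing of statistical measures with morphological pattern recognition can be read as summary statistics over the per-position accuracy curve plus a check on the curve's shape. The sketch below only illustrates that reading; the thresholds, labels, and heuristics are assumptions, not the paper's actual analysis method.

```python
# Toy characterization of a per-position accuracy curve: spread as the
# statistical bias magnitude, plus a crude shape ("morphological") label.
# Thresholds and labels are illustrative assumptions.
import statistics

def characterize_bias(acc_by_pos: list[float], flat_tol: float = 0.03) -> dict:
    mean = statistics.fmean(acc_by_pos)
    spread = max(acc_by_pos) - min(acc_by_pos)
    if spread <= flat_tol:
        shape = "flat"  # positionally invariant behavior
    else:
        peak = acc_by_pos.index(max(acc_by_pos))
        if peak == 0:
            shape = "head"    # early context preferred
        elif peak == len(acc_by_pos) - 1:
            shape = "tail"    # content nearest the query preferred
        else:
            shape = "middle"
    return {"mean": mean, "spread": spread, "shape": shape}

print(characterize_bias([0.82, 0.71, 0.65, 0.60, 0.58]))
# -> {'mean': 0.672, 'spread': ~0.24, 'shape': 'head'}
```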
Problem

Research questions and friction points this paper is trying to address.

Assessing positional bias in large video language models
Evaluating contextual influence across video sequences
Identifying performance variations based on probe position
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized probes and contextual setups
Statistical and morphological pattern recognition analysis
Manually curated videos with multiple question types
πŸ”Ž Similar Papers
H
Hou Xia
University of Science and Technology of China, Hefei, China
Zheren Fu
Zheren Fu
University of Science and Technology of China
Multi-modal LearningVision-Language ModelAI Security
F
Fangcan Ling
University of Science and Technology of China, Hefei, China
J
Jiajun Li
HUAWEI, Shanghai, China
Yi Tu
Yi Tu
Ant Group
Computer VisionDocument UnderstandingVision Language Model
Zhendong Mao
Zhendong Mao
University of Science and Technology of China
CV,NLP
Y
Yongdong Zhang
University of Science and Technology of China, Hefei, China