Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models

πŸ“… 2025-08-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing LVLM evaluation benchmarks predominantly assess overall performance, overlooking contextual positional bias, a critical yet systematically under-explored issue in video-language modeling. Method: We introduce Video-LevelGauge, the first dedicated benchmark for quantifying positional bias in video-language models. It employs controlled context configurations (with flexible control over context length, probe position, and context type), a manually curated and diverse video dataset, and standardized probing tasks spanning multiple-choice and open-ended question formats, all validated for their effectiveness in exposing positional bias. Bias is then characterized with an analysis method that combines statistical measures with morphological pattern recognition. Contribution/Results: Evaluating 27 mainstream LVLMs reveals that many leading open-source models exhibit significant head or proximity bias, whereas commercial models such as Gemini 2.5 Pro demonstrate robust positional invariance. This work establishes a reproducible methodological framework and an empirical foundation for bias attribution, model diagnostics, and robustness enhancement in video-language understanding.
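The controlled-context idea above can be pictured as a simple probing loop: the same probe clip is slid across slots in an otherwise fixed context of distractor clips, and accuracy is recorded per slot. The following is a minimal Python sketch of that idea; `answer_fn`, the slot layout, and the scoring rule are hypothetical stand-ins, not the paper's actual protocol or API.

```python
# Minimal sketch of positional probing, assuming a generic LVLM wrapper
# answer_fn(frames, question) -> answer string. All names are hypothetical.
from typing import Callable, List, Sequence

def probe_accuracy_by_position(
    answer_fn: Callable[[list, str], str],
    probe_clip: list,             # frames the question actually refers to
    distractors: Sequence[list],  # needs at least num_slots - 1 padding clips
    question: str,
    gold: str,
    num_slots: int = 5,
) -> List[float]:
    scores = []
    for slot in range(num_slots):
        pool = iter(distractors)
        context: list = []
        for s in range(num_slots):
            # Place the probe at the current slot; fill the rest with padding.
            context.extend(probe_clip if s == slot else next(pool))
        pred = answer_fn(context, question)
        scores.append(float(pred.strip().lower() == gold.strip().lower()))
    return scores  # e.g. [1.0, 1.0, 0.0, 0.0, 0.0] hints at head bias
```

Averaged over many items, these per-slot scores yield the per-position accuracy curve that any bias analysis would operate on.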

πŸ“ Abstract
Large video language models (LVLMs) have made notable progress in video understanding, spurring the development of corresponding evaluation benchmarks. However, existing benchmarks generally assess overall performance across entire video sequences, overlooking nuanced behaviors such as contextual positional bias, a critical yet under-explored aspect of LVLM performance. We present Video-LevelGauge, a dedicated benchmark designed to systematically assess positional bias in LVLMs. We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types to simulate diverse real-world scenarios. In addition, we introduce a comprehensive analysis method that combines statistical measures with morphological pattern recognition to characterize bias. Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions, validated for their effectiveness in exposing positional bias. Based on these, we evaluate 27 state-of-the-art LVLMs, including both commercial and open-source models. Our findings reveal significant positional biases in many leading open-source models, typically exhibiting head or neighbor-content preferences. In contrast, commercial models such as Gemini 2.5 Pro show impressive, consistent performance across entire video sequences. Further analyses on context length, context variation, and model scale provide actionable insights for mitigating bias and guiding model enhancement.
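At its simplest, the abstract's pairing of statistical measures with morphological pattern recognition can be read as summary statistics over the per-position accuracy curve plus a check on the curve's shape. The sketch below only illustrates that reading; the thresholds, labels, and heuristics are assumptions, not the paper's actual analysis method.

```python
# Toy characterization of a per-position accuracy curve: spread as the
# statistical bias magnitude, plus a crude shape ("morphological") label.
# Thresholds and labels are illustrative assumptions.
import statistics

def characterize_bias(acc_by_pos: list[float], flat_tol: float = 0.03) -> dict:
    mean = statistics.fmean(acc_by_pos)
    spread = max(acc_by_pos) - min(acc_by_pos)
    if spread <= flat_tol:
        shape = "flat"  # positionally invariant behavior
    else:
        peak = acc_by_pos.index(max(acc_by_pos))
        if peak == 0:
            shape = "head"    # early context preferred
        elif peak == len(acc_by_pos) - 1:
            shape = "tail"    # content nearest the query preferred
        else:
            shape = "middle"
    return {"mean": mean, "spread": spread, "shape": shape}

print(characterize_bias([0.82, 0.71, 0.65, 0.60, 0.58]))
# -> {'mean': 0.672, 'spread': ~0.24, 'shape': 'head'}
```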
Problem

Research questions and friction points this paper is trying to address.

Assessing positional bias in large video language models
Evaluating contextual influence across video sequences
Identifying performance variations based on probe position
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized probes and contextual setups
Statistical and morphological pattern recognition analysis
Manually curated videos with multiple question types
πŸ”Ž Similar Papers
H
Hou Xia
University of Science and Technology of China, Hefei, China
Zheren Fu
Zheren Fu
University of Science and Technology of China
Multi-modal LearningVision-Language ModelAI Security
F
Fangcan Ling
University of Science and Technology of China, Hefei, China
J
Jiajun Li
HUAWEI, Shanghai, China
Yi Tu
Yi Tu
Ant Group
Computer VisionDocument UnderstandingVision Language Model
Zhendong Mao
Zhendong Mao
University of Science and Technology of China
CV,NLP
Y
Yongdong Zhang
University of Science and Technology of China, Hefei, China