🤖 AI Summary
This study investigates whether Video Large Language Models (VideoLLMs) can reliably link subjects to events across time when a video contains unrelated distractor segments, such as advertisements. To this end, the authors introduce DistractionBench, a benchmark that uses controlled interventions (for example, inserting short advertisement clips into longer videos) to expose and characterize “bag-of-events” (BoE) behavior, in which a model treats a video as a collection of events rather than a temporally structured sequence. Experiments on 11 popular VideoLLMs show that all of them exhibit substantial BoE behavior: they frequently attribute actions from the injected clips to subjects in the main video, which suggests that they lack reliable mechanisms for temporal grounding.
📝 Abstract
A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in the presence of unrelated video segments. Through controlled interventions, such as inserting short advertisement clips into longer videos, we show that VideoLLMs frequently hallucinate interactions between entities from different segments, incorrectly attributing actions from injected advertisements to subjects in the main video. We characterize this systematic hallucination as bag-of-events (BoE) behavior, where models process videos as collections of events rather than temporally structured sequences. Evaluating 11 popular VideoLLMs, we find that all models exhibit substantial BoE behavior. Our findings suggest that VideoLLMs lack reliable mechanisms for temporal grounding and motivate the development of models with more robust subject-event association.
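To make the intervention concrete, the following is a minimal sketch of how a distractor clip might be injected and how cross-segment misattributions could be counted. It is an illustration under stated assumptions, not the authors' implementation: the `Segment` type, the function names, and the rule that a claim is a misattribution when its subject and event never share a segment are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical annotation for one contiguous video segment."""
    source: str          # "main" or "distractor" (assumed labels)
    subjects: frozenset  # entities visible in the segment
    events: frozenset    # actions that occur in the segment


def inject(main, distractor, position):
    """Return a new segment list with the distractor inserted at `position`."""
    if not 0 <= position <= len(main):
        raise ValueError("position must lie within the main video")
    return main[:position] + [distractor] + main[position:]


def is_cross_segment_claim(video, subject, event):
    """True if subject and event both occur in the video but never together."""
    subject_seen = any(subject in s.subjects for s in video)
    event_seen = any(event in s.events for s in video)
    together = any(subject in s.subjects and event in s.events for s in video)
    return subject_seen and event_seen and not together


def misattribution_rate(video, affirmed_claims):
    """Fraction of model-affirmed (subject, event) claims that cross segments."""
    if not affirmed_claims:
        return 0.0
    crossed = sum(is_cross_segment_claim(video, s, e) for s, e in affirmed_claims)
    return crossed / len(affirmed_claims)


# Toy example: a cooking video with a soda advertisement inserted in the middle.
main_video = [
    Segment("main", frozenset({"chef"}), frozenset({"chops vegetables"})),
    Segment("main", frozenset({"chef"}), frozenset({"plates dish"})),
]
ad = Segment("distractor", frozenset({"athlete"}), frozenset({"drinks soda"}))
video = inject(main_video, ad, position=1)

# Claims a hypothetical model affirmed; the second mixes main and ad content.
claims = [("chef", "chops vegetables"), ("chef", "drinks soda")]
rate = misattribution_rate(video, claims)
```

In this toy setup, a model that affirms "the chef drinks soda" is pairing a subject from the main video with an event from the injected clip, which is the kind of cross-segment error the abstract describes as bag-of-events behavior.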