🤖 AI Summary
It remains unclear whether current video-language models genuinely comprehend the temporal dynamics, motion, and semantics of videos. To address this question, the work proposes REVEAL—the first systematic diagnostic benchmark of its kind—comprising five controlled stress tests: reversed playback, false statement injection, spatiotemporal occlusion, camera motion simulation, and language shortcut interference. Coupled with an automated pipeline for generating controllable perturbations, REVEAL comprehensively evaluates model robustness in fundamental perception and reasoning. Experiments reveal that both leading open-source and proprietary models exhibit significant fragility on basic tasks—frequently misdescribing reversed videos, overlooking visual evidence, or accepting incorrect claims—whereas human participants perform these tasks with ease. These findings expose fundamental deficiencies in existing models’ video understanding capabilities.
📝 Abstract
This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests, assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information under simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.
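To make the perturbation idea concrete, here is a minimal sketch of what two of the stress tests could look like when a video is represented as a NumPy array of frames. The function names (`reverse_playback`, `spatiotemporal_occlusion`) and the zero-fill masking choice are illustrative assumptions, not the paper's released pipeline:

```python
import numpy as np

def reverse_playback(frames: np.ndarray) -> np.ndarray:
    """Reversed-playback perturbation: flip the temporal (first) axis."""
    return frames[::-1]

def spatiotemporal_occlusion(frames: np.ndarray,
                             t0: int, t1: int,
                             y0: int, y1: int,
                             x0: int, x1: int) -> np.ndarray:
    """Mask a spatiotemporal cube (frames t0:t1, region y0:y1 x x0:x1) with zeros."""
    out = frames.copy()
    out[t0:t1, y0:y1, x0:x1] = 0
    return out

# Toy "video": 8 grayscale 4x4 frames, each filled with its frame index,
# so temporal order is directly readable from pixel values.
video = np.stack([np.full((4, 4), t) for t in range(8)])

rev = reverse_playback(video)                          # frame 0 now holds value 7
occ = spatiotemporal_occlusion(video, 2, 5, 1, 3, 1, 3)  # cube zeroed mid-clip
```

A robust model should describe `rev` as running backward and should still aggregate the unmasked evidence in `occ`; the benchmark's finding is that current VidLMs often fail both.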