🤖 AI Summary
Current vision-language models (VLMs) exhibit limited contextual and spatial understanding in tasks requiring fine-grained temporal or geometric reasoning. To address this, the work introduces a synthetic video benchmark designed to evaluate joint reasoning about situational awareness (such as assessing whether a behavior is harmful or benign) and spatial awareness (tracking agent interactions and trajectory relationships). Leveraging minimal-pair designs, synthetic video generation, and stable color cues, the benchmark enables systematic, training-free diagnostic evaluation of mainstream VLMs. Experimental results reveal that existing models perform only marginally above random chance; while color cues mitigate agent confusion to some extent, fundamental reasoning deficiencies persist. The dataset and code are publicly released to encourage research into lightweight spatial priors for VLMs.
📝 Abstract
Spatial reasoning in vision-language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusion but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.