🤖 AI Summary
This study investigates the limitations of vision-language models (VLMs) in event-level visual reasoning, specifically their capacity for temporal ordering, causal inference, spatial reasoning, contextual understanding, and commonsense reasoning. To address the lack of dedicated benchmarks, the authors introduce SPLICE, the first multidimensional evaluation benchmark for event reasoning. Built upon the COIN dataset, it comprises 3,381 manually curated videos and centers on an event-segment reordering task, optionally augmented with textual descriptions, to systematically assess leading VLMs. Experimental results show that current VLMs rely heavily on linguistic priors and exhibit weak visual grounding: while they achieve moderate performance on everyday scenarios and on temporal and causal reasoning, their event-level spatiotemporal understanding falls far short of human performance, exposing a fundamental bottleneck. The work establishes and empirically validates a fine-grained, multidimensional evaluation paradigm designed specifically for event reasoning.
📝 Abstract
In this work, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset, designed to probe event-based reasoning across multiple dimensions: temporal, causal, spatial, contextual, and general knowledge. SPLICE includes 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, such as sports, engineering, and housework. These videos are segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art vision-language models (VLMs) on the task of rearranging these clips into coherent event sequences to assess visual reasoning capabilities. Results reveal a significant gap: VLMs struggle to match human performance. While human-annotated textual descriptions improve model accuracy, they do not affect human performance, suggesting that models rely more on language priors than on visual understanding. Even with annotations, VLMs fall short of human-level reasoning, underscoring persistent challenges in visual reasoning. A deeper analysis across sub-categories shows that VLMs perform relatively better on videos where temporal and causal reasoning are dominant, compared to those where contextual and spatial reasoning are dominant. They also perform better on everyday tasks than on specialized ones.
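The abstract does not specify how reordering predictions are scored, so the following is only an illustrative sketch of two plausible metrics for this kind of task: exact sequence match and pairwise ordering accuracy (the fraction of clip pairs placed in the correct relative order). The function names, clip identifiers, and metric choices here are assumptions for illustration, not the paper's evaluation code.

```python
from itertools import combinations

def exact_match(gold: list[str], pred: list[str]) -> float:
    """1.0 if the predicted clip sequence reproduces the gold order exactly, else 0.0."""
    return float(gold == pred)

def pairwise_order_accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of clip pairs whose relative order in `pred` matches `gold`.
    Assumes `pred` is a permutation of the clips in `gold`."""
    pos = {clip: i for i, clip in enumerate(pred)}          # predicted position of each clip
    pairs = list(combinations(gold, 2))                     # pairs in gold order (a before b)
    correct = sum(1 for a, b in pairs if pos[a] < pos[b])   # pair kept in the right order
    return correct / len(pairs) if pairs else 1.0

if __name__ == "__main__":
    # Hypothetical example: a four-clip event with one adjacent swap in the prediction.
    gold = ["clip_1", "clip_2", "clip_3", "clip_4"]
    pred = ["clip_1", "clip_3", "clip_2", "clip_4"]
    print(exact_match(gold, pred))              # 0.0
    print(pairwise_order_accuracy(gold, pred))  # 5/6 ≈ 0.833
```

A pairwise measure like this gives partial credit for nearly correct orderings, which is why it is a common companion to strict exact match when comparing model and human reordering performance.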