🤖 AI Summary
This study investigates the limitations of vision-language models (VLMs) in event-level visual reasoning, specifically their capacity for temporal ordering, causal inference, spatial reasoning, contextual understanding, and commonsense reasoning. To address the lack of dedicated benchmarks, the authors introduce SPLICE, the first multidimensional evaluation benchmark for event reasoning. Built upon the COIN dataset, it comprises 3,381 manually curated videos and centers on an event-segment reordering task, optionally augmented with textual descriptions, to systematically assess leading VLMs. Experimental results show that current VLMs rely heavily on linguistic priors and exhibit weak visual grounding: while they achieve moderate performance on everyday scenarios and on temporal and causal reasoning, their event-level spatiotemporal understanding falls far short of human performance, exposing a fundamental bottleneck. The work establishes and empirically validates a fine-grained, multidimensional evaluation paradigm designed specifically for event reasoning.
📝 Abstract
In this work, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset, designed to probe event-based reasoning across multiple dimensions: temporal, causal, spatial, contextual, and general knowledge. SPLICE includes 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, such as sports, engineering, and housework. These videos are segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art vision-language models (VLMs) on the task of rearranging these clips into coherent event sequences to assess visual reasoning capabilities. Results reveal a significant gap: VLMs struggle to match human performance. While human-annotated textual descriptions improve model accuracy, they do not affect human performance, suggesting that models rely more on language priors than on visual understanding. Even with annotations, VLMs fall short of human-level reasoning, underscoring persistent challenges in visual reasoning. A deeper analysis across sub-categories shows that VLMs perform relatively better on videos where temporal and causal reasoning are dominant, compared to those where contextual and spatial reasoning are dominant. They also perform better on everyday tasks than on specialized ones.
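The abstract does not specify how reordering predictions are scored, so the following is only an illustrative sketch of two plausible metrics for this kind of task: exact sequence match and pairwise ordering accuracy (the fraction of clip pairs placed in the correct relative order). The function names, clip identifiers, and metric choices here are assumptions for illustration, not the paper's evaluation code.

```python
from itertools import combinations

def exact_match(gold: list[str], pred: list[str]) -> float:
    """1.0 if the predicted clip sequence reproduces the gold order exactly, else 0.0."""
    return float(gold == pred)

def pairwise_order_accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of clip pairs whose relative order in `pred` matches `gold`.
    Assumes `pred` is a permutation of the clips in `gold`."""
    pos = {clip: i for i, clip in enumerate(pred)}          # predicted position of each clip
    pairs = list(combinations(gold, 2))                     # pairs in gold order (a before b)
    correct = sum(1 for a, b in pairs if pos[a] < pos[b])   # pair kept in the right order
    return correct / len(pairs) if pairs else 1.0

if __name__ == "__main__":
    # Hypothetical example: a four-clip event with one adjacent swap in the prediction.
    gold = ["clip_1", "clip_2", "clip_3", "clip_4"]
    pred = ["clip_1", "clip_3", "clip_2", "clip_4"]
    print(exact_match(gold, pred))              # 0.0
    print(pairwise_order_accuracy(gold, pred))  # 5/6 ≈ 0.833
```

A pairwise measure like this gives partial credit for nearly correct orderings, which is why it is a common companion to strict exact match when comparing model and human reordering performance.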