Solving Spatial Supersensing Without Spatial Supersensing

📅 2025-11-20
🤖 AI Summary
This work identifies a critical evaluation flaw in the VSI-Super benchmarks for spatial supersensing (VSR/VSC): state-of-the-art performance can be closely approximated by methods that ignore spatiotemporal video structure. To expose this, we introduce NoSense—a baseline using only a bag-of-words SigLIP model—that achieves 95% VSR accuracy even on 4-hour videos. We further propose VSC-Repeat, a diagnostic perturbation that repeats each video to surface reliance on implicit dataset shortcuts (e.g., rooms are never revisited). Under VSC-Repeat, Cambrian-S's mean relative VSC accuracy collapses from 42% to 0%, indicating that its gains stem from shortcut exploitation rather than genuine spatial reasoning. Together, these results show that the current benchmarks fail to distinguish true spatial supersensing from spurious correlation matching, and they provide both a diagnostic methodology and empirical evidence for building robust, shortcut-resistant evaluation frameworks.

📝 Abstract
Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling, or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: we concatenate each video with itself 1–5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, the Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark: rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) the predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than through robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
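The key property of a bag-of-words baseline like NoSense is permutation invariance: if each frame is embedded independently (the paper uses a SigLIP image encoder) and the embeddings are pooled without regard to order, no temporal or spatial integration can be what drives accuracy. A minimal sketch of that invariance, with a toy `embed_frame` stub standing in for the real SigLIP encoder (the function name and mean pooling are illustrative assumptions, not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_frame(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a SigLIP image encoder: any deterministic
    frame -> vector map suffices to illustrate the point."""
    return np.tanh(frame.reshape(-1)[:64])  # toy 64-d embedding

def bag_of_frames(video: list[np.ndarray]) -> np.ndarray:
    """Order-free pooling: average the frame embeddings and normalize.
    Shuffling or reversing the video leaves the result unchanged."""
    emb = np.mean([embed_frame(f) for f in video], axis=0)
    return emb / np.linalg.norm(emb)

video = [rng.normal(size=(8, 8)) for _ in range(10)]
shuffled = list(video)
rng.shuffle(shuffled)

# Identical representation under any frame permutation: whatever
# accuracy such a model reaches cannot come from temporal structure.
assert np.allclose(bag_of_frames(video), bag_of_frames(shuffled))
```

Because the pooled vector is the same for any frame ordering, high VSR accuracy from such a model is evidence about the benchmark, not about spatial cognition.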
Problem

Research questions and friction points this paper is trying to address.

Current spatial supersensing benchmarks fail to reliably measure spatial cognition
Benchmark solutions exploit shortcut heuristics rather than genuine spatial reasoning
Tailored inference methods inadvertently exploit dataset biases instead of performing supersensing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced NoSense baseline using bag-of-words SigLIP
Designed VSC-Repeat test by concatenating videos repeatedly
Revealed inference exploits shortcuts not spatial supersensing
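The VSC-Repeat construction is just self-concatenation: repeating a video k times leaves the set of unique objects unchanged, so a counter that truly integrates across views should return the same answer, while a streaming counter that assumes rooms are never revisited inflates by a factor of k. A toy illustration (object labels stand in for detections; the segment-wise counter is a hypothetical simplification of the shortcut, not Cambrian-S's actual inference algorithm):

```python
def vsc_repeat(video: list[str], k: int) -> list[str]:
    """Concatenate a video with itself k times (k=1 is the original)."""
    return video * k

def scene_aware_count(video: list[str]) -> int:
    """Integrates across the whole stream: counts unique objects."""
    return len(set(video))

def segmentwise_count(video: list[str], segment_len: int) -> int:
    """Shortcut heuristic: treat each segment as a new room and sum
    per-segment unique counts. Correct only if scenes never repeat."""
    return sum(
        len(set(video[i:i + segment_len]))
        for i in range(0, len(video), segment_len)
    )

video = ["chair", "lamp", "chair", "sofa"]  # 3 unique objects
for k in range(1, 6):
    repeated = vsc_repeat(video, k)
    assert scene_aware_count(repeated) == 3            # invariant under repetition
    assert segmentwise_count(repeated, len(video)) == 3 * k  # grows with k
```

The ground-truth answer is invariant under this perturbation by construction, which is what makes the collapse from 42% to 0% diagnostic of a shortcut rather than of a harder task.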