Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study

๐Ÿ“… 2025-10-24
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Current visual-spatial intelligence (VSI) evaluation relies heavily on text-based prompts and VQA-style scoring, which introduces linguistic shortcuts and obscures geometric relationships, hindering faithful attribution of spatial reasoning capability. Method: We propose the Spatial Intelligence Grid (SIG)โ€”a grid-based, structured representation that jointly encodes object layouts, geometric relations, and physical priors to support embodied spatial reasoningโ€”and introduce SIGBench, the first autonomous-driving-oriented benchmark featuring 1,400 real-world frames with expert SIG annotations and human gaze trajectories for language-agnostic, human-aligned attention evaluation. Contribution/Results: SIG decouples linguistic and spatial competencies for the first time. In few-shot settings, it significantly improves stability, interpretability, and performance consistency of spatial understanding across GPT and Gemini models. This work establishes a new paradigm and standard for VSI modeling and evaluation.

๐Ÿ“ Abstract
How to integrate and verify spatial intelligence in foundation models remains an open challenge. Current practice often proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens attribution to genuinely spatial skills. We introduce Spatial Intelligence Grid (SIG): a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors. As a complementary channel to text, SIG provides a faithful, compositional representation of scene structure for foundation-model reasoning. Building on SIG, we derive SIG-informed evaluation metrics that quantify a model's intrinsic VSI, which separates spatial capability from language priors. In few-shot in-context learning with state-of-the-art multimodal LLMs (e.g. GPT- and Gemini-family models), SIG yields consistently larger, more stable, and more comprehensive gains across all VSI metrics compared to VQA-only representations, indicating its promise as a data-labeling and training schema for learning VSI. We also release SIGBench, a benchmark of 1.4K driving frames annotated with ground-truth SIG labels and human gaze traces, supporting both grid-based machine VSI tasks and attention-driven, human-like VSI tasks in autonomous-driving scenarios.
Problem

Research questions and friction points this paper is trying to address.

Integrating spatial intelligence into foundation models remains an open challenge
Current methods obscure geometry and weaken genuine spatial skill attribution
A structured grid schema is needed to explicitly encode object layouts and relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Intelligence Grid encodes object layouts and relations
SIG provides compositional scene structure representation for reasoning
SIG-informed metrics separate spatial capability from language priors
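The abstract describes SIG as a grid-based schema that jointly encodes object layouts, inter-object relations, and physical priors, but the paper's exact schema is not reproduced on this page. As a minimal, hypothetical sketch (the names `SIGCell`, `SIGrid`, and `relation`, and the heading field, are invented here for illustration), a grid of cells could carry object labels plus physical priors, with coarse spatial relations derived directly from cell geometry rather than from text:

```python
from dataclasses import dataclass, field

# Hypothetical SIG-like structure: all names and fields below are
# illustrative assumptions, not the paper's actual schema.
@dataclass
class SIGCell:
    obj: str              # object label occupying this cell, e.g. "car"
    heading: float = 0.0  # physical prior: heading angle in degrees

@dataclass
class SIGrid:
    rows: int
    cols: int
    cells: dict = field(default_factory=dict)  # (row, col) -> SIGCell

    def place(self, r, c, obj, heading=0.0):
        self.cells[(r, c)] = SIGCell(obj, heading)

    def relation(self, a, b):
        """Coarse spatial relation of object b relative to object a,
        read off the grid geometry (no language involved)."""
        ra, ca = next(k for k, v in self.cells.items() if v.obj == a)
        rb, cb = next(k for k, v in self.cells.items() if v.obj == b)
        vert = "ahead of" if rb < ra else "behind" if rb > ra else ""
        horiz = "left of" if cb < ca else "right of" if cb > ca else ""
        return " and ".join(p for p in (vert, horiz) if p) or "co-located with"

# Toy driving frame: lower rows are closer to the ego vehicle.
grid = SIGrid(rows=5, cols=5)
grid.place(4, 2, "ego")
grid.place(1, 2, "truck")
grid.place(4, 4, "cyclist")
print(grid.relation("ego", "truck"))    # truck is ahead of the ego vehicle
print(grid.relation("ego", "cyclist"))  # cyclist is to the ego vehicle's right
```

Because relations are computed from grid coordinates, such a representation stays language-agnostic: a model queried against it must get the geometry right rather than exploit linguistic shortcuts, which is the separation of spatial capability from language priors that the SIG-informed metrics aim for.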
๐Ÿ”Ž Similar Papers
No similar papers found.
Guanlin Wu, Johns Hopkins University (Computer Vision, MLLMs)
Boyan Su, Johns Hopkins University
Yang Zhao, Johns Hopkins University
Pu Wang, Johns Hopkins University
Yichen Lin, Johns Hopkins University
Hao Frank Yang, Assistant Professor, Johns Hopkins University, CaSE and DSAI (Representation Learning, Computing Systems, Decision Making, Transportation)