Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study

๐Ÿ“… 2025-10-24
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Current visual-spatial intelligence (VSI) evaluation relies heavily on text-based prompts and VQA-style scoring, which introduces linguistic shortcuts and obscures geometric relationships, hindering faithful attribution of spatial reasoning capability. Method: We propose the Spatial Intelligence Grid (SIG)โ€”a grid-based, structured representation that jointly encodes object layouts, geometric relations, and physical priors to support embodied spatial reasoningโ€”and introduce SIGBench, the first autonomous-driving-oriented benchmark featuring 1,400 real-world frames with expert SIG annotations and human gaze trajectories for language-agnostic, human-aligned attention evaluation. Contribution/Results: SIG decouples linguistic and spatial competencies for the first time. In few-shot settings, it significantly improves stability, interpretability, and performance consistency of spatial understanding across GPT and Gemini models. This work establishes a new paradigm and standard for VSI modeling and evaluation.

๐Ÿ“ Abstract
How to integrate and verify spatial intelligence in foundation models remains an open challenge. Current practice often proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens attribution to genuinely spatial skills. We introduce Spatial Intelligence Grid (SIG): a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors. As a complementary channel to text, SIG provides a faithful, compositional representation of scene structure for foundation-model reasoning. Building on SIG, we derive SIG-informed evaluation metrics that quantify a model's intrinsic VSI, which separates spatial capability from language priors. In few-shot in-context learning with state-of-the-art multimodal LLMs (e.g. GPT- and Gemini-family models), SIG yields consistently larger, more stable, and more comprehensive gains across all VSI metrics compared to VQA-only representations, indicating its promise as a data-labeling and training schema for learning VSI. We also release SIGBench, a benchmark of 1.4K driving frames annotated with ground-truth SIG labels and human gaze traces, supporting both grid-based machine VSI tasks and attention-driven, human-like VSI tasks in autonomous-driving scenarios.
Problem

Research questions and friction points this paper is trying to address.

Integrating spatial intelligence into foundation models remains an open challenge
Current methods obscure geometry and weaken genuine spatial skill attribution
A structured grid schema is needed to explicitly encode object layouts and relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Intelligence Grid encodes object layouts and relations
SIG provides compositional scene structure representation for reasoning
SIG-informed metrics separate spatial capability from language priors
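The abstract describes SIG as a grid-based schema that jointly encodes object layouts, inter-object relations, and physical priors, but the paper's exact schema is not reproduced on this page. As a minimal, hypothetical sketch (the names `SIGCell`, `SIGrid`, and `relation`, and the heading field, are invented here for illustration), a grid of cells could carry object labels plus physical priors, with coarse spatial relations derived directly from cell geometry rather than from text:

```python
from dataclasses import dataclass, field

# Hypothetical SIG-like structure: all names and fields below are
# illustrative assumptions, not the paper's actual schema.
@dataclass
class SIGCell:
    obj: str              # object label occupying this cell, e.g. "car"
    heading: float = 0.0  # physical prior: heading angle in degrees

@dataclass
class SIGrid:
    rows: int
    cols: int
    cells: dict = field(default_factory=dict)  # (row, col) -> SIGCell

    def place(self, r, c, obj, heading=0.0):
        self.cells[(r, c)] = SIGCell(obj, heading)

    def relation(self, a, b):
        """Coarse spatial relation of object b relative to object a,
        read off the grid geometry (no language involved)."""
        ra, ca = next(k for k, v in self.cells.items() if v.obj == a)
        rb, cb = next(k for k, v in self.cells.items() if v.obj == b)
        vert = "ahead of" if rb < ra else "behind" if rb > ra else ""
        horiz = "left of" if cb < ca else "right of" if cb > ca else ""
        return " and ".join(p for p in (vert, horiz) if p) or "co-located with"

# Toy driving frame: lower rows are closer to the ego vehicle.
grid = SIGrid(rows=5, cols=5)
grid.place(4, 2, "ego")
grid.place(1, 2, "truck")
grid.place(4, 4, "cyclist")
print(grid.relation("ego", "truck"))    # truck is ahead of the ego vehicle
print(grid.relation("ego", "cyclist"))  # cyclist is to the ego vehicle's right
```

Because relations are computed from grid coordinates, such a representation stays language-agnostic: a model queried against it must get the geometry right rather than exploit linguistic shortcuts, which is the separation of spatial capability from language priors that the SIG-informed metrics aim for.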
๐Ÿ”Ž Similar Papers
No similar papers found.
Guanlin Wu, Johns Hopkins University (Computer Vision, MLLMs)
Boyan Su, Johns Hopkins University
Yang Zhao, Johns Hopkins University
Pu Wang, Johns Hopkins University
Yichen Lin, Johns Hopkins University
Hao Frank Yang, Assistant Professor, Johns Hopkins University, CaSE and DSAI (Representation Learning, Computing Systems, Decision Making, Transportation)