Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions

πŸ“… 2026-01-07
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work investigates whether large language models (LLMs) possess spatial intelligence in the absence of pixel-based inputs, disentangling whether their spatial understanding stems from visual encoders or intrinsic reasoning capabilities of the language model itself. To this end, we introduce SiT-Bench, a novel benchmark that transforms multi-view 3D scenes into high-fidelity, coordinate-aware textual descriptions, enabling the first systematic evaluation of purely text-driven spatial reasoning. The benchmark comprises over 3,800 expert-annotated samples spanning five major categories and seventeen subtasks. Our findings reveal that while LLMs perform well on local semantic tasks, they exhibit a pronounced β€œspatial gap” in maintaining global spatial consistency. Importantly, incorporating explicit spatial reasoning mechanisms effectively unlocks their latent world-modeling capacity, leading to significant performance gains.

πŸ“ Abstract
Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or from the fundamental reasoning backbone? Motivated by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. It comprises over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single- and multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results of state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset, SiT-Bench, serves as a foundational resource to foster the development of spatially grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at https://github.com/binisalegend/SiT-Bench.
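To make the scene-to-text idea concrete, here is a minimal illustrative sketch (not the authors' actual pipeline) of how a toy 3D scene could be rendered into a coordinate-aware textual description of the kind an LLM would then reason over. The object names, axis conventions, and relation rules below are all invented for illustration.

```python
# Illustrative sketch: converting a toy 3D scene into a coordinate-aware
# textual description. All names and conventions here are assumptions,
# not taken from SiT-Bench itself.
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    x: float  # assumed convention: +x = viewer's right
    y: float  # assumed convention: +y = farther from the viewer
    z: float  # assumed convention: +z = up

def relation(a: SceneObject, b: SceneObject) -> str:
    """Coarse egocentric relation of object a with respect to object b."""
    parts = []
    if a.x < b.x:
        parts.append("to the left of")
    elif a.x > b.x:
        parts.append("to the right of")
    if a.y < b.y:
        parts.append("in front of")
    elif a.y > b.y:
        parts.append("behind")
    return " and ".join(parts) or "at the same position as"

def describe(scene: list[SceneObject]) -> str:
    # Emit absolute coordinates, then pairwise spatial relations.
    lines = [f"The {o.name} is at ({o.x:.1f}, {o.y:.1f}, {o.z:.1f})."
             for o in scene]
    for i, a in enumerate(scene):
        for b in scene[i + 1:]:
            lines.append(f"The {a.name} is {relation(a, b)} the {b.name}.")
    return " ".join(lines)

scene = [SceneObject("chair", 1.0, 2.0, 0.0),
         SceneObject("table", 2.5, 1.0, 0.0)]
print(describe(scene))
```

A description like this preserves both local semantics (what each object is) and global geometry (where everything sits), which is exactly the split the benchmark's "spatial gap" finding concerns.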
Problem

Research questions and friction points this paper is trying to address.

Spatial Intelligence
Large Language Models
Textual Reasoning
Vision-Language Models
Spatial Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Intelligence
Large Language Models
Text-only Benchmark
Symbolic Reasoning
World Modeling
πŸ”Ž Similar Papers
No similar papers found.