Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

📅 2026-05-07
🤖 AI Summary
This work addresses a limitation of existing vision-language models (VLMs) in spatial reasoning: their latent representation of three-dimensional scene topology is overshadowed by non-geometric semantic features such as color and shape. The study uncovers and models the implicit 3D topological structure within VLMs and introduces a theoretically grounded regularization approach. The method uses cross-scene linear probes to extract the latent spatial subspace, relates it to the Laplacian eigenmaps of a Gaussian-kernel graph over the scene, and constrains it with a Dirichlet energy term. With only 500 steps of fine-tuning on synthetic data, the proposed technique outperforms standard fine-tuning and competitive baselines by up to 12.1% on real-world spatial reasoning tasks, improving the model's spatial generalization.
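To make the probing step concrete, here is a minimal sketch of what a cross-scene linear probe could look like. It is not the authors' implementation: the function names (`fit_linear_probe`, `spatial_basis`, `project`), the ridge penalty, and the choice of regressing 3D object coordinates from pooled hidden features are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_probe(features, coords, ridge=1e-3):
    """Ridge-regress 3D coordinates from hidden features pooled across scenes.

    features: (n, d) array of hidden states (hypothetical input).
    coords:   (n, 3) array of ground-truth 3D positions.
    Returns the probe weights W of shape (d, 3) and the feature mean (d,).
    """
    mu = features.mean(axis=0)
    X = features - mu                      # center the features
    Y = coords - coords.mean(axis=0)       # center the targets
    d = X.shape[1]
    # Closed-form ridge solution: (X^T X + ridge * I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)
    return W, mu

def spatial_basis(W):
    """Orthonormal basis (d, 3) for the subspace spanned by the probe directions."""
    Q, _ = np.linalg.qr(W)
    return Q

def project(features, mu, Q):
    """Coordinates of the centered features inside the probed subspace."""
    return (features - mu) @ Q
```

Pooling samples from many scenes before fitting is what makes the probe "cross-scene" in this sketch: directions that only encode scene-specific appearance should not help predict position consistently across scenes.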
📝 Abstract
Decades of cognitive science establish that humans navigate environments by forming cognitive maps, defined as allocentric and topology-preserving representations of 3D space. While modern Vision-Language Models (VLMs) demonstrate emergent spatial reasoning from 2D egocentric inputs, it remains unclear whether they construct an analogous 3D internal representation. In this paper, we demonstrate that current VLMs do possess a latent topological map of 3D scenes, but it is heavily overshadowed by non-geometric visual semantics, such as color and shape. Through cross-scene linear feature extraction, we isolate a clean spatial subspace that causally controls the model's spatial outputs. We mathematically shape this latent representation and prove its correspondence to the Laplacian eigenmaps of the scene's 3D Gaussian-kernel graph, converging to the physical 3D space in the continuous limit. Motivated by this geometric identification, we further introduce a mathematically principled latent regularization method for VLMs, based on Dirichlet energy. Applying this single-term regularizer to a minimal 500-step supervised fine-tuning (SFT) of a VLM on simple synthetic data yields significant improvements on real-world spatial benchmarks, outperforming standard SFT and competitive baselines by up to 12.1% in spatial tasks involving scene topology understanding. Source code is available at https://github.com/pittisl/vlm-latent-shaping
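The graph quantities named in the abstract can be sketched with their standard textbook definitions. The snippet below is an illustrative assumption, not the paper's loss: it builds a Gaussian-kernel graph over 3D points, forms the unnormalized Laplacian L = D - W, computes the Dirichlet energy tr(Z^T L Z) of an embedding Z, and returns the Laplacian eigenmaps. The function names, the bandwidth `sigma`, and the use of the unnormalized Laplacian are all choices made here for illustration.

```python
import numpy as np

def gaussian_kernel_graph(points, sigma=1.0):
    """Dense affinity matrix W_ij = exp(-||p_i - p_j||^2 / (2 sigma^2)), zero diagonal."""
    sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)               # no self-loops
    return W

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W, with D the diagonal degree matrix."""
    return np.diag(W.sum(axis=1)) - W

def dirichlet_energy(Z, L):
    """tr(Z^T L Z), which equals 0.5 * sum_ij W_ij * ||z_i - z_j||^2."""
    return float(np.trace(Z.T @ L @ Z))

def laplacian_eigenmaps(L, k=3):
    """Eigenvectors of L for the k smallest eigenvalues after the constant one."""
    _, vecs = np.linalg.eigh(L)            # eigenvalues come back in ascending order
    return vecs[:, 1:k + 1]
```

In this form, a low Dirichlet energy means that points that are close in 3D (large W_ij) also have nearby latent vectors, which is the sense in which such a term can act as a smoothness regularizer on a latent subspace during fine-tuning.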
Problem

Research questions and friction points this paper is trying to address.

3D scene topology
Vision-Language Models
latent representation
spatial reasoning
cognitive maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent representation
3D scene topology
Vision-Language Models
Dirichlet energy
Laplacian eigenmaps