Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing grid-based cognitive maps suffer from discretization artifacts, limiting fine-grained spatial reasoning. To address this, we propose a Continuous Boundary Coordinate Cognitive Map framework that models object positions, dimensions, and spatial relations in a metric-grounded continuous space, enabling precise quantitative spatial computation. We introduce QVS-Bench, a novel benchmark for systematically analyzing the relationship between the number of input images and spatial reasoning accuracy. Leveraging the AI2THOR simulator, we curate a high-quality dataset and adopt a two-stage training strategy: supervised fine-tuning followed by reinforcement fine-tuning. Our approach jointly optimizes continuous coordinate regression and vision-language alignment to achieve accurate spatial reconstruction. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our V2LO-7B model achieves an average 4.92% improvement over grid-based methods, demonstrating significantly enhanced fine-grained spatial understanding.

📝 Abstract
Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model's capacity for fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object sizes. This equips the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity of describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model's real-world generalization capabilities. To systematically evaluate how the number of input images affects both cognitive map accuracy and spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze these mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
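The abstract describes storing each object as continuous boundary coordinates so that distances and sizes can be computed metrically. The paper's exact map schema is not given here; the following is a minimal sketch assuming axis-aligned boxes `(xmin, ymin, xmax, ymax)` in meters, with the `sofa`/`table` layout invented for illustration.

```python
# Hypothetical metric-grounded layout entries: each object is an
# axis-aligned box of continuous boundary coordinates, in meters.

def size(box):
    """Width and depth of an object from its boundary coordinates."""
    xmin, ymin, xmax, ymax = box
    return (xmax - xmin, ymax - ymin)

def boundary_gap(box_a, box_b):
    """Closest edge-to-edge distance between two boxes (0 if they overlap)."""
    gx = max(box_a[0] - box_b[2], box_b[0] - box_a[2], 0.0)
    gy = max(box_a[1] - box_b[3], box_b[1] - box_a[3], 0.0)
    return (gx ** 2 + gy ** 2) ** 0.5

sofa = (0.0, 0.0, 2.0, 0.9)   # invented example layout
table = (2.5, 0.1, 3.3, 0.8)

print(size(sofa))                           # (2.0, 0.9)
print(round(boundary_gap(sofa, table), 2))  # 0.5
```

Because boundaries are continuous rather than cell indices, inter-object gaps and object sizes come out as exact metric quantities, which is the kind of quantitative spatial computation the abstract attributes to the framework.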
Problem

Research questions and friction points this paper is trying to address.

Overcoming limitations of discretized raster representations in spatial reasoning
Reconstructing metric-grounded spatial layouts from video inputs
Enabling quantitative spatial computation to reduce natural language ambiguity
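The first friction point, discretization artifacts in raster maps, can be illustrated with a toy example. The grid resolution and object positions below are invented; the paper does not specify a cell size.

```python
# Toy illustration of why a discretized grid map loses fine-grained
# distances, while continuous coordinates preserve them.

CELL = 1.0  # assumed grid resolution in meters

def to_grid(x, y):
    """Snap a continuous position to its grid cell index."""
    return (int(x // CELL), int(y // CELL))

# Two objects 0.3 m apart, but inside the same 1 m grid cell.
a, b = (2.1, 3.2), (2.4, 3.2)

# Grid map: both objects collapse to one cell, so their apparent distance is 0.
assert to_grid(*a) == to_grid(*b)

# Continuous coordinates retain the true metric distance.
dist = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(round(dist, 2))  # 0.3
```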
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses continuous object boundary coordinates for spatial layout
Employs supervised fine-tuning with AI2THOR dataset
Applies reinforcement fine-tuning for real-world generalization
Yibin Huang
Faculty of Computing, Harbin Institute of Technology
Wang Xu
Harbin Institute of Technology
natural language processing, artificial intelligence
Wanyue Zhang
Max Planck Institute for Informatics
Video Generation, Animation, Human Scene Interaction, Robotics
Helu Zhi
Faculty of Computing, Harbin Institute of Technology
Jingjing Huang
Tsinghua University
Yangbin Xu
Institute of Microelectronics of the Chinese Academy of Sciences
Yangang Sun
Tsinghua University
Conghui Zhu
Faculty of Computing, Harbin Institute of Technology
Tiejun Zhao
Faculty of Computing, Harbin Institute of Technology