Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing grid-based cognitive maps suffer from discretization artifacts, limiting fine-grained spatial reasoning. To address this, we propose a Continuous Boundary Coordinate Cognitive Map framework that models object positions, dimensions, and spatial relations in a metric-grounded continuous space, enabling precise quantitative spatial computation. We introduce QVS-Bench, a novel benchmark for systematically analyzing the relationship between the number of input images and spatial reasoning accuracy. Leveraging the AI2THOR simulator, we curate a high-quality dataset and adopt a two-stage training strategy: supervised fine-tuning followed by reinforcement fine-tuning. Our approach jointly optimizes continuous coordinate regression and vision-language alignment to achieve accurate spatial reconstruction. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our V2LO-7B model achieves an average 4.92% improvement over grid-based methods, demonstrating significantly enhanced fine-grained spatial understanding.

📝 Abstract
Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model's capacity for fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object sizes. This equips the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity of describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model's real-world generalization capabilities. To systematically evaluate how the number of input images affects both cognitive map accuracy and spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze these mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
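The abstract describes storing each object as continuous boundary coordinates so that distances and sizes can be computed metrically. The paper's exact map schema is not given here; the following is a minimal sketch assuming axis-aligned boxes `(xmin, ymin, xmax, ymax)` in meters, with the `sofa`/`table` layout invented for illustration.

```python
# Hypothetical metric-grounded layout entries: each object is an
# axis-aligned box of continuous boundary coordinates, in meters.

def size(box):
    """Width and depth of an object from its boundary coordinates."""
    xmin, ymin, xmax, ymax = box
    return (xmax - xmin, ymax - ymin)

def boundary_gap(box_a, box_b):
    """Closest edge-to-edge distance between two boxes (0 if they overlap)."""
    gx = max(box_a[0] - box_b[2], box_b[0] - box_a[2], 0.0)
    gy = max(box_a[1] - box_b[3], box_b[1] - box_a[3], 0.0)
    return (gx ** 2 + gy ** 2) ** 0.5

sofa = (0.0, 0.0, 2.0, 0.9)   # invented example layout
table = (2.5, 0.1, 3.3, 0.8)

print(size(sofa))                           # (2.0, 0.9)
print(round(boundary_gap(sofa, table), 2))  # 0.5
```

Because boundaries are continuous rather than cell indices, inter-object gaps and object sizes come out as exact metric quantities, which is the kind of quantitative spatial computation the abstract attributes to the framework.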
Problem

Research questions and friction points this paper is trying to address.

Overcoming limitations of discretized raster representations in spatial reasoning
Reconstructing metric-grounded spatial layouts from video inputs
Enabling quantitative spatial computation to reduce natural language ambiguity
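The first friction point, discretization artifacts in raster maps, can be illustrated with a toy example. The grid resolution and object positions below are invented; the paper does not specify a cell size.

```python
# Toy illustration of why a discretized grid map loses fine-grained
# distances, while continuous coordinates preserve them.

CELL = 1.0  # assumed grid resolution in meters

def to_grid(x, y):
    """Snap a continuous position to its grid cell index."""
    return (int(x // CELL), int(y // CELL))

# Two objects 0.3 m apart, but inside the same 1 m grid cell.
a, b = (2.1, 3.2), (2.4, 3.2)

# Grid map: both objects collapse to one cell, so their apparent distance is 0.
assert to_grid(*a) == to_grid(*b)

# Continuous coordinates retain the true metric distance.
dist = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(round(dist, 2))  # 0.3
```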
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses continuous object boundary coordinates for spatial layout
Employs supervised fine-tuning with AI2THOR dataset
Applies reinforcement fine-tuning for real-world generalization
Yibin Huang
Faculty of Computing, Harbin Institute of Technology
Wang Xu
Harbin Institute of Technology
natural language processing, artificial intelligence
Wanyue Zhang
Max Planck Institute for Informatics
Video Generation, Animation, Human Scene Interaction, Robotics
Helu Zhi
Faculty of Computing, Harbin Institute of Technology
Jingjing Huang
Tsinghua University
Yangbin Xu
Institute of Microelectronics of the Chinese Academy of Sciences
Yangang Sun
Tsinghua University
Conghui Zhu
Faculty of Computing, Harbin Institute of Technology
Tiejun Zhao
Faculty of Computing, Harbin Institute of Technology