Vision-Language Memory for Spatial Reasoning

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) significantly underperform humans in video-based spatial reasoning, primarily due to misalignment between semantic and geometric representations and the absence of a persistent memory mechanism supporting long-horizon 3D understanding. To address this, we propose a dual-memory-augmented VLM: a working memory employing a sliding window to model short-term dynamic inter-view relationships, and an episodic memory that selectively stores critical 3D structural information. These memories jointly mitigate semantic–geometric mismatch and enable long-term spatial reasoning with fixed computational overhead. Our model learns view-consistent, 3D-aware joint representations directly from 2D videos. It achieves state-of-the-art performance among purely video-based models on multiple video spatial reasoning benchmarks and substantially enhances robots’ long-horizon spatial understanding and reasoning capabilities in dynamic scenes.

📝 Abstract
Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and long-horizon spatial reasoning with a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-only models, significantly advancing the frontier of visual-spatial intelligence.
Problem

Research questions and friction points this paper is trying to address.

Addressing semantic-geometric misalignment in 3D understanding
Developing persistent memory for long-term spatial reasoning
Enhancing video-based spatial reasoning with fixed computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a dual-memory module for long-horizon spatial reasoning
Combines a sliding-window working memory (short-term context) with an episodic memory (consolidated long-term cues)
Learns view-consistent, 3D-aware representations purely from 2D video
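The mechanism behind these bullets can be sketched in a few lines: a fixed-size sliding window for recent views plus a capped, salience-ranked store for long-term cues, which together keep the reasoning context at a constant size. This is a minimal illustrative sketch, not the paper's actual VLM$^2$ implementation; the class name, the scalar `salience` score, and the top-k consolidation rule are all assumptions made for clarity.

```python
from collections import deque

class DualMemory:
    """Hypothetical sketch of a dual-memory buffer: a sliding-window
    working memory plus a fixed-capacity episodic store. Illustrative
    only; not the paper's actual architecture."""

    def __init__(self, window_size=4, episodic_capacity=8):
        self.working = deque(maxlen=window_size)  # short-term: evicts oldest view
        self.episodic = []                        # long-term: (salience, feature) pairs
        self.episodic_capacity = episodic_capacity

    def observe(self, feature, salience):
        # Every new view enters working memory; the deque drops the
        # oldest entry automatically once the window is full.
        self.working.append(feature)
        # Episodic memory selectively consolidates: keep only the
        # top-k entries by salience, so its size stays fixed.
        self.episodic.append((salience, feature))
        self.episodic.sort(key=lambda x: x[0], reverse=True)
        del self.episodic[self.episodic_capacity:]

    def context(self):
        # Reasoning context = recent views + consolidated long-term cues,
        # bounded by window_size + episodic_capacity regardless of video length.
        return list(self.working) + [f for _, f in self.episodic]
```

Because both stores are capped, the context fed to the model never grows with video length, which is the source of the "fixed computational cost" claim.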