Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning

📅 2026-05-26

🤖 AI Summary

Existing video spatial reasoning methods often rely on generic memory mechanisms that admit redundant or irrelevant geometric information, which weakens long-horizon reasoning. This work proposes a question-guided geometric memory framework that accumulates only spatial evidence that is relevant to the question and novel relative to what is already stored. The framework injects camera-conditioned geometric features into visual tokens and uses a dual-memory architecture: a fine-grained context bank for recent dense features and camera states, and a semantic-geometric evidence bank for compact long-range evidence. Each candidate frame is scored by the product of its Q-Former-based question relevance and its novelty with respect to the retained bank; the score guides both writing and reading, and a capacity-based replacement rule keeps the bank compact. Experiments on VSI-Bench and VSTI-Bench show state-of-the-art performance among the evaluated spatial reasoning models, supporting the effectiveness of the approach.

📝 Abstract

Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a generic temporal cache, which can introduce redundant or irrelevant geometry and weaken long-horizon reasoning. We propose Q-GeoMem, a question-guided geometric memory framework for video spatial reasoning. Q-GeoMem injects camera-conditioned geometry into visual tokens and maintains two complementary memories: a Fine-Grained Context Bank for recent dense features and camera states, and a Semantic-Geometric Evidence Bank for compact long-range evidence. Each candidate frame is scored by the product of Q-Former-based question relevance and novelty with respect to the retained bank; this score is stored and reused during reading, while a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before update and adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench show that Q-GeoMem achieves state-of-the-art performance among evaluated spatial reasoning models, validating the effectiveness of question-guided geometric memory. Ablations further verify the contribution of the proposed evidence scoring mechanism.
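
The evidence-scoring rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The abstract only states that the score is the product of question relevance and novelty, that the score is stored and reused during reading, and that a capacity-based rule keeps the bank compact. Everything else here is an assumption made for illustration: the `EvidenceBank` class and its method names are hypothetical, novelty is taken to be one minus the maximum cosine similarity to retained entries, replacement evicts the lowest-scoring entry only when the candidate scores higher, and reading is a score-weighted average. The relevance value is passed in as a plain number, standing in for whatever the Q-Former would produce.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


@dataclass
class EvidenceBank:
    """Hypothetical fixed-capacity bank of (feature, score) pairs."""
    capacity: int
    entries: list = field(default_factory=list)

    def novelty(self, feat):
        # Assumption: novelty = 1 - max similarity to retained entries, clamped to [0, 1].
        if not self.entries:
            return 1.0
        max_sim = max(cosine(feat, f) for f, _ in self.entries)
        return min(1.0, max(0.0, 1.0 - max_sim))

    def write(self, feat, relevance):
        # From the abstract: score = question relevance * novelty.
        score = relevance * self.novelty(feat)
        if len(self.entries) < self.capacity:
            self.entries.append((feat, score))
            return score
        # Assumption: when full, replace the lowest-scoring entry
        # only if the candidate scores strictly higher.
        worst = min(range(len(self.entries)), key=lambda i: self.entries[i][1])
        if score > self.entries[worst][1]:
            self.entries[worst] = (feat, score)
        return score

    def read(self):
        # Assumption: stored scores are reused as weights for a weighted average.
        total = sum(s for _, s in self.entries)
        if not self.entries or total == 0:
            return None
        dim = len(self.entries[0][0])
        return [sum(s * f[d] for f, s in self.entries) / total for d in range(dim)]
```

Under these assumptions, a frame that duplicates stored evidence receives zero novelty and therefore a zero score regardless of its relevance, so it never displaces an existing entry. A frame that is both relevant and dissimilar to the bank can replace the weakest stored entry once capacity is reached.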

Problem

Research questions and friction points this paper is trying to address.

video spatial reasoning

geometric memory

question-guided

long-horizon reasoning

spatial video-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

question-guided memory

geometric reasoning

video spatial reasoning