SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval

📅 2026-01-21
🤖 AI Summary
This work proposes a method for building a queryable, metrically scaled, unified 3D spatial memory from ordinary RGB video, enabling language-guided spatial reasoning and navigation. The approach reconstructs real-scale indoor 3D scenes from first-person video and uses structural elements (walls, doors, and windows) as geometric anchors. It combines open-vocabulary object representations with hierarchical textual descriptions to build a memory in which geometry, semantics, and language are aligned. The system is presented as the first to unify metric anchoring, open-vocabulary semantics, and hierarchical language within a single consistent 3D coordinate frame, supporting compact storage, fast retrieval, and interpretable reasoning over spatial relations. Experiments in three real-world indoor environments show that navigation completion and hierarchical retrieval accuracy remain strong as clutter and occlusion increase.

📝 Abstract
We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single, queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as the first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes -- linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates -- for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-life indoor scenes demonstrate that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.
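To make the anchor-plus-object layout and the spatial-relation queries more concrete, here is a minimal Python sketch. It is not the authors' implementation. The class names (`Anchor`, `ObjectNode`), the nesting of objects under their nearest anchor, the keyword matching in `find_objects`, and the convention that x-y is the ground plane with coordinates in metres are all assumptions made for illustration. Visual embeddings, evidence patches, and visibility checks are left out.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres; x-y ground plane is assumed


@dataclass
class ObjectNode:
    """Hypothetical open-vocabulary object entry with two description layers."""
    label: str
    position: Point
    short_desc: str   # coarse description layer
    long_desc: str    # detailed description layer


@dataclass
class Anchor:
    """Hypothetical structural anchor (wall, door, or window)."""
    kind: str
    position: Point
    objects: List[ObjectNode] = field(default_factory=list)


def distance(a: Point, b: Point) -> float:
    """Euclidean distance in metres between two points in the shared frame."""
    return math.dist(a, b)


def bearing_deg(src: Point, dst: Point) -> float:
    """Horizontal direction from src to dst, in degrees counter-clockwise from +x."""
    return math.degrees(math.atan2(dst[1] - src[1], dst[0] - src[0])) % 360.0


def find_objects(anchors: List[Anchor], anchor_kind: str, keyword: str) -> List[ObjectNode]:
    """Two-step lookup: filter anchors by kind, then match objects by keyword."""
    keyword = keyword.lower()
    return [
        obj
        for anchor in anchors
        if anchor.kind == anchor_kind
        for obj in anchor.objects
        if keyword in obj.label.lower() or keyword in obj.short_desc.lower()
    ]


# Toy scene: one door anchor with a single object registered near it.
door = Anchor("door", (0.0, 0.0, 0.0))
door.objects.append(
    ObjectNode("mug", (3.0, 4.0, 0.0), "red mug", "a red ceramic mug on the shelf beside the door")
)

for obj in find_objects([door], "door", "mug"):
    d = distance(door.position, obj.position)
    b = bearing_deg(door.position, obj.position)
    print(f"{obj.label}: {d:.1f} m from the door, bearing {b:.1f} deg")
```

Because every entry lives in one metric coordinate frame, distance and direction reduce to plain vector arithmetic, which is what keeps such relations interpretable. A real system would presumably replace the keyword match with embedding similarity.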
Problem

Research questions and friction points this paper is trying to address.

3D memory
spatial reasoning
language-guided navigation
metric reconstruction
embodied spatial intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Memory
3D Anchoring
Metric Reconstruction
Open-Vocabulary Object Representation
Language-Guided Navigation