🤖 AI Summary
This work proposes a method for building a queryable, metrically scaled 3D spatial memory from ordinary RGB video, enabling language-guided spatial reasoning and navigation. The approach reconstructs real-scale indoor scenes from first-person video and uses structural elements (walls, doors, and windows) as geometric anchors. It then attaches open-vocabulary object representations and hierarchical textual descriptions to those anchors, so that geometry, semantics, and language share one consistent 3D coordinate frame. The system is presented as the first to unify metric anchoring, open-vocabulary semantics, and hierarchical language in this way, supporting compact storage, fast retrieval, and interpretable reasoning over spatial relations such as distance, direction, and visibility. Experiments in three real-world indoor environments report strong navigation completion and hierarchical retrieval accuracy as clutter and occlusion increase, which the authors take as evidence that the framework is efficient and extensible.
📝 Abstract
We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single, queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as the first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes -- linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates -- for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-life indoor scenes demonstrate that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.
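To make the anchor-plus-object memory layout more concrete, here is a minimal Python sketch of how such a hierarchy could be stored and queried. It is an illustration under stated assumptions, not SpatialMem's implementation: the class names `Anchor` and `ObjectNode`, the helpers `distance`, `bearing_deg`, and `retrieve`, the toy two-dimensional embeddings, and the convention that x-y is the horizontal plane are all hypothetical. Visibility reasoning and the reconstruction pipeline are not modeled.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]  # metres in one shared scene frame (assumed: z is up)


@dataclass
class ObjectNode:
    """Second-layer node: an open-vocabulary object tied to a 3D position."""
    label: str
    position: Vec3
    embedding: List[float]   # toy stand-in for a visual embedding
    short_desc: str = ""     # first textual description layer
    long_desc: str = ""      # second textual description layer


@dataclass
class Anchor:
    """First-layer node: a structural element such as a wall, door, or window."""
    kind: str
    position: Vec3
    objects: List[ObjectNode] = field(default_factory=list)


def distance(a: Vec3, b: Vec3) -> float:
    """Euclidean distance in metres between two points in the scene frame."""
    return math.dist(a, b)


def bearing_deg(src: Vec3, dst: Vec3) -> float:
    """Horizontal direction from src to dst, in degrees counter-clockwise from +x."""
    return math.degrees(math.atan2(dst[1] - src[1], dst[0] - src[0])) % 360.0


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def retrieve(anchors: List[Anchor], query: List[float]) -> Tuple[Anchor, ObjectNode]:
    """Return the (anchor, object) pair whose embedding best matches the query."""
    pairs = [(a, o) for a in anchors for o in a.objects]
    return max(pairs, key=lambda pair: cosine(pair[1].embedding, query))


# Toy scene: one door anchor with two objects attached to it.
mug = ObjectNode("mug", (3.0, 4.0, 0.0), [0.9, 0.1], "a mug", "a white mug on the desk")
lamp = ObjectNode("lamp", (1.0, 0.0, 0.0), [0.1, 0.9], "a lamp", "a floor lamp by the door")
door = Anchor("door", (0.0, 0.0, 0.0), [mug, lamp])

anchor, hit = retrieve([door], [1.0, 0.0])
print(hit.label, distance(anchor.position, hit.position), bearing_deg(anchor.position, hit.position))
```

Keeping every node in one metric frame is what lets relations such as "the mug is 5 m from the door" fall out of plain coordinate arithmetic. A real system would replace the toy embeddings with features from a vision-language model and would need occlusion-aware geometry to answer visibility queries.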