🤖 AI Summary
Lifelong embodied navigation faces challenges in accumulating and reusing cross-task spatial-semantic experience. Existing object-centric memory approaches rely on detection and reconstruction, limiting robustness and scalability. This paper proposes an image-centric long-horizon implicit memory framework: a lightweight, end-to-end visual context compression module—comprising a ViT backbone, frozen DINOv3 features, PixelUnshuffle, and lightweight convolutions—enables configurable high-ratio compression (e.g., 16× yielding only 30 tokens per image), scaling contextual capacity from dozens to hundreds of frames. The memory is jointly trained end-to-end with a Qwen2.5-VL multimodal navigation policy. This paradigm balances interpretability, robustness, and scalability. It achieves state-of-the-art performance on GOAT-Bench and HM3D-OVON: significantly improving exploration efficiency in novel environments and markedly reducing path length in familiar ones. Ablation studies confirm that moderate compression ratios optimally balance accuracy and efficiency.
📝 Abstract
Lifelong embodied navigation requires agents to accumulate, retain, and exploit spatial-semantic experience across tasks, enabling efficient exploration in novel environments and rapid goal reaching in familiar ones. While object-centric memory is interpretable, it depends on detection and reconstruction pipelines that limit robustness and scalability. We propose an image-centric memory framework that achieves long-term implicit memory via an efficient visual context compression module coupled end-to-end with a Qwen2.5-VL-based navigation policy. Built atop a ViT backbone with frozen DINOv3 features and lightweight PixelUnshuffle+Conv blocks, our visual tokenizer supports configurable compression rates; for example, under a representative 16$\times$ compression setting, each image is encoded with about 30 tokens, expanding the effective context capacity from tens to hundreds of images. Experimental results on GOAT-Bench and HM3D-OVON show that our method achieves state-of-the-art navigation performance, improving exploration in unfamiliar environments and shortening paths in familiar ones. Ablation studies further reveal that moderate compression provides the best balance between efficiency and accuracy. These findings position compressed image-centric memory as a practical and scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.
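The compression arithmetic behind the 16$\times$ setting can be sketched in a few lines: PixelUnshuffle folds each $r \times r$ block of ViT patch features into the channel dimension (reducing spatial positions by $r^2$; $r=4$ gives 16$\times$), and a 1$\times$1 projection maps the folded channels back to the token width. This is a minimal NumPy illustration under assumed sizes (a 24$\times$24 patch grid and 1024-dim features; the paper's exact grid, dimensions, and convolution stack are not specified here), with a random matrix standing in for the learned conv:

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Fold each r x r spatial block into the channel dimension.
    x: (C, H, W) -> (C*r*r, H//r, W//r), mirroring torch.nn.PixelUnshuffle."""
    C, H, W = x.shape
    x = x.reshape(C, H // r, r, W // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(C * r * r, H // r, W // r)

rng = np.random.default_rng(0)
C, H, W, r = 1024, 24, 24, 4                 # 24x24 = 576 patch tokens (illustrative sizes)
patches = rng.standard_normal((C, H, W))     # frozen ViT (e.g., DINOv3) features on a spatial grid

folded = pixel_unshuffle(patches, r)         # (C*16, 6, 6): 16x fewer spatial positions
W_proj = rng.standard_normal((C, C * r * r)) * 0.01   # stand-in for the learned 1x1 conv
tokens = (W_proj @ folded.reshape(C * r * r, -1)).T   # (36, 1024) compressed image tokens

print(patches.shape[1] * patches.shape[2], "->", tokens.shape[0], "tokens")  # 576 -> 36 tokens
```

Because each image shrinks from hundreds of patch tokens to a few dozen, the same language-model context window fits roughly $r^2$ times more frames of visual history, which is the scaling the abstract describes.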