EchoVLA: Robotic Vision-Language-Action Model with Synergistic Declarative Memory for Mobile Manipulation

πŸ“… 2025-11-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing vision-language-action (VLA) models struggle with long-horizon mobile manipulation tasks because they lack cross-temporal environmental understanding, spatial memory, and joint navigation-manipulation reasoning. To address this, we propose a dual-memory VLA architecture that integrates a spatial-semantic map (scene memory) with multimodal task experience (episodic memory), enabling cross-episode information retrieval and policy generation. Our method introduces fine-grained multimodal attention fusion, a diffusion-based policy network, and MoMani, a fully automated trajectory generation and optimization framework powered by multimodal large language models. Evaluated in both simulation and real-world settings, our model achieves a 31% success rate (SR) on long-horizon mobile manipulation, outperforming prior baselines by 11 percentage points. To the best of our knowledge, this is the first end-to-end VLA system capable of joint navigation-manipulation decision-making in dynamic, open-world environments through explicit memory augmentation.


πŸ“ Abstract
Recent progress in Vision-Language-Action (VLA) models has enabled embodied agents to interpret multimodal instructions and perform complex tasks. However, existing VLAs are mostly confined to short-horizon, table-top manipulation, lacking the memory and reasoning capability required for long-horizon mobile manipulation, where agents must coordinate navigation and manipulation under changing spatial contexts. In this work, we present EchoVLA, a memory-aware VLA model for long-horizon mobile manipulation. EchoVLA incorporates a synergistic declarative memory inspired by the human brain, consisting of a scene memory that maintains a collection of spatial-semantic maps and an episodic memory that stores task-level experiences with multimodal contextual features. During both training and inference, the two memories are individually stored, updated, and retrieved based on current observations, task history, and instructions, and their retrieved representations are fused via coarse- and fine-grained attention to guide mobile-arm diffusion policies. To support large-scale training and evaluation, we further introduce MoMani, an automated benchmark that generates expert-level long-horizon trajectories through multimodal large language model (MLLM)-guided planning and feedback-driven refinement, supplemented with real-robot demonstrations. Experiments in simulated and real-world settings show that EchoVLA improves long-horizon performance, reaching 0.52 SR on manipulation/navigation and 0.31 on mobile manipulation, exceeding $Ο€_{0.5}$ by +0.08 and +0.11.
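The synergistic declarative memory described above can be pictured as two separately stored and retrieved structures: a scene memory holding spatial-semantic maps and an episodic memory holding task-level experience embeddings. The sketch below is an illustrative reading only, not EchoVLA's implementation; the class names, the grid-cell map representation, and cosine-similarity retrieval are all assumptions.

```python
import numpy as np

class EpisodicMemory:
    """Stores task-level episode embeddings; retrieves by cosine similarity.
    (Hypothetical structure; the paper does not publish this interface.)"""

    def __init__(self):
        self.keys = []    # episode embedding vectors
        self.values = []  # payloads, e.g. trajectories or outcomes

    def store(self, embedding, payload):
        self.keys.append(np.asarray(embedding, dtype=float))
        self.values.append(payload)

    def retrieve(self, query, top_k=1):
        # Rank stored episodes by cosine similarity to the query embedding.
        q = np.asarray(query, dtype=float)
        sims = [k @ q / (np.linalg.norm(k) * np.linalg.norm(q) + 1e-8)
                for k in self.keys]
        order = np.argsort(sims)[::-1][:top_k]
        return [self.values[i] for i in order]

class SceneMemory:
    """Spatial-semantic map: grid cells mapped to semantic labels."""

    def __init__(self):
        self.cells = {}  # (x, y) -> semantic label

    def update(self, xy, label):
        self.cells[tuple(xy)] = label

    def query(self, label):
        # Return all cells carrying the requested semantic label,
        # e.g. "where have I seen a fridge?"
        return [xy for xy, lbl in self.cells.items() if lbl == label]
```

In this reading, the two memories answer complementary questions: the scene memory grounds "where" (navigation targets), while the episodic memory grounds "how" (prior task executions), matching the paper's claim that both are individually stored, updated, and retrieved from current observations, history, and instructions.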
Problem

Research questions and friction points this paper is trying to address.

Developing memory-aware vision-language-action models for long-horizon mobile manipulation tasks
Addressing coordination between navigation and manipulation under changing spatial contexts
Overcoming limitations of existing VLAs confined to short-horizon table-top manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synergistic declarative memory with scene and episodic components
Coarse- and fine-grained attention for memory fusion
MLLM-guided automated benchmark for training trajectories
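One plausible reading of the coarse- and fine-grained memory fusion is a two-stage scheme: a coarse pass weights each retrieved memory by its pooled similarity to the current observation, and a fine pass runs token-level cross-attention from observation tokens into the weighted memory tokens. The function names, tensor shapes, and the specific pooling and weighting below are assumptions, not the paper's published design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: (Tq, d) x (Tk, d) -> (Tq, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def fuse_memories(obs_tokens, scene_tokens, episodic_tokens):
    # Coarse stage: mean-pool each memory and weight it by similarity
    # to the pooled observation (which memory matters right now).
    obs_pool = obs_tokens.mean(axis=0)
    pools = np.stack([scene_tokens.mean(axis=0),
                      episodic_tokens.mean(axis=0)])
    w = softmax(pools @ obs_pool)  # (2,) coarse memory weights

    # Fine stage: token-level cross-attention into the concatenated
    # memories, with each memory's tokens scaled by its coarse weight.
    mem_tokens = np.concatenate([w[0] * scene_tokens,
                                 w[1] * episodic_tokens])
    return cross_attention(obs_tokens, mem_tokens, mem_tokens)
```

The fused tokens would then condition the mobile-arm diffusion policy as its context. Whether EchoVLA scales tokens by a coarse weight or gates entire memories is not specified in this summary; this sketch only shows how a coarse selection and a fine token-level fusion can compose.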
πŸ”Ž Similar Papers
No similar papers found.