AI Summary
Existing vision-language-action (VLA) models struggle with long-horizon mobile manipulation tasks due to insufficient cross-temporal environmental understanding, spatial memory, and joint navigation-manipulation reasoning. To address this, we propose a dual-memory VLA architecture that integrates a spatial-semantic map (scene memory) with multimodal task experience (experience memory), enabling cross-episode information retrieval and policy generation. Our method introduces fine-grained multimodal attention fusion, a diffusion-based policy network, and MoMani, a fully automated trajectory generation and optimization framework powered by multimodal large language models. Evaluated in both simulation and real-world settings, our model achieves a 31% success rate (SR) on long-horizon mobile manipulation, outperforming prior baselines by 11 percentage points. To the best of our knowledge, this is the first end-to-end VLA system capable of joint navigation-manipulation decision-making in dynamic, open-world environments through explicit memory augmentation.
Abstract
Recent progress in Vision-Language-Action (VLA) models has enabled embodied agents to interpret multimodal instructions and perform complex tasks. However, existing VLAs are mostly confined to short-horizon, table-top manipulation, lacking the memory and reasoning capability required for long-horizon mobile manipulation, where agents must coordinate navigation and manipulation under changing spatial contexts. In this work, we present EchoVLA, a memory-aware VLA model for long-horizon mobile manipulation. EchoVLA incorporates a synergistic declarative memory inspired by the human brain, consisting of a scene memory that maintains a collection of spatial-semantic maps and an episodic memory that stores task-level experiences with multimodal contextual features. During both training and inference, the two memories are individually stored, updated, and retrieved based on current observations, task history, and instructions, and their retrieved representations are fused via coarse- and fine-grained attention to guide mobile-arm diffusion policies. To support large-scale training and evaluation, we further introduce MoMani, an automated benchmark that generates expert-level long-horizon trajectories through multimodal large language model (MLLM)-guided planning and feedback-driven refinement, supplemented with real-robot demonstrations. Experiments in simulated and real-world settings show that EchoVLA improves long-horizon performance, reaching 0.52 SR on manipulation/navigation and 0.31 on mobile manipulation, exceeding $\pi_{0.5}$ by +0.08 and +0.11.
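The retrieve-then-fuse idea described in the abstract can be illustrated with a minimal sketch. This is not the EchoVLA implementation: the function names (`retrieve_top_k`, `attention_fuse`), the cosine-similarity retrieval rule, the feature dimension, and the random memory banks are all assumptions made for illustration, and a single scaled dot-product attention step stands in for the coarse- and fine-grained attention the abstract mentions. The diffusion policy that would consume the fused vector is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve_top_k(query, keys, values, k):
    """Return the k values whose keys are most cosine-similar to the query.

    Hypothetical retrieval rule; the source does not specify one.
    """
    q = query / np.linalg.norm(query)
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    scores = kn @ q                    # cosine similarity per memory slot
    idx = np.argsort(-scores)[:k]      # indices of the k best matches
    return values[idx]

def attention_fuse(query, tokens):
    """Single-head scaled dot-product attention of one query over tokens."""
    d = query.shape[0]
    weights = softmax(tokens @ query / np.sqrt(d))
    return weights @ tokens, weights

rng = np.random.default_rng(0)
d = 16  # assumed feature dimension

# Two toy memory banks standing in for scene and episodic memory.
scene_keys, scene_vals = rng.normal(size=(32, d)), rng.normal(size=(32, d))
epi_keys, epi_vals = rng.normal(size=(20, d)), rng.normal(size=(20, d))

# Stand-in for an embedding of the current observation, history, and instruction.
query = rng.normal(size=d)

# Each memory is queried separately, then the retrieved entries are fused.
scene_ctx = retrieve_top_k(query, scene_keys, scene_vals, k=4)
epi_ctx = retrieve_top_k(query, epi_keys, epi_vals, k=4)
memory_tokens = np.concatenate([scene_ctx, epi_ctx], axis=0)

fused, weights = attention_fuse(query, memory_tokens)
print(fused.shape, weights.shape)
```

In this sketch the two memories are kept as separate banks and only merged at fusion time, which mirrors the abstract's statement that they are individually stored and retrieved; in a full system the fused vector would serve as conditioning input to the action policy.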