EchoVLA: Robotic Vision-Language-Action Model with Synergistic Declarative Memory for Mobile Manipulation

📅 2025-11-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models struggle with long-horizon mobile manipulation because they lack cross-temporal environmental understanding, spatial memory, and joint navigation-manipulation reasoning. To address this, we propose EchoVLA, a dual-memory VLA architecture that combines a scene memory (a collection of spatial-semantic maps) with an episodic memory (task-level experiences with multimodal contextual features). Both memories are retrieved based on current observations, task history, and instructions; the retrieved representations are fused via coarse- and fine-grained attention and guide mobile-arm diffusion policies. We also introduce MoMani, an automated benchmark that generates long-horizon trajectories through multimodal large language model (MLLM)-guided planning and feedback-driven refinement, supplemented with real-robot demonstrations. Evaluated in both simulation and real-world settings, our model reaches a success rate (SR) of 0.52 on manipulation/navigation and 0.31 on mobile manipulation, exceeding $\pi_{0.5}$ by 0.08 and 0.11, respectively. To the best of our knowledge, this is the first end-to-end VLA system capable of joint navigation-manipulation decision-making in dynamic, open-world environments through explicit memory augmentation.

📝 Abstract

Recent progress in Vision-Language-Action (VLA) models has enabled embodied agents to interpret multimodal instructions and perform complex tasks. However, existing VLAs are mostly confined to short-horizon, table-top manipulation, lacking the memory and reasoning capability required for long-horizon mobile manipulation, where agents must coordinate navigation and manipulation under changing spatial contexts. In this work, we present EchoVLA, a memory-aware VLA model for long-horizon mobile manipulation. EchoVLA incorporates a synergistic declarative memory inspired by the human brain, consisting of a scene memory that maintains a collection of spatial-semantic maps and an episodic memory that stores task-level experiences with multimodal contextual features. During both training and inference, the two memories are individually stored, updated, and retrieved based on current observations, task history, and instructions, and their retrieved representations are fused via coarse- and fine-grained attention to guide mobile-arm diffusion policies. To support large-scale training and evaluation, we further introduce MoMani, an automated benchmark that generates expert-level long-horizon trajectories through multimodal large language model (MLLM)-guided planning and feedback-driven refinement, supplemented with real-robot demonstrations. Experiments in simulated and real-world settings show that EchoVLA improves long-horizon performance, reaching 0.52 SR on manipulation/navigation and 0.31 SR on mobile manipulation, exceeding $\pi_{0.5}$ by +0.08 and +0.11, respectively.
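
The retrieve-then-fuse flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not EchoVLA's implementation: the function names (`retrieve_top_k`, `attend`, `fuse_memories`), the cosine-similarity top-k retrieval, the dictionary layout of each memory, and the use of one plain scaled dot-product attention step as a stand-in for the paper's coarse- and fine-grained attention are all assumptions, since the abstract does not specify those details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve_top_k(query, keys, values, k):
    """Return the k memory values whose keys are most cosine-similar to the query.

    Hypothetical retrieval rule; the abstract only says memories are
    retrieved from observations, task history, and instructions.
    """
    q = query / np.linalg.norm(query)
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    scores = kn @ q                      # (num_entries,)
    idx = np.argsort(-scores)[:k]        # indices of the k best matches
    return values[idx]

def attend(query, tokens):
    """Single-head scaled dot-product attention of one query over memory tokens."""
    d = query.shape[-1]
    weights = softmax(tokens @ query / np.sqrt(d))  # (num_tokens,)
    return weights @ tokens, weights                # context has shape (d,)

def fuse_memories(query, scene_mem, episodic_mem, k=2):
    """Retrieve from each memory separately, then fuse the results with attention."""
    scene_tokens = retrieve_top_k(query, scene_mem["keys"], scene_mem["values"], k)
    epi_tokens = retrieve_top_k(query, episodic_mem["keys"], episodic_mem["values"], k)
    tokens = np.concatenate([scene_tokens, epi_tokens], axis=0)  # (2k, d)
    return attend(query, tokens)

# Toy data: random embeddings stand in for map features and stored experiences.
rng = np.random.default_rng(0)
d = 8
scene_mem = {"keys": rng.normal(size=(5, d)), "values": rng.normal(size=(5, d))}
episodic_mem = {"keys": rng.normal(size=(4, d)), "values": rng.normal(size=(4, d))}
query = rng.normal(size=d)  # stand-in for the encoded observation and instruction

context, weights = fuse_memories(query, scene_mem, episodic_mem, k=2)
print(context.shape, weights.shape)
```

In this sketch the fused `context` vector is what would condition a downstream policy; the diffusion policy itself is out of scope here.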

Problem

Research questions and friction points this paper is trying to address.

Developing memory-aware vision-language-action models for long-horizon mobile manipulation tasks

Addressing coordination between navigation and manipulation under changing spatial contexts

Overcoming limitations of existing VLAs confined to short-horizon table-top manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synergistic declarative memory with scene and episodic components

Coarse- and fine-grained attention for memory fusion

MLLM-guided automated benchmark for training trajectories
