🤖 AI Summary
Weak generalization and insufficient embodiment grounding hinder the deployment of vision-language models (VLMs) in real-world robotic systems. To address this, we propose ExpTeach, a framework that combines a self-generated long-term experience memory with retrieval-augmented generation (RAG) to enable cross-task knowledge reuse, enhanced spatial understanding, and autonomous closed-loop learning. ExpTeach autonomously plans, executes, verifies outcomes, reflects on failures, and dynamically updates its memory, supporting advanced embodied behaviors such as creative tool use. Technically, it integrates on-demand image annotation, self-supervised experience summarization, and VLM-driven closed-loop optimization. Across 12 real-world scenarios (eight of them unseen), grounding with long-term memory raises single-trial success rates from 22% to 80%; on four challenging tasks, reflection raises success from 36% to 84%, markedly improving zero-shot transfer and continual adaptation.
📝 Abstract
Vision-language models (VLMs) have been widely adopted in robotics to enable autonomous planning. However, grounding VLMs, originally trained on internet data, to diverse real-world robots remains a challenge. This paper presents ExpTeach, a framework that grounds VLMs to physical robots by building a self-generated memory of real-world experiences. In ExpTeach, the VLM autonomously plans actions, verifies outcomes, reflects on failures, and adapts robot behaviors in a closed loop. The self-generated experiences during this process are then summarized into a long-term memory, enabling retrieval of learned knowledge to guide future tasks via retrieval-augmented generation (RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with an on-demand image annotation module. In experiments, we show that reflection improves success rates from 36% to 84% on four challenging robotic tasks and observe the emergence of intelligent object interactions, including creative tool use. Across extensive tests on 12 real-world scenarios (including eight unseen ones), we find that grounding with long-term memory boosts single-trial success rates from 22% to 80%, demonstrating the effectiveness and generalizability of ExpTeach.
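The closed loop described above (plan, execute, verify outcomes, reflect on failures, summarize experiences into a long-term memory, and retrieve them for future tasks) can be sketched in a few dozen lines. This is a minimal illustrative sketch, not the paper's implementation: all names here (`ExperienceMemory`, `run_task`, the word-overlap retrieval, and the stub executor) are hypothetical stand-ins for the VLM planner, robot execution, and the paper's actual RAG retriever.

```python
# Sketch of an ExpTeach-style closed loop with a self-generated long-term
# experience memory. Hypothetical names throughout; a real system would call
# a VLM for planning/reflection and a robot for execution instead of stubs.
from dataclasses import dataclass, field


@dataclass
class Experience:
    task: str    # natural-language task description
    lesson: str  # summarized takeaway from a past failed attempt


@dataclass
class ExperienceMemory:
    entries: list = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.entries.append(exp)

    def retrieve(self, task: str, k: int = 2) -> list:
        # Naive RAG stand-in: rank stored lessons by word overlap with the
        # current task, so related tasks can reuse each other's lessons.
        words = set(task.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: len(words & set(e.task.lower().split())),
            reverse=True,
        )
        return [e for e in scored[:k] if words & set(e.task.lower().split())]


def run_task(task: str, memory: ExperienceMemory, execute, max_attempts: int = 3) -> bool:
    """Plan -> execute -> verify -> reflect loop; failures become stored lessons."""
    for _ in range(max_attempts):
        hints = [e.lesson for e in memory.retrieve(task)]
        plan = {"task": task, "hints": hints}  # a VLM would produce the plan here
        success, feedback = execute(plan)      # robot execution + outcome verification
        if success:
            return True
        # Reflection: summarize the failure into a reusable lesson for future tasks.
        memory.add(Experience(task=task, lesson=feedback))
    return False
```

A toy run shows the mechanism: an executor that fails until a retrieved hint is available causes the loop to fail once, store a lesson, and then succeed on the retry; a later related task retrieves the same lesson without re-failing.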