ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents

📅 2026-03-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Embodied agents driven by vision-language models often fail at complex tasks because of a gap between static training data and dynamic physical interaction, which leads to skipped steps, invalid actions, and repeated errors. To address this, the work proposes ELITE, a framework that combines two complementary mechanisms: self-reflective knowledge construction and intent-aware retrieval. Self-reflective knowledge construction extracts reusable strategies from execution trajectories and maintains a structured strategy pool, enabling online self-improvement without supervision, while intent-aware retrieval selects relevant strategies from the pool to transfer knowledge across procedurally similar tasks. On the EB-ALFRED and EB-Habitat benchmarks, ELITE improves over base VLMs by 9% and 5%, respectively, in the unsupervised online setting. In the supervised setting, it generalizes to unseen task categories and outperforms state-of-the-art training-based methods.

📝 Abstract
Vision-language models (VLMs) have shown remarkable general capabilities, yet embodied agents built on them fail at complex tasks, often skipping critical steps, proposing invalid actions, and repeating mistakes. These failures arise from a fundamental gap between the static training data of VLMs and the physical interaction that embodied tasks require. VLMs can learn rich semantic knowledge from static data but lack the ability to interact with the world. To address this issue, we introduce ELITE, an embodied agent framework with Experiential Learning and Intent-aware Transfer that enables agents to continuously learn from their own environment interaction experiences and transfer acquired knowledge to procedurally similar tasks. ELITE operates through two synergistic mechanisms, i.e., self-reflective knowledge construction and intent-aware retrieval. Specifically, self-reflective knowledge construction extracts reusable strategies from execution trajectories and maintains an evolving strategy pool through structured refinement operations. Then, intent-aware retrieval identifies relevant strategies from the pool and applies them to current tasks. Experiments on the EB-ALFRED and EB-Habitat benchmarks show that ELITE improves performance over base VLMs by 9% and 5%, respectively, in the online setting without any supervision. In the supervised setting, ELITE generalizes effectively to unseen task categories, achieving better performance than state-of-the-art training-based methods. These results demonstrate the effectiveness of ELITE for bridging the gap between semantic understanding and reliable action execution.
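The two mechanisms described in the abstract, an evolving strategy pool and intent-based retrieval from it, can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Strategy`, `StrategyPool`, `similarity`, `merge_threshold`) is hypothetical, word-overlap (Jaccard) similarity stands in for whatever intent matching ELITE actually uses, and the single "update on near-duplicate intent" rule is an assumed placeholder for the paper's structured refinement operations. The example intents and advice strings are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Strategy:
    intent: str  # short description of the goal this strategy serves (hypothetical field)
    advice: str  # reusable guidance distilled from a past trajectory (hypothetical field)


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; an assumed stand-in for intent matching."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class StrategyPool:
    strategies: list = field(default_factory=list)
    merge_threshold: float = 0.8  # assumed cutoff for treating two intents as the same

    def add(self, intent: str, advice: str) -> None:
        """Add a strategy, or update an existing one whose intent is nearly identical."""
        for s in self.strategies:
            if similarity(s.intent, intent) >= self.merge_threshold:
                s.advice = advice  # refinement: replace the older advice
                return
        self.strategies.append(Strategy(intent, advice))

    def retrieve(self, task_intent: str, k: int = 1) -> list:
        """Return the k strategies whose intent best matches the current task."""
        ranked = sorted(
            self.strategies,
            key=lambda s: similarity(s.intent, task_intent),
            reverse=True,
        )
        return ranked[:k]


pool = StrategyPool()
pool.add("heat an object then place it", "Open the microwave before putting the object inside.")
pool.add("clean an object then place it", "Turn on the faucet after placing the object in the sink.")
# Same intent seen again after a later episode: the stored advice is refined, not duplicated.
pool.add("heat an object then place it", "Pick up the object, open the microwave, then heat it.")

best = pool.retrieve("heat a mug then place it on the table")[0]
print(best.advice)
```

In this sketch the pool stays small because repeated intents update an entry in place, and a new task reuses guidance from a differently worded but procedurally similar past task. A real system would presumably replace the word-overlap score with a learned or VLM-based intent comparison.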
Problem

Research questions and friction points this paper is trying to address.

embodied agents
vision-language models
physical interaction
task execution
learning gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Experiential Learning
Intent-Aware Transfer
Self-Reflective Knowledge Construction
Embodied Agents
Strategy Retrieval
Bingqing Wei
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Zhongyu Xia
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Dingai Liu
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Xiaoyu Zhou
Peking University
Computer Vision, Autonomous Driving, AI Security
Zhiwei Lin
Peking University
3D perception, open-world perception, self-supervised learning, autonomous driving
Yongtao Wang
Wangxuan Institute of Computer Technology, Peking University, Beijing, China