ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
How to efficiently transform lightweight vision-language models (VLMs) into embodied agents capable of perception, reasoning, and interaction remains a central challenge in embodied AI. This paper proposes ERA, a two-stage framework: Stage I, Embodied Prior Learning, distills trajectory-augmented, environment-anchored, and external-knowledge priors to close the capability gap in compact models; Stage II runs online reinforcement learning with self-summarization, dense reward shaping, and turn-level policy optimization to address long-horizon planning and sparse rewards. Evaluated on the EB-ALFRED and EB-Manipulation benchmarks, ERA-3B surpasses GPT-4o by 8.4% and 19.4%, respectively, demonstrating strong generalization and sample efficiency. These results validate lightweight embodied agents as a practical, state-of-the-art approach.

📝 Abstract
Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing systems rely on large-scale models that are costly to deploy, while smaller VLMs lack the necessary knowledge and skills to succeed. To bridge this gap, we present Embodied Reasoning Agent (ERA), a two-stage framework that integrates prior knowledge learning and online reinforcement learning (RL). The first stage, Embodied Prior Learning, distills foundational knowledge from three types of data: (1) Trajectory-Augmented Priors, which enrich existing trajectory data with structured reasoning generated by stronger models; (2) Environment-Anchored Priors, which provide in-environment knowledge and grounding supervision; and (3) External Knowledge Priors, which transfer general knowledge from out-of-environment datasets. In the second stage, we develop an online RL pipeline that builds on these priors to further enhance agent performance. To overcome the inherent challenges in agent RL, including long horizons, sparse rewards, and training instability, we introduce three key designs: self-summarization for context management, dense reward shaping, and turn-level policy optimization. Extensive experiments on both high-level planning (EB-ALFRED) and low-level control (EB-Manipulation) tasks demonstrate that ERA-3B surpasses both prompting-based large models and previous training-based baselines. Specifically, it achieves overall improvements of 8.4% on EB-ALFRED and 19.4% on EB-Manipulation over GPT-4o, and exhibits strong generalization to unseen tasks. Overall, ERA offers a practical path toward scalable embodied intelligence, providing methodological insights for future embodied AI systems.
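The abstract's "self-summarization for context management" can be illustrated with a minimal sketch. The paper's exact mechanism is not specified in this summary; the general idea is that once an agent's interaction history exceeds a turn budget, older turns are compressed into a single summary entry so the context stays bounded over long horizons. The `summarize` helper below is a hypothetical stand-in for a model-generated summary.

```python
def summarize(turns):
    """Placeholder: a real agent would ask the VLM to summarize its own history."""
    return f"[summary of {len(turns)} earlier turns]"

def manage_context(history, max_turns=4, keep_recent=2):
    """Compress old turns into one summary entry once the budget is exceeded.

    Recent turns are kept verbatim so the agent retains fine-grained context
    for its next action, while the long tail is folded into a summary.
    """
    if len(history) <= max_turns:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent

# A six-turn episode collapses to one summary plus the two most recent turns.
history = [f"turn {i}" for i in range(6)]
compact = manage_context(history)
```

The design choice here is that summarization is triggered by a fixed budget; a token-count trigger would work the same way.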
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between costly large VLMs and underperforming smaller VLMs for embodied AI.
Integrating prior knowledge learning and online RL to enhance agent capabilities in complex environments.
Overcoming long horizons, sparse rewards, and instability in embodied agent reinforcement learning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework integrates prior learning and online RL
Embodied Prior Learning distills knowledge from three data types
Online RL pipeline uses self-summarization and dense reward shaping
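The dense reward shaping listed above can be sketched in a generic form. The paper's specific reward design is not detailed in this summary; what follows is classic potential-based shaping, one standard way to densify a sparse task reward without changing the optimal policy. The progress potential `phi` (e.g., fraction of subtasks completed) is an illustrative assumption, not ERA's actual signal.

```python
def shaped_reward(r_env, phi_s, phi_s_next, gamma=0.99):
    """r' = r + gamma * phi(s') - phi(s): a dense signal from progress estimates."""
    return r_env + gamma * phi_s_next - phi_s

# Sparse environment reward: 0 until final success. The potential rises as
# hypothetical subtasks complete (0.0 -> 0.5 -> 1.0), so every step carries signal.
trajectory = [
    (0.0, 0.0, 0.5),  # (env reward, phi(s), phi(s'))
    (0.0, 0.5, 1.0),
    (1.0, 1.0, 1.0),  # task success
]
rewards = [shaped_reward(r, p, pn) for r, p, pn in trajectory]
```

With shaping, intermediate steps receive nonzero reward even though the environment only pays out at success, which is the property that eases long-horizon credit assignment.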
🔎 Similar Papers
2024-07-09 · IEEE/ASME Transactions on Mechatronics · Citations: 94
2024-10-04 · International Conference on Learning Representations · Citations: 0