LAGEA: Language Guided Embodied Agents for Robotic Manipulation

📅 2025-09-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of using natural language feedback to improve embodied agents' error diagnosis and behavioral correction in robotic manipulation tasks. To this end, the authors propose LAGEA, a framework that (1) aligns temporally grounded visual states with structured linguistic reflections generated by a vision-language model (VLM) in a shared representation space, so that language feedback maps to temporally localized reward signals; and (2) introduces an adaptive, failure-aware weighting coefficient that modulates the influence of that feedback as the agent refines its policy through trial and error. Combining VLMs, reinforcement learning, temporal grounding, and reward shaping, LAGEA forms a closed-loop, language-driven learning mechanism. On the Meta-World MT10 benchmark, LAGEA improves average success over prior state-of-the-art methods by 9.0% (random goals) and 5.3% (fixed goals), while converging faster.

📝 Abstract

Robotic manipulation benefits from foundation models that describe goals, but today's agents still lack a principled way to learn from their own mistakes. We ask whether natural language can serve as feedback, an error reasoning signal that helps embodied agents diagnose what went wrong and correct course. We introduce LAGEA (Language Guided Embodied Agents), a framework that turns episodic, schema-constrained reflections from a vision language model (VLM) into temporally grounded guidance for reinforcement learning. LAGEA summarizes each attempt in concise language, localizes the decisive moments in the trajectory, aligns feedback with visual state in a shared representation, and converts goal progress and feedback agreement into bounded, step-wise shaping rewards whose influence is modulated by an adaptive, failure-aware coefficient. This design yields dense signals early when exploration needs direction and gracefully recedes as competence grows. On the Meta-World MT10 embodied manipulation benchmark, LAGEA improves average success over the state-of-the-art (SOTA) methods by 9.0% on random goals and 5.3% on fixed goals, while converging faster. These results support our hypothesis: language, when structured and grounded in time, is an effective mechanism for teaching robots to self-reflect on mistakes and make better choices. Code will be released soon.
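
To make the idea of bounded, step-wise shaping rewards concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formula, so the cosine similarity, the `tanh` squashing, and the names `cosine`, `shaping_rewards`, `goal_emb`, and `feedback_emb` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def shaping_rewards(state_embs, goal_emb, feedback_emb, scale=1.0):
    """Return one bounded shaping term per step, each in [-scale, scale].

    Assumed (hypothetical) form, not taken from the paper:
      progress  = change in state-goal similarity since the previous step
      agreement = similarity between the state and the feedback embedding
      reward    = scale * tanh(progress + agreement)
    """
    rewards = []
    prev_goal_sim = None
    for state in state_embs:
        goal_sim = cosine(state, goal_emb)
        progress = 0.0 if prev_goal_sim is None else goal_sim - prev_goal_sim
        agreement = cosine(state, feedback_emb)
        # tanh keeps the per-step signal bounded regardless of input scale.
        rewards.append(scale * math.tanh(progress + agreement))
        prev_goal_sim = goal_sim
    return rewards


# Toy 2-D embeddings: the trajectory rotates toward the goal direction.
trajectory = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
rewards = shaping_rewards(trajectory, goal_emb=[0.0, 1.0], feedback_emb=[0.0, 1.0])
```

In this toy trajectory the first step is orthogonal to both the goal and the feedback, so its shaping term is zero, while later steps that move toward the goal receive a positive but bounded signal.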

Problem

Research questions and friction points this paper is trying to address.

Enabling robots to learn from mistakes using natural language feedback

Converting vision-language model reflections into reinforcement learning guidance

Improving robotic manipulation success rates through structured self-reflection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language feedback guides robotic agents to correct errors

VLM episodic reflections convert into reinforcement learning rewards

Adaptive reward modulation improves exploration and task success
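
The adaptive modulation described above can be illustrated with a small Python sketch. It assumes a simple schedule in which the shaping weight shrinks as a running success estimate grows. The class `AdaptiveCoefficient`, its `momentum` parameter, and the linear `1 - success_rate` rule are hypothetical and are not taken from the paper.

```python
class AdaptiveCoefficient:
    """Hypothetical failure-aware weight: large while the agent fails, small once it succeeds."""

    def __init__(self, max_weight=1.0, momentum=0.9):
        self.max_weight = max_weight
        self.momentum = momentum
        self.success_rate = 0.0  # exponential moving average of episode success

    @property
    def weight(self):
        # Assumed linear rule: full weight at 0% success, zero weight at 100%.
        return self.max_weight * (1.0 - self.success_rate)

    def update(self, episode_succeeded):
        """Fold one episode outcome into the running success estimate."""
        target = 1.0 if episode_succeeded else 0.0
        self.success_rate = (
            self.momentum * self.success_rate + (1.0 - self.momentum) * target
        )
        return self.weight


def shaped_reward(env_reward, shaping_term, weight):
    """Combine the environment reward with the weighted language-derived term."""
    return env_reward + weight * shaping_term


coef = AdaptiveCoefficient()
initial_weight = coef.weight              # no successes yet: full shaping weight
after_failure = coef.update(False)        # failures keep the weight high
after_success = coef.update(True)         # a success lowers it
for _ in range(50):
    coef.update(True)
late_weight = coef.weight                 # sustained success: shaping mostly recedes
```

Under this assumed schedule the language-derived signal dominates early, when most episodes fail, and fades as the running success estimate approaches one, which mirrors the "recedes as competence grows" behavior the abstract describes.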

🔎 Similar Papers

Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation