🤖 AI Summary
This work addresses a central bottleneck in embodied intelligence development, namely its reliance on handcrafted rewards and manual hyperparameter tuning, by introducing the first closed-loop benchmark that enables large language model (LLM) agents to autonomously develop embodied policies. Built on a high-fidelity simulation environment, the framework integrates reinforcement learning, imitation learning, and diffusion policies, using executable code as the interface to support dynamic iteration across drafting, debugging, and optimization. Evaluated on 32 expert-designed tasks, the agent achieves an average success rate 26.5% higher than human-engineered baselines, demonstrates self-recovery from near-failure states, and substantially narrows the performance gap between open-source and closed-source models. This marks a shift from static code generation to autonomous policy engineering.
📝 Abstract
The field of Embodied AI is evolving rapidly toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling remains severely bottlenecked by a reliance on labor-intensive manual oversight, from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by LLMs' success in software automation and scientific discovery, we introduce EmboCoach-Bench, a benchmark that evaluates the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert-curated reinforcement learning (RL) and imitation learning (IL) tasks, our framework posits executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow in which agents leverage environment feedback to iteratively draft, debug, and optimize solutions, spanning improvements from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can surpass human-engineered baselines by 26.5% in average success rate; (2) an agentic workflow with environment feedback strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities in pathological engineering cases, successfully resurrecting task performance from near-total failures through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in embodied AI.
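The closed-loop workflow described in the abstract (draft a training script, evaluate it in simulation, revise from feedback) can be sketched in a few lines. The sketch below is illustrative only: `propose`, `evaluate_in_sim`, the script names, and the stopping threshold are all stand-ins of our own, not the paper's actual interfaces, and the stub "LLM" simply nudges quality upward where a real agent would condition on error logs and metrics.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    script: str          # candidate training script (reward fn, hyperparams, ...)
    success_rate: float  # task success rate measured in simulation


def propose(history):
    """Stand-in for the LLM proposer: drafts or revises a training script.

    This stub just 'improves' on the best prior attempt; a real agent
    would read simulator error logs and metrics to decide what to change.
    """
    best = max((a.success_rate for a in history), default=0.0)
    quality = min(1.0, best + 0.25)
    return f"train_policy_v{len(history)}.py", quality


def evaluate_in_sim(script, quality):
    """Stand-in for a simulation rollout: returns task success rate.

    A real benchmark would execute the script against the environment;
    here the stub returns the proposer's quality directly.
    """
    return quality


def closed_loop(max_iters=6, target=0.8):
    """Iterate draft -> simulate -> revise until the target is reached."""
    history = []
    for _ in range(max_iters):
        script, quality = propose(history)
        success = evaluate_in_sim(script, quality)
        history.append(Attempt(script, success))
        if success >= target:  # stop once the policy is good enough
            break
    return max(history, key=lambda a: a.success_rate)
```

The key design point the sketch captures is that environment feedback, not a single static generation, drives each revision; swapping the stubs for a real LLM call and a real simulator turns the same loop into an autonomous policy-engineering agent.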