🤖 AI Summary
Traditional robot policy optimization relies on gradient-based learning or fine-tuning, which limits interpretability, generalizability, and applicability in low-data or real-world settings.
Method: We propose a large language model (LLM)-based iterative self-improvement framework for robot policies—requiring neither gradients nor parameter fine-tuning. We first uncover LLMs’ intrinsic capability for stochastic numerical optimization and design the SAS (Summarize–Analyze–Synthesize) prompting framework, which unifies policy reasoning, trajectory retrieval, feedback synthesis, and policy update within a single prompt. Integrated with a memory of robot-executed trajectories and iterative human- or environment-derived feedback, the framework enables autonomous policy evolution.
Contribution/Results: Evaluated on simulated and real-world table tennis tasks, our method significantly improves task success rate and cross-scenario behavioral generalization. The results empirically validate LLMs as effective, interpretable, gradient-free universal policy optimizers.
📝 Abstract
We demonstrate the ability of large language models (LLMs) to perform iterative self-improvement of robot policies. An important insight of this paper is that LLMs have a built-in ability to perform (stochastic) numerical optimization and that this property can be leveraged for explainable robot policy search. Based on this insight, we introduce the SAS Prompt (Summarize, Analyze, Synthesize) -- a single prompt that enables iterative learning and adaptation of robot behavior by combining the LLM's ability to retrieve, reason and optimize over previous robot traces in order to synthesize new, unseen behavior. Our approach can be regarded as an early example of a new family of explainable policy search methods that are entirely implemented within an LLM. We evaluate our approach both in simulation and on a real-robot table tennis task. Project website: sites.google.com/asu.edu/sas-llm/
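The iterative loop described above — summarize past robot traces, analyze them, and synthesize improved parameters in a single prompt — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `stub_llm` is a placeholder that stands in for a real LLM call (it parses the trials back out of the prompt and nudges the best-seen parameters), the trajectory format and prompt wording are assumptions, and the objective is a toy one-dimensional reward.

```python
import ast
import re
from dataclasses import dataclass

@dataclass
class Trajectory:
    params: list   # numeric policy parameters
    reward: float  # scalar task feedback

def build_sas_prompt(memory):
    # A single prompt unifying the three SAS stages over the trace memory.
    lines = [
        "Summarize the robot trials below, analyze which parameters",
        "correlate with higher reward, then synthesize an improved",
        "parameter list (floats only).",
    ]
    for i, t in enumerate(memory):
        lines.append(f"trial {i}: params={t.params} reward={t.reward:.3f}")
    return "\n".join(lines)

def stub_llm(prompt):
    # Placeholder for a real LLM query: recovers the trials from the prompt
    # text, keeps the best-scoring parameters, and nudges them upward.
    # A real LLM would reason over the same text and propose its own update.
    best = None
    for m in re.finditer(r"params=(\[[^\]]*\]) reward=([-\d.]+)", prompt):
        p, r = ast.literal_eval(m.group(1)), float(m.group(2))
        if best is None or r > best[1]:
            best = (p, r)
    return [x + 0.5 for x in best[0]]

def reward_fn(params):
    # Toy objective: reward peaks at params == [3.0].
    return -(params[0] - 3.0) ** 2

# Gradient-free, prompt-only optimization: each iteration feeds the full
# trajectory memory back to the "LLM" and executes its proposed policy.
memory = [Trajectory([0.0], reward_fn([0.0]))]
for _ in range(5):
    new_params = stub_llm(build_sas_prompt(memory))
    memory.append(Trajectory(new_params, reward_fn(new_params)))

print(round(memory[-1].params[0], 1))  # parameters climb toward the optimum
```

Note that no gradients are computed anywhere: all "optimization" happens through text, which is what makes the search explainable — the same prompt that updates the policy can also be asked to justify the update.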