🤖 AI Summary
Problem: Large language model (LLM) agents are highly sensitive to their prompts during tool invocation and multi-step reasoning, yet they often lack efficient terminal-state validators (such as exact ground-truth answers or low-cost evaluators) that would make execution reliably verifiable.
Method: This paper proposes a progressive prompt optimization framework that operates without terminal-state verification. Its core innovation is a gradient-descent-like prompt update mechanism guided by intermediate interaction feedback, integrating dialogue history modeling, reflection-driven prompt rewriting, and iterative refinement of step-by-step instructions, adapted to structured planning formalisms such as PDDL.
Contribution/Results: The method eliminates reliance on gold-standard answers and expensive evaluation oracles, enabling continuous online optimization. Experiments on three diverse tasks (PDDL generation, travel planning, and meeting scheduling) demonstrate substantial improvements in success rate and robustness, confirming strong cross-domain generalization.
📝 Abstract
In the past year, large language models (LLMs) have achieved remarkable success in domains beyond traditional natural language processing, and their capabilities are further extended into so-called LLM agents when they are connected with external tools. Across all of these domains, the prompt given to the LLM has been shown to strongly influence what the LLM generates, and thus affects the performance of LLM agents. Therefore, automatic prompt engineering (APE) has become an important question for many researchers and users of LLMs. However, previous works in APE rely on a final checker to evaluate the performance of the given prompt -- a requirement that is hard to meet in the case of LLM agents, where intermediate feedback is easier to obtain, and the final evaluation could be expensive, inaccurate, or even missing. In this paper, we propose a novel method, RePrompt, which performs a "gradient descent"-like optimization of the step-by-step instructions in the prompts given to LLM agents, based on the chat history obtained from interactions and reflections with LLM agents. By leveraging intermediate feedback, RePrompt can optimize the prompt without the need for a final solution checker. We evaluate our approach on PDDL generation, TravelPlanner, and Meeting Planning to show that our method can generally improve performance for different reasoning tasks.
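The optimization loop described above can be sketched in a few lines. This is a minimal, self-contained illustration of the idea, not the paper's implementation: the function names are hypothetical, and the "agent" and "reflection" calls are stubs standing in for real LLM interactions.

```python
# Hedged sketch of a RePrompt-style loop. In a real system, run_agent
# would invoke an LLM agent with tools, and reflect / rewrite_prompt
# would themselves be LLM calls; here they are toy stubs.

def run_agent(prompt, task):
    """Stub agent episode: returns a chat history with intermediate feedback."""
    ok = "check preconditions" in prompt  # toy success condition
    return [("instruction", prompt), ("task", task),
            ("feedback", f"ok on {task}" if ok else f"step failed on {task}")]

def reflect(histories):
    """Stub reflection: summarize recurring failures from intermediate feedback."""
    failures = [msg for hist in histories for role, msg in hist
                if role == "feedback" and "failed" in msg]
    return "add precondition checks" if failures else None

def rewrite_prompt(prompt, critique):
    """'Gradient step': apply the critique as an edit to the instructions."""
    if critique == "add precondition checks":
        return prompt + " Before each action, check preconditions."
    return prompt

def reprompt(prompt, tasks, epochs=3):
    """Iterate: run agent, reflect on chat histories, update the prompt.
    No final solution checker is needed, only intermediate feedback."""
    for _ in range(epochs):
        histories = [run_agent(prompt, t) for t in tasks]
        critique = reflect(histories)
        if critique is None:  # no recurring failures: stop updating
            break
        prompt = rewrite_prompt(prompt, critique)
    return prompt
```

The key point the sketch captures is that the update signal comes from the chat histories themselves (intermediate feedback plus reflection), so the loop converges without ever consulting a terminal-state evaluator.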