🤖 AI Summary
Large language models (LLMs) suffer from error propagation when early assumptions are incorrect, and struggle to track evolving user goals across multi-turn interactions, resulting in weak long-horizon planning. To address this, we propose a parameter-free, reinforcement learning-inspired prompt optimization framework that dynamically rewrites task instructions through turn-by-turn feedback generation, experience replay, and a meta-prompting agent, enabling the model to maintain long-term goal consistency. The approach requires no model parameter updates, relying solely on prompt engineering and external feedback, and generalizes well across models. Evaluated on multi-turn benchmarks, including text-to-SQL and task-oriented dialogue, it yields significant performance gains, demonstrating effectiveness in sustained understanding, goal tracking, and zero-shot transfer. This work establishes a lightweight, scalable paradigm for long-horizon planning in LLMs.
📝 Abstract
Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often committing to incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior work on dialogue systems has shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning by modifying only the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our method yields significant improvements on multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research into reinforcement learning-inspired, parameter-free optimisation methods.
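The optimisation loop described above (turn-by-turn feedback, experience replay, and meta-prompt rewriting) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`critique_turn`, `rewrite_prompt`, `ReplayBuffer`, `optimise`) are hypothetical, and the feedback and rewriting steps, which the paper delegates to LLMs, are stubbed out with trivial string heuristics.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of per-turn feedback (the 'experience replay' part)."""
    def __init__(self, capacity=100):
        self.buffer = deque(maxlen=capacity)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, k=4):
        # Sample up to k past experiences to condition the prompt rewrite on.
        return random.sample(list(self.buffer), min(k, len(self.buffer)))

def critique_turn(turn, goal):
    # Placeholder feedback generator: the paper uses an LLM to judge whether
    # the agent's turn stayed consistent with the user's goal. Here we use a
    # trivial substring check purely for illustration.
    on_goal = goal in turn
    return {"turn": turn, "feedback": "on-goal" if on_goal else "drifted"}

def rewrite_prompt(prompt, experiences):
    # Placeholder meta-prompting step: the paper uses an LLM to rewrite the
    # task instruction from sampled feedback; here we just append a reminder
    # whenever any sampled experience reports goal drift.
    if any(e["feedback"] == "drifted" for e in experiences):
        prompt += " Keep tracking the user's stated goal across turns."
    return prompt

def optimise(prompt, turns, goal, epochs=2):
    """Parameter-free loop: only the instruction prompt changes, never weights."""
    buffer = ReplayBuffer()
    for _ in range(epochs):
        for turn in turns:                     # turn-by-turn feedback generation
            buffer.add(critique_turn(turn, goal))
        prompt = rewrite_prompt(prompt, buffer.sample())   # experience replay
    return prompt
```

For example, `optimise("Answer the user.", ["book me a flight", "unrelated chit-chat"], goal="flight")` would detect drift on the second turn and return a prompt augmented with the goal-tracking reminder; in the actual framework, both the critique and the rewrite would be produced by LLM calls.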