🤖 AI Summary
To address the challenges of domain adaptation and poor few-shot generalization of large language models (LLMs) in sequential decision-making tasks, this paper proposes a memory-driven self-improvement framework. The method establishes a bidirectional reinforcement mechanism between the LLM's prior knowledge and a compact experience memory of interaction trajectories and Q-values, enabling continual improvement of decision-making capability through value-aware policy iteration, dynamic experience replay, and closed-loop trajectory optimization. Its core contribution is the tight integration of symbolic reasoning with numerical value estimation, yielding an interpretable and evolvable decision-learning paradigm. Evaluated on the ALFWorld benchmark, the approach improves in-distribution task accuracy by over 40% and generalization to unseen tasks by over 75%, significantly outperforming both pure reinforcement learning and state-of-the-art LLM-based baselines.
📝 Abstract
Large language models (LLMs) have emerged as effective action policies for sequential decision-making (SDM) tasks due to their extensive prior knowledge. However, this broad yet general knowledge is often insufficient for specific decision-making tasks with limited task-related data, making it difficult to adapt LLMs efficiently to specific SDM tasks. To address this challenge, we propose a memory-driven self-improvement framework that combines the LLM's general prior knowledge with a compact memory of domain-specific experiences. The memory retains past interactions and their associated Q-values, capturing decision-relevant knowledge that enables accurate value estimation and informs refinement of the LLM prior. The refined prior, in turn, generates higher-reward trajectories that further enrich the memory, yielding a natural self-improvement loop in which the memory and the LLM prior reinforce each other. Experiments show that our memory-driven approach significantly outperforms both traditional RL and LLM-based baselines, e.g., improving performance by over 40% on in-distribution tasks and by over 75% when generalizing to unseen tasks in ALFWorld.
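
To make the loop concrete, here is a minimal, hypothetical Python sketch of the memory-and-prior interaction described in the abstract. The environment interface (`env.reset`, `env.step`), the retrieval and value-estimation heuristics, and the `llm_propose_action` stand-in are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the memory-driven self-improvement loop.
# All names and interfaces below are assumptions for illustration only.
import random


class ExperienceMemory:
    """Compact store of past interactions and their estimated Q-values."""

    def __init__(self):
        self.entries = []  # each entry: (state, action, reward, q_value)

    def add(self, state, action, reward, q_value):
        self.entries.append((state, action, reward, q_value))

    def retrieve(self, state, k=5):
        # Placeholder retrieval: a real system would use similarity search
        # over states; here we simply sample stored experiences.
        return random.sample(self.entries, min(k, len(self.entries)))


def llm_propose_action(state, retrieved, default_action="look"):
    # Stand-in for the LLM prior: prefer the highest-Q action among the
    # retrieved experiences; a real system would prompt the LLM with them.
    if retrieved:
        _, action, _, _ = max(retrieved, key=lambda e: e[3])
        return action
    return default_action


def estimate_q(state, action, reward, next_state, memory, gamma=0.99):
    # Assumed value estimate: one-step return bootstrapped from the best
    # stored Q-value for the next state (a simplification of value-aware
    # policy iteration).
    future = [q for (s, _, _, q) in memory.entries if s == next_state]
    return reward + gamma * (max(future) if future else 0.0)


def self_improvement_episode(env, memory):
    """One episode: the prior acts, values are estimated, memory is enriched."""
    state, done = env.reset(), False
    while not done:
        retrieved = memory.retrieve(state)
        action = llm_propose_action(state, retrieved)
        next_state, reward, done = env.step(action)  # assumed env interface
        q = estimate_q(state, action, reward, next_state, memory)
        memory.add(state, action, reward, q)  # memory refines the prior...
        state = next_state                    # ...and the prior refills memory
```

The sketch only illustrates the mutual-reinforcement structure (memory informs action selection; new trajectories and Q-values enrich memory); the retrieval, value-update, and prior-refinement details of the actual framework differ.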