🤖 AI Summary
This work addresses the limited in-context learning capabilities of current large language models (LLMs) in sequential decision-making tasks characterized by partial observability and model ambiguity. The authors combine supervised fine-tuning with in-context learning: pretrained LLMs are fine-tuned on offline, oracle-labeled trajectories so that they can imitate a policy from a few in-context examples. For linear MDPs, the theoretical analysis interprets a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data, which yields an end-to-end suboptimality bound for the induced policy. Empirically, across synthetic MDP, POMDP, and APOMDP environments, the fine-tuned models achieve substantially smaller optimality gaps than in-context-only and random baselines, with the largest gains in long-horizon, partially observable, and model-ambiguous settings.
📝 Abstract
Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities, yet their potential for sequential decision-making remains underexplored. In this paper, we study the ICL capabilities of LLMs in sequential decision-making settings, including Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), and Ambiguous POMDPs (APOMDPs). We fine-tune pretrained LLMs to perform few-shot decision-making directly from offline, oracle-labeled trajectories. Our framework enables flexible imitation of policies through supervised fine-tuning (SFT). Theoretically, we focus on linear MDPs and interpret a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data. Building on this interpretation, we derive an end-to-end suboptimality bound for the induced policy that separates the in-context estimation error from the training-length bias. Empirically, across synthetic MDP, POMDP, and APOMDP settings, we find that fine-tuned LLMs achieve substantially smaller optimality gaps than in-context-only and random baselines, with especially large gains in longer-horizon, partially observed, and model-ambiguous environments. Together, these results show that supervised fine-tuning provides an effective route to endowing pretrained LLMs with sequential decision-making capabilities from offline data, which is an important advantage in domains such as healthcare where offline data are abundant.
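To make the attention-as-Q-estimation reading more concrete, the sketch below illustrates a standard identity rather than the paper's own construction: a softmax-free (linear) attention readout over in-context pairs of features and Q-value labels coincides with the prediction of a linear Q-function after one gradient-descent step from zero. The feature map, dimensions, noise level, and learning rate are all hypothetical choices made for illustration, and a linear MDP is assumed only in the sense that Q*(s, a) is linear in the features.

```python
import random

random.seed(0)
d, n = 4, 64  # feature dimension and number of in-context examples (assumed)


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical ground truth: Q*(s, a) = <phi(s, a), w_star>.
w_star = [random.gauss(0, 1) for _ in range(d)]

# In-context data: features phi(s_i, a_i) with noisy oracle Q-value labels.
phi_ctx = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
q_ctx = [dot(phi, w_star) + 0.01 * random.gauss(0, 1) for phi in phi_ctx]

# Features of three candidate actions at the query state.
phi_query = [[random.gauss(0, 1) for _ in range(d)] for _ in range(3)]


def linear_attention(queries, keys, values, scale):
    """Softmax-free attention: scale * sum_i <query, key_i> * value_i."""
    return [scale * sum(dot(q, k) * v for k, v in zip(keys, values))
            for q in queries]


def one_step_gd_prediction(features, targets, queries, lr):
    """Predict with w after one GD step from w = 0 on 0.5 * mean squared error."""
    n_ctx = len(targets)
    dim = len(features[0])
    w = [lr / n_ctx * sum(phi[j] * y for phi, y in zip(features, targets))
         for j in range(dim)]
    return [dot(q, w) for q in queries]


lr = 0.5
q_attn = linear_attention(phi_query, phi_ctx, q_ctx, scale=lr / n)
q_gd = one_step_gd_prediction(phi_ctx, q_ctx, phi_query, lr)

# Acting greedily on the attention readout gives the induced policy's action.
greedy_action = max(range(len(q_attn)), key=lambda a: q_attn[a])
```

Here `q_attn` and `q_gd` agree up to floating-point error, which is the sense in which an attention layer can act as an implicit Q-estimator over in-context data. A trained layer would learn its own key, query, and value projections, so this fixed-weight version should be read only as intuition.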