Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

📅 2026-05-09
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limited in-context learning (ICL) ability of current large language models (LLMs) in sequential decision-making, covering Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), and Ambiguous POMDPs (APOMDPs). The authors apply supervised fine-tuning (SFT) to pretrained LLMs on offline, oracle-labeled trajectories so that the model imitates a policy from a few in-context examples. For linear MDPs, the theoretical analysis interprets a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data and derives an end-to-end suboptimality bound that separates in-context estimation error from training-length bias. Across synthetic MDP, POMDP, and APOMDP environments, fine-tuned LLMs achieve substantially smaller optimality gaps than in-context-only and random baselines, with the largest gains in long-horizon, partially observable, and model-ambiguous settings.
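For reference, the terms used in the theoretical claim can be written out as follows. This is the textbook linear MDP setup in generic notation; the symbols (the feature map \phi, the parameters \mu_h, \theta_h, w_h^{*}, and the name SubOpt) are assumptions made here for illustration and are not taken from the paper, whose exact bound is not reproduced.

```latex
% Standard finite-horizon linear MDP with a known feature map
% \phi : S \times A \to \mathbb{R}^d (generic notation, assumed here).
\begin{align*}
P_h(s' \mid s, a) &= \langle \phi(s, a), \mu_h(s') \rangle, \\
r_h(s, a) &= \langle \phi(s, a), \theta_h \rangle, \\
Q_h^{*}(s, a) &= r_h(s, a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}\bigl[ V_{h+1}^{*}(s') \bigr]
               = \langle \phi(s, a), w_h^{*} \rangle, \\
w_h^{*} &= \theta_h + \int V_{h+1}^{*}(s') \, \mu_h(s') \, \mathrm{d}s', \\
\mathrm{SubOpt}(\hat{\pi}) &= V_1^{*}(s_1) - V_1^{\hat{\pi}}(s_1).
\end{align*}
```

Because the optimal Q-function is linear in the features, estimating it from in-context data reduces to estimating the vectors w_h^{*}, which is the quantity an attention layer would have to approximate under this reading.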
📝 Abstract
Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities, yet their potential for sequential decision-making remains underexplored. In this paper, we study the ICL capabilities of LLMs in sequential decision-making settings, including Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), and Ambiguous POMDPs (APOMDPs). We fine-tune pretrained LLMs to perform few-shot decision-making directly from offline, oracle-labeled trajectories. Our framework enables flexible imitation of policies through supervised fine-tuning (SFT). Theoretically, we focus on linear MDPs and interpret a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data. Building on this interpretation, we derive an end-to-end suboptimality bound for the induced policy that separates the in-context estimation error from the training-length bias. Empirically, across synthetic MDP, POMDP, and APOMDP settings, we find that fine-tuned LLMs achieve substantially smaller optimality gaps than in-context-only and random baselines, with especially large gains in longer-horizon, partially observed, and model-ambiguous environments. Together, these results show that supervised fine-tuning provides an effective route to endowing pretrained LLMs with sequential decision-making capabilities from offline data, which is an important advantage in domains such as healthcare where offline data are abundant.
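The data pipeline and the evaluation metric described in the abstract can be illustrated on a toy problem. The sketch below is a minimal illustration, not the authors' implementation: the two-state MDP, the text format of the prompts, and all function names are hypothetical, and it assumes the offline trajectories follow the oracle's own actions. It computes optimal Q-values by backward induction, uses them to label a trajectory, turns the result into a (prompt, target) pair of the kind SFT could train on, and measures the optimality gap of a uniformly random baseline.

```python
import random

def optimal_q(P, R, H):
    """Backward induction. Returns Q[h][s][a] for steps h = 0..H-1."""
    n_s, n_a = len(R), len(R[0])
    V = [0.0] * n_s  # value after the final step is zero
    Q = []
    for _ in range(H):
        Qh = [[R[s][a] + sum(p * V[t] for t, p in enumerate(P[s][a]))
               for a in range(n_a)] for s in range(n_s)]
        V = [max(row) for row in Qh]
        Q.insert(0, Qh)  # earlier steps go to the front
    return Q

def oracle_action(Q, h, s):
    """Greedy action under the optimal Q-values (first index on ties)."""
    row = Q[h][s]
    return max(range(len(row)), key=row.__getitem__)

def rollout(P, R, Q, H, s0, rng):
    """One trajectory of (state, oracle action, reward) triples."""
    s, steps = s0, []
    for h in range(H):
        a = oracle_action(Q, h, s)
        steps.append((s, a, R[s][a]))
        s = rng.choices(range(len(P)), weights=P[s][a])[0]
    return steps

def sft_pair(demos, Q, h, s):
    """Hypothetical text format: demonstrations, then a query state."""
    shots = "\n".join(
        " ".join(f"s{s_} a{a_} r{r_:g}" for s_, a_, r_ in traj)
        for traj in demos)
    prompt = f"{shots}\nstep {h} s{s} ->"
    target = f"a{oracle_action(Q, h, s)}"
    return prompt, target

def uniform_value(P, R, H, s0):
    """Exact expected return of the uniformly random baseline."""
    n_s, n_a = len(R), len(R[0])
    V = [0.0] * n_s
    for _ in range(H):
        V = [sum(R[s][a] + sum(p * V[t] for t, p in enumerate(P[s][a]))
                 for a in range(n_a)) / n_a for s in range(n_s)]
    return V[s0]

# Made-up toy MDP: P[s][a] is the next-state distribution, R[s][a] the reward.
# Only staying in state 1 (action 0) pays a reward.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 0.0]]
H, s0 = 3, 0

Q = optimal_q(P, R, H)
demo = rollout(P, R, Q, H, s0, random.Random(0))
prompt, target = sft_pair([demo], Q, h=0, s=s0)
gap = max(Q[0][s0]) - uniform_value(P, R, H, s0)
print(prompt, target)
print("optimality gap of random baseline:", gap)
```

In an actual experiment the target would come from the same oracle but the policy being evaluated would be the fine-tuned LLM, and its optimality gap would be compared with the random baseline's gap computed here.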
Problem

Research questions and friction points this paper is trying to address.

Sequential Decision-Making
In-Context Learning
Large Language Models
Markov Decision Processes
Offline Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised Fine-Tuning
In-Context Learning
Sequential Decision-Making
Offline Reinforcement Learning
Large Language Models