Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited in-context learning (ICL) ability of current large language models (LLMs) on sequential decision-making tasks, including settings with partial observability and model ambiguity. The authors combine supervised fine-tuning (SFT) with ICL: pretrained LLMs are fine-tuned on offline, oracle-labeled trajectories so that they imitate the labeled policy from a few in-context examples. For linear MDPs, the theoretical analysis interprets a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data, which yields an end-to-end suboptimality bound that separates in-context estimation error from training-length bias. Empirically, across synthetic MDP, POMDP, and APOMDP environments, the fine-tuned models achieve substantially smaller optimality gaps than in-context-only and random baselines, with the largest gains in long-horizon, partially observable, and model-ambiguous settings.
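The summary measures performance as a gap to the optimal policy and compares against a random baseline. The sketch below illustrates that metric on a tiny tabular MDP. The two-state environment, the horizon, and the function names `optimal_value` and `uniform_random_value` are all hypothetical and are not taken from the paper.

```python
def optimal_value(P, R, horizon, start):
    """Finite-horizon backward induction: value of the best policy from `start`.

    P[s][a] is a list of next-state probabilities, R[s][a] is the reward.
    """
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states  # value once no steps remain
    for _ in range(horizon):
        V = [
            max(
                R[s][a] + sum(p * V[s2] for s2, p in enumerate(P[s][a]))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
    return V[start]


def uniform_random_value(P, R, horizon, start):
    """Exact value of the policy that picks every action with equal probability."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    for _ in range(horizon):
        V = [
            sum(
                R[s][a] + sum(p * V[s2] for s2, p in enumerate(P[s][a]))
                for a in range(n_actions)
            ) / n_actions
            for s in range(n_states)
        ]
    return V[start]


# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1.
# Only taking action 1 while already in state 1 pays a reward of 1.
P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 0.0], [0.0, 1.0]]

gap = optimal_value(P, R, horizon=3, start=0) - uniform_random_value(P, R, horizon=3, start=0)
print(f"optimality gap of the random baseline: {gap}")
```

A learned policy would be scored the same way: its value under the true dynamics is subtracted from the optimal value, so a smaller gap means behavior closer to optimal.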

📝 Abstract

Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities, yet their potential for sequential decision-making remains underexplored. In this paper, we study the ICL capabilities of LLMs in sequential decision-making settings, including Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), and Ambiguous POMDPs (APOMDPs). We fine-tune pretrained LLMs to perform few-shot decision-making directly from offline, oracle-labeled trajectories. Our framework enables flexible imitation of policies through supervised fine-tuning (SFT). Theoretically, we focus on linear MDPs and interpret a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data. Building on this interpretation, we derive an end-to-end suboptimality bound for the induced policy that separates the in-context estimation error from the training-length bias. Empirically, across synthetic MDP, POMDP, and APOMDP settings, we find that fine-tuned LLMs achieve substantially smaller optimality gaps than in-context-only and random baselines, with especially large gains in longer-horizon, partially observed, and model-ambiguous environments. Together, these results show that supervised fine-tuning provides an effective route to endowing pretrained LLMs with sequential decision-making capabilities from offline data, which is an important advantage in domains such as healthcare where offline data are abundant.
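One way to picture fine-tuning "directly from offline, oracle-labeled trajectories" is as a set of prompt and completion pairs: a few past episodes go into the prompt and the oracle's action for the current observation is the target. The sketch below assumes a plain-text prompt layout; the layout and the names `format_trajectory` and `build_sft_example` are hypothetical, since the abstract does not specify the serialization used.

```python
def format_trajectory(trajectory):
    """Render one trajectory, a list of (observation, action, reward) tuples, as text."""
    return "\n".join(
        f"obs={obs} action={act} reward={rew}" for obs, act, rew in trajectory
    )


def build_sft_example(context_trajectories, query_observation, oracle_action):
    """Build one supervised example (assumed format, not the paper's).

    The prompt holds the in-context episodes followed by the query observation;
    the completion is the oracle-labeled action the model should imitate.
    """
    blocks = [
        f"### Episode {i + 1}\n{format_trajectory(traj)}"
        for i, traj in enumerate(context_trajectories)
    ]
    blocks.append(f"### Query\nobs={query_observation} action=")
    return {"prompt": "\n\n".join(blocks), "completion": str(oracle_action)}


# Hypothetical offline data: one short episode plus an oracle label for the query.
example = build_sft_example(
    context_trajectories=[[(0, 1, 0.0), (1, 1, 1.0)]],
    query_observation=1,
    oracle_action=1,
)
print(example["prompt"])
print(example["completion"])
```

Under this framing, the supervised loss would typically be computed on the completion tokens only, so the model learns to map in-context episodes to the labeled action rather than to reproduce the prompt.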

Problem

Research questions and friction points this paper is trying to address.

Sequential Decision-Making

In-Context Learning

Large Language Models

Markov Decision Processes

Offline Data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised Fine-Tuning

In-Context Learning

Sequential Decision-Making