Scalable In-Context Q-Learning

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Weak generalization and poor robustness to suboptimal trajectories remain key challenges in in-context reinforcement learning. To address these, this paper proposes Scalable In-Context Q-Learning (SICQL), a framework that unifies dynamic programming with a generalized world model. SICQL introduces a prompt-based multi-head Transformer architecture that explicitly decouples the policy head from the in-context value-function head. It further fits a state value function to an upper expectile of the Q-function and distills the in-context value functions into policy extraction via advantage-weighted regression, improving inference accuracy under suboptimal data. Combined with world-model pretraining for compact prompt construction, SICQL achieves substantial improvements over state-of-the-art baselines on both discrete and continuous control benchmarks, with the largest gains in scenarios dominated by suboptimal trajectories. The implementation is publicly available.

📝 Abstract
Recent advancements in language models have demonstrated remarkable in-context learning abilities, prompting the exploration of in-context reinforcement learning (ICRL) to extend the promise to decision domains. Due to involving more complex dynamics and temporal correlations, existing ICRL approaches may face challenges in learning from suboptimal trajectories and achieving precise in-context inference. In this paper, we propose **S**calable **I**n-**C**ontext **Q**-**L**earning (**SICQL**), an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt-based multi-head transformer architecture that simultaneously predicts optimal policies and in-context value functions using separate heads. We pretrain a generalized world model to capture task-relevant information, enabling the construction of a compact prompt that facilitates fast and precise in-context inference. During training, we perform iterative policy improvement by fitting a state value function to an upper expectile of the Q-function, and distill the in-context value functions into policy extraction using advantage-weighted regression. Extensive experiments across a range of discrete and continuous environments show consistent performance gains over various types of baselines, especially when learning from suboptimal data. Our code is available at https://github.com/NJU-RL/SICQL
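The two training components described in the abstract — fitting a state value function to an upper expectile of the Q-function, and advantage-weighted policy distillation — can be sketched as standalone loss/weight functions. The following is an illustrative NumPy sketch, not the paper's implementation; the hyperparameters `tau`, `beta`, and `clip` are placeholder values:

```python
import numpy as np

def expectile_loss(q, v, tau=0.9):
    """Asymmetric L2 loss for expectile regression.

    With tau > 0.5, positive errors (q > v) are penalized more heavily,
    so the fitted v tracks an upper expectile of the Q-value distribution.
    """
    diff = q - v
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff**2))

def awr_weights(q, v, beta=3.0, clip=100.0):
    """Advantage-weighted regression weights exp(A / beta), clipped for
    numerical stability; actions with larger advantage are upweighted in
    the policy's supervised (distillation) loss."""
    adv = q - v
    return np.minimum(np.exp(adv / beta), clip)

# Toy values: three (state, action) samples with Q-estimates vs. V(s) = 0.
q = np.array([2.0, -1.0, 0.5])
v = np.zeros(3)
upper = expectile_loss(q, v, tau=0.9)  # emphasizes q > v errors
sym = expectile_loss(q, v, tau=0.5)    # symmetric (half) MSE for comparison
w = awr_weights(q, v)
```

With `tau = 0.5` the loss reduces to half the mean squared error; pushing `tau` toward 1 biases the value estimate upward, which is what lets the method extract near-optimal behavior from datasets dominated by suboptimal trajectories.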
Problem

Research questions and friction points this paper is trying to address.

Extending in-context learning from language models to reinforcement learning decision domains
Learning reliably from suboptimal trajectories
Achieving precise in-context inference for task generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic programming combined with a pretrained generalized world model
Prompt-based multi-head transformer that decouples policy and value heads
Iterative policy improvement via upper-expectile value fitting and advantage-weighted regression
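The decoupled-heads idea above can be illustrated with a toy forward pass: a shared representation feeds two separate linear heads, one producing a policy over discrete actions and one producing a scalar value estimate. This is a minimal sketch with random placeholder weights; in SICQL the shared features would come from the prompt-conditioned transformer trunk:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_actions = 16, 4

# Placeholder for the shared trunk output of one token (in the real model,
# this is produced by the prompt-based transformer).
h = rng.standard_normal(d_model)

# Separate heads: the policy network is decoupled from the value estimator.
W_pi = rng.standard_normal((n_actions, d_model)) * 0.1  # policy head
W_v = rng.standard_normal((1, d_model)) * 0.1           # value head

logits = W_pi @ h
policy = np.exp(logits - logits.max())
policy /= policy.sum()          # softmax over discrete actions
value = float(W_v @ h)          # scalar in-context value estimate V(s)
```

Because the heads share the trunk but have independent parameters, the value head can be trained with the expectile objective while the policy head is trained with advantage-weighted regression, without the two losses interfering at the output layer.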
Jinmei Liu
Department of Control and Systems Engineering, Nanjing University, Nanjing, China
Fuhong Liu
Department of Control and Systems Engineering, Nanjing University, Nanjing, China
Jianye Hao
Huawei Noah's Ark Lab/Tianjin University
Multiagent Systems, Embodied AI
Bo Wang
Department of Control and Systems Engineering, Nanjing University, Nanjing, China
Huaxiong Li
Nanjing University
Machine Learning, Data Mining, Pattern Recognition, Computer Vision
Chunlin Chen
Nanjing University
Reinforcement Learning, Quantum Control, Mobile Robotics
Z
Zhi Wang
Department of Control and Systems Engineering, Nanjing University, Nanjing, China