AI Summary
This work addresses a key limitation of large language models (LLMs): they struggle to leverage in-context interaction experience for learning in interactive online decision-making tasks where feedback is delayed. To overcome this, the authors propose ORBIT, a novel framework that, for the first time, brings cross-episode meta-reinforcement learning into LLM training. Through multi-task, multi-episode meta-training, ORBIT enables models to learn efficiently online at inference time, using only their interaction history as context. Experiments with Qwen3-14B show that this approach substantially improves adaptation and decision-making in unseen environments. Notably, a small open-source model trained with ORBIT matches the performance of GPT-5.2 and outperforms conventional reinforcement learning fine-tuning by a large margin, with gains that scale consistently with model size.
Abstract
Large language models (LLMs) achieve strong performance when all task-relevant information is available upfront, as in static prediction and instruction-following problems. However, many real-world decision-making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in-context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in-context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction in context. After meta-training, a relatively small open-source model (Qwen3-14B) demonstrates substantially improved in-context online learning on entirely unseen environments, matching the performance of GPT-5.2 and outperforming standard RL fine-tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn-at-inference-time decision-making agents. Code reproducing the results in the paper can be found at https://github.com/XiaofengLin7/ORBIT.
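The cross-episode, in-context online-learning loop the abstract describes can be sketched in miniature as follows. This is an illustrative stand-in, not the paper's implementation: the toy bandit environment and the history-conditioned heuristic policy below merely play the roles of ORBIT's decision-making environments and its meta-trained LLM policy, which conditions on the same kind of accumulated interaction history.

```python
import random


def make_bandit(num_arms=3, seed=0):
    """Toy multi-armed bandit standing in for an online decision task
    (illustrative; the paper's environments are different)."""
    rng = random.Random(seed)
    probs = [rng.random() for _ in range(num_arms)]

    def step(arm):
        # Delayed-value, stochastic feedback: a Bernoulli reward per pull.
        return 1.0 if rng.random() < probs[arm] else 0.0

    return step


def policy_from_history(history, num_arms, rng):
    """Stand-in for the LLM policy: pick the arm with the best empirical
    mean in the in-context history, exploring untried arms first.
    In ORBIT the policy is a meta-trained LLM reading this history."""
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for arm, reward in history:
        counts[arm] += 1
        sums[arm] += reward
    untried = [a for a in range(num_arms) if counts[a] == 0]
    if untried:
        return rng.choice(untried)
    return max(range(num_arms), key=lambda a: sums[a] / counts[a])


def run_in_context(num_arms=3, episodes=5, horizon=20, seed=0):
    """Cross-episode loop: the interaction history persists across
    episodes, so later episodes benefit from earlier experience.
    Learning happens purely in context -- no weight updates."""
    rng = random.Random(seed)
    step = make_bandit(num_arms, seed)
    history = []  # (action, reward) pairs accumulated across episodes
    returns = []
    for _ in range(episodes):
        total = 0.0
        for _ in range(horizon):
            arm = policy_from_history(history, num_arms, rng)
            reward = step(arm)
            history.append((arm, reward))
            total += reward
        returns.append(total)
    return returns
```

The design point the sketch isolates is that the only thing carried between episodes is the history buffer; ORBIT's meta-training optimizes the policy so that conditioning on such a buffer yields effective exploration early and exploitation later.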