🤖 AI Summary
This study investigates how well large language models (LLMs) can perform online in-context reinforcement learning (ICRL) without explicit answer supervision, relying solely on reward signals from the environment, and where this ability breaks down. The authors propose a context-based ICRL framework that combines reward feedback on semantic or abstract labels with dynamic prompt construction, and systematically evaluate implicit contextual bandit learning across model scales from 500M to 70B parameters. The key contributions are threefold: (1) empirical evidence that LLMs can adapt online to classification tasks without fine-tuning; (2) a demonstration that this ability improves with model size, yet suffers from training instability caused by accumulated errors; (3) a stabilization strategy that mitigates reward sparsity and policy oscillation, enabling efficient online optimization on multiple challenging classification benchmarks. These findings suggest that LLMs can operate as implicit reinforcement learning agents.
📝 Abstract
Large Language Models (LLMs) excel at in-context learning (ICL), a supervised learning technique that relies on adding annotated examples to the model context. We investigate a contextual bandit version of in-context reinforcement learning (ICRL), where models learn in-context, online, from external reward instead of supervised data. We show that LLMs can learn effectively in this setting, and provide a detailed study of the phenomenon, experimenting with challenging classification tasks and models ranging from 500M to 70B parameters. This includes identifying and addressing the instability of the process, demonstrating learning with both semantic and abstract labels, and showing scaling trends. Our findings highlight ICRL capabilities in LLMs, while also underscoring fundamental limitations in their implicit reasoning about errors.
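
To make the contextual bandit setting concrete, the following is a minimal sketch of an online ICRL loop, not the authors' implementation. It assumes a plain-text prompt format and a binary exact-match reward, neither of which is specified in the abstract. All names (`build_prompt`, `icrl_loop`, `exact_match`, `make_toy_model`) are hypothetical, and the toy model is a stand-in for an LLM call so the example runs without any external service.

```python
import random
from typing import Callable, List, Tuple

# One past episode: (input text, label the model predicted, reward received).
Episode = Tuple[str, str, float]


def build_prompt(history: List[Episode], query: str) -> str:
    """Format past episodes and the new query as one text context.

    The layout is an assumption for illustration, not the paper's format.
    """
    blocks = [
        f"Input: {text}\nPrediction: {label}\nReward: {reward}"
        for text, label, reward in history
    ]
    blocks.append(f"Input: {query}\nPrediction:")
    return "\n\n".join(blocks)


def exact_match(prediction: str, gold: str) -> float:
    """Binary reward: 1.0 for a correct label, 0.0 otherwise (assumed)."""
    return 1.0 if prediction == gold else 0.0


def icrl_loop(
    model: Callable[[str], str],
    stream: List[Tuple[str, str]],
    reward_fn: Callable[[str, str], float],
) -> List[float]:
    """Run one online pass over the stream and return per-step rewards.

    The gold label is used only to compute the reward; the model sees
    its own prediction and the reward, never the correct answer.
    """
    history: List[Episode] = []
    rewards: List[float] = []
    for text, gold in stream:
        prediction = model(build_prompt(history, text))
        reward = reward_fn(prediction, gold)
        history.append((text, prediction, reward))
        rewards.append(reward)
    return rewards


def make_toy_model(labels: List[str], seed: int = 0) -> Callable[[str], str]:
    """Stand-in for an LLM: reuse a rewarded label for a repeated input,
    otherwise guess at random. Inputs must not contain newlines."""
    rng = random.Random(seed)

    def model(prompt: str) -> str:
        *past, query_block = prompt.split("\n\n")
        query_line = query_block.split("\n")[0]
        for block in past:
            input_line, prediction_line, reward_line = block.split("\n")
            if input_line == query_line and reward_line == "Reward: 1.0":
                return prediction_line[len("Prediction: "):]
        return rng.choice(labels)

    return model


if __name__ == "__main__":
    stream = [("great movie", "positive"), ("awful plot", "negative")] * 10
    model = make_toy_model(["positive", "negative"])
    print(icrl_loop(model, stream, exact_match))
```

In this sketch the context grows by one episode per step, so a real model would eventually need some form of context selection or truncation; how that is handled is left open here.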