🤖 AI Summary
Current large language models heavily rely on reward-driven reinforcement learning, which constrains their flexibility and generalization capabilities. Inspired by the psychological theory of latent learning, this work proposes a two-stage post-training paradigm: first, the model explores in a reward-free environment to construct task-relevant knowledge representations; subsequently, reward signals are introduced to optimize performance. This study presents the first validation and application of latent learning mechanisms in large language models. Extensive experiments across multiple models and tasks demonstrate that the proposed approach consistently outperforms baselines that depend entirely on reward-based reinforcement learning throughout training, confirming the effectiveness and broad applicability of latent learning in enhancing model performance.
📝 Abstract
Latent learning, classically theorized by Tolman, shows that biological agents (e.g., rats) can acquire internal representations of their environment without rewards, enabling rapid adaptation once rewards are introduced. Viewed from this cognitive science perspective, pure reward learning remains overly dependent on external feedback, limiting flexibility and generalization. Although recent advances in the reasoning capabilities of large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, mark a significant breakthrough, these models still rely primarily on reward-centric reinforcement learning paradigms. Whether and how the well-established psychological phenomenon of latent learning can inform, or emerge within, LLM training remains largely unexplored. In this work, we present novel experimental findings showing that LLMs also exhibit latent learning dynamics. During an initial phase of unrewarded exploration, LLMs display modest performance improvements, as this phase allows them to organize task-relevant knowledge without being constrained by reward-driven biases; performance is further enhanced once rewards are introduced. LLMs post-trained under this two-stage exploration regime ultimately achieve higher competence than those post-trained with reward-based reinforcement learning throughout. Beyond these empirical observations, we also provide theoretical analyses explaining why unrewarded exploration yields performance gains, offering a mechanistic account of these dynamics. To establish the existence of latent learning dynamics in LLMs, we conducted extensive experiments across multiple model families and diverse task domains.
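The two-stage schedule described above can be illustrated with a deliberately simplified toy sketch (this is not the paper's code; the bandit-style environment, the function name `two_stage_post_training`, and the update rule are all illustrative assumptions). Phase 1 accumulates experience statistics with no reward signal; phase 2 switches to reward-weighted updates over the same action space.

```python
# Hypothetical toy sketch of a two-stage post-training schedule:
# phase 1 is reward-free exploration, phase 2 is reward-driven learning.
# None of this mirrors the paper's actual training setup.
import random

def two_stage_post_training(total_steps, explore_steps, reward_fn, seed=0):
    """Phase 1 (steps < explore_steps): collect visit statistics without
    reward. Phase 2: apply simple reward-weighted value updates.
    Returns (exploration counts, learned action values)."""
    rng = random.Random(seed)
    actions = ["a", "b", "c"]
    counts = {a: 0 for a in actions}    # phase-1 "representation": visit counts
    values = {a: 0.0 for a in actions}  # phase-2 reward-shaped values
    for step in range(total_steps):
        action = rng.choice(actions)
        if step < explore_steps:
            counts[action] += 1  # reward-free: only build experience statistics
        else:
            r = reward_fn(action)
            values[action] += 0.1 * (r - values[action])  # reward-driven update
    return counts, values
```

The single switch point `explore_steps` is the only mechanism the sketch adds over reward-only training: setting it to 0 recovers the fully reward-driven baseline the abstract compares against.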