🤖 AI Summary
This work investigates whether large language models (LLMs) can acquire strategic reasoning capabilities in chess via reinforcement learning (RL). It identifies an inherent limitation: pretrained LLMs exhibit fundamental deficits in strategic understanding that severely constrain RL performance gains. To address this, the authors propose a knowledge-distillation-based dense reward mechanism built on a chess-pretrained action-value network, which distills the judgment of high-accuracy chess engines into fine-grained, action-level feedback and is integrated with supervised fine-tuning and RL policy optimization. Experiments show that dense rewards substantially outperform sparse rewards; nevertheless, all LLM variants plateau far below expert level, exposing a fundamental bottleneck of the pretraining paradigm in deep strategic modeling. The core contributions are (1) the first systematic empirical characterization of the RL plasticity boundary for strategic reasoning in LLMs, and (2) a transferable, distillation-augmented RL framework for strategic skill acquisition.
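The dense-reward idea can be illustrated with a small sketch (not the paper's code; the function names and the stubbed action-value table are hypothetical): the chosen move is scored by an action-value network, and the reward is its value relative to the best legal move, giving a graded signal instead of a binary win/loss outcome.

```python
# Illustrative sketch of a distillation-based dense reward, assuming an
# action-value network that maps (position, move) -> estimated win
# probability. q_value is a stand-in stub; in the paper's setting it
# would be a chess-pretrained network distilled from strong engines.

def q_value(fen: str, move: str) -> float:
    """Hypothetical action-value estimate in [0, 1] for playing
    `move` from position `fen`. Stubbed with a tiny lookup table."""
    table = {
        ("start", "e2e4"): 0.55,  # strong opening move
        ("start", "a2a3"): 0.48,  # passive move
        ("start", "f2f3"): 0.40,  # weakening move
    }
    return table.get((fen, move), 0.5)

def dense_reward(fen: str, move: str, legal_moves: list[str]) -> float:
    """Reward = Q of the chosen move minus the best available Q:
    the optimal move earns 0 and inferior moves earn a graded
    negative signal (contrast with a sparse game-outcome reward)."""
    best_q = max(q_value(fen, m) for m in legal_moves)
    return q_value(fen, move) - best_q

legal = ["e2e4", "a2a3", "f2f3"]
print(dense_reward("start", "e2e4", legal))  # 0.0 for the best move
print(dense_reward("start", "f2f3", legal))  # negative for a weak move
```

Because every sampled move receives feedback, the RL policy gradient gets a signal on each action rather than only at the end of a game, which is the intuition behind dense rewards outperforming sparse ones here.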
📝 Abstract
While reinforcement learning (RL) for large language models (LLMs) has shown promise in mathematical reasoning, strategic reasoning for LLMs using RL remains largely unexplored. We investigate whether LLMs can develop strategic reasoning capabilities through RL in chess. To this end, we leverage a chess-pretrained action-value network to provide dense rewards on the quality of the LLM's output moves, which can be seen as a form of knowledge distillation. Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. Surprisingly, however, all models plateau far below expert levels. We provide SFT and RL ablations on chess reasoning training and find evidence that this limitation stems from a deficit in the pretrained models' internal understanding of chess, a deficit which RL alone may not be able to fully overcome.