Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited reasoning capabilities of language models on inherently challenging tasks such as chess, as well as the misalignment between reasoning and decision-making in reinforcement learning (RL). The authors propose a two-stage training framework: first, supervised fine-tuning (SFT) is employed to investigate how multi-step reasoning trajectories versus single-step optimal actions affect reasoning fidelity; second, RL is applied to refine decision-making. The study finds that training with multi-step trajectories enhances reasoning faithfulness and stabilizes RL while preserving downstream task performance. Moreover, several metrics measured during SFT effectively predict post-RL outcomes. Evaluated on a 7B-parameter model, the approach substantially outperforms existing open-source reasoning models, significantly improving move-quality distributions and reducing hallucination rates. The authors release their model, data, and code publicly.
📝 Abstract
How can you get a language model to reason in a task it natively struggles with? We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically-inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance -- however, the RL step elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and reduces hallucination rates as a side effect. Finally, we find several SFT-checkpoint metrics -- metrics spanning evaluation performance, hallucination rates, and reasoning quality -- to be predictive of post-RL model performance. We release checkpoints and final models as well as training data, evaluations, and code which allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.
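The abstract's two-stage recipe (SFT on move data, then RL refinement of decision-making) can be caricatured with a toy policy. Everything below is invented for illustration — the move set, the reward values, and the update rules are stand-ins, not the authors' method:

```python
# Toy sketch (not the paper's code) of the two-stage framework from the
# abstract: supervised fine-tuning on move data, then RL refinement.
import random

random.seed(0)

# Hypothetical tiny "policy": a preference score per candidate opening move.
policy = {"e4": 0.0, "d4": 0.0, "h4": 0.0}

# Stage 1: SFT stand-in -- push preference mass toward moves that appear
# in (fabricated) training trajectories, one increment per example.
sft_data = ["e4", "d4", "e4", "e4", "d4"]
for move in sft_data:
    policy[move] += 1.0

# Stage 2: RL stand-in -- sample a move with exploration noise, score it
# with an assumed reward (a real system might use an engine evaluation),
# and reinforce the sampled move proportionally to its reward.
def reward(move):
    return {"e4": 1.0, "d4": 0.8, "h4": -1.0}[move]  # assumed values

for _ in range(100):
    move = max(policy, key=lambda m: policy[m] + random.gauss(0, 0.5))
    policy[move] += 0.1 * reward(move)  # REINFORCE-flavored update

best = max(policy, key=policy.get)
```

After SFT the policy already prefers the frequently demonstrated moves; the RL stage then widens the gap between high- and low-reward moves, loosely mirroring the paper's reported positive shift in the move-quality distribution.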
Problem

Research questions and friction points this paper is trying to address.

reasoning
chess
language models
reinforcement learning
hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning evolution
supervised fine-tuning
reinforcement learning
faithful reasoning
hallucination reduction
Lucas Dionisopoulos
University of California, San Diego
Nicklas Majamaki
University of California, San Diego
Prithviraj Ammanabrolu
Assistant Professor, University of California, San Diego
Reinforcement Learning · Natural Language Processing · Interactive Narrative · Knowledge Graphs