🤖 AI Summary
This work addresses the limited reasoning capabilities of language models on inherently challenging tasks such as chess, as well as the misalignment between reasoning and decision-making that arises in reinforcement learning (RL). The authors propose a two-stage training framework: first, supervised fine-tuning (SFT) is used to investigate how training on multi-step reasoning trajectories versus single-step optimal actions affects reasoning fidelity; second, RL is applied to refine decision-making. The study finds that training on multi-step trajectories enhances reasoning faithfulness and stabilizes RL while preserving downstream task performance. Moreover, several metrics measured during SFT are predictive of post-RL outcomes. Applied to a 7B-parameter model, the approach substantially outperforms existing open-source reasoning models in chess, significantly improving the move-quality distribution and reducing hallucination rates. The authors publicly release their model, data, and code.
📝 Abstract
How can you get a language model to reason in a task it natively struggles with? We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance -- however, the RL step elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and, as a side effect, reduces hallucination rates. Finally, we find several SFT-checkpoint metrics -- spanning evaluation performance, hallucination rates, and reasoning quality -- to be predictive of post-RL model performance. We release checkpoints and final models, as well as the training data, evaluations, and code, which allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.
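To make the contrast between the two SFT data formats concrete, here is a minimal, hypothetical sketch of how the two kinds of supervision targets might be constructed: (a) a single-step target containing only the best move, and (b) a multi-step target containing a full move trajectory. The function name, prompt wording, and record layout are illustrative assumptions, not the paper's actual data schema.

```python
def make_sft_example(position, best_move, trajectory, multi_step=False):
    """Build a (prompt, target) pair for supervised fine-tuning.

    position:   a chess position (e.g. a FEN string)
    best_move:  the single optimal move for that position
    trajectory: a multi-move continuation starting with the best move
    multi_step: select between the two supervision formats compared in the paper
    """
    prompt = f"Position (FEN): {position}\nContinuation:"
    if multi_step:
        # Multi-move target: the model is trained to produce a whole line,
        # which the paper finds yields more faithful reasoning and stabler RL.
        target = " ".join(trajectory)
    else:
        # Single-move target: strongest downstream play after RL, but RL can
        # then elicit reasoning inconsistent with the chosen move.
        target = best_move
    return prompt, target


# Toy usage with an illustrative position and line (Ruy Lopez).
pos = "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 4 3"
single = make_sft_example(pos, "Bb5", ["Bb5", "a6", "Ba4"])
multi = make_sft_example(pos, "Bb5", ["Bb5", "a6", "Ba4"], multi_step=True)
```

Under this sketch, `single` supervises only the optimal action while `multi` supervises the full trajectory; the RL stage described above would then be run on top of either SFT checkpoint.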