🤖 AI Summary
This work addresses the limited reasoning capabilities of language models on inherently challenging tasks such as chess, as well as the misalignment between reasoning and decision-making that arises in reinforcement learning (RL). The authors propose a two-stage training framework: first, supervised fine-tuning (SFT) is used to investigate how training on multi-step reasoning trajectories versus single-step optimal actions affects reasoning fidelity; second, RL is applied to refine decision-making. The study finds that training on multi-step trajectories enhances reasoning faithfulness and stabilizes RL while preserving downstream task performance. Moreover, several metrics measured during SFT are predictive of post-RL outcomes. Applied to a 7B-parameter model, the approach substantially outperforms existing open-source reasoning models in chess, significantly improving the move-quality distribution and reducing hallucination rates. The authors publicly release their model, data, and code.
📝 Abstract
How can you get a language model to reason in a task it natively struggles with? We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance -- however, the RL step elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and, as a side effect, reduces hallucination rates. Finally, we find several SFT-checkpoint metrics -- spanning evaluation performance, hallucination rates, and reasoning quality -- to be predictive of post-RL model performance. We release checkpoints and final models, as well as the training data, evaluations, and code, which allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.
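To make the contrast between the two SFT data formats concrete, here is a minimal, hypothetical sketch of how the two kinds of supervision targets might be constructed: (a) a single-step target containing only the best move, and (b) a multi-step target containing a full move trajectory. The function name, prompt wording, and record layout are illustrative assumptions, not the paper's actual data schema.

```python
def make_sft_example(position, best_move, trajectory, multi_step=False):
    """Build a (prompt, target) pair for supervised fine-tuning.

    position:   a chess position (e.g. a FEN string)
    best_move:  the single optimal move for that position
    trajectory: a multi-move continuation starting with the best move
    multi_step: select between the two supervision formats compared in the paper
    """
    prompt = f"Position (FEN): {position}\nContinuation:"
    if multi_step:
        # Multi-move target: the model is trained to produce a whole line,
        # which the paper finds yields more faithful reasoning and stabler RL.
        target = " ".join(trajectory)
    else:
        # Single-move target: strongest downstream play after RL, but RL can
        # then elicit reasoning inconsistent with the chosen move.
        target = best_move
    return prompt, target


# Toy usage with an illustrative position and line (Ruy Lopez).
pos = "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 4 3"
single = make_sft_example(pos, "Bb5", ["Bb5", "a6", "Ba4"])
multi = make_sft_example(pos, "Bb5", ["Bb5", "a6", "Ba4"], multi_step=True)
```

Under this sketch, `single` supervises only the optimal action while `multi` supervises the full trajectory; the RL stage described above would then be run on top of either SFT checkpoint.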