🤖 AI Summary
This study addresses the lack of systematic investigation into the reasoning capabilities of small-scale language models in theoretical physics domains such as quantum field theory (QFT), a gap compounded by the scarcity of high-quality, verifiable domain-specific data. The work presents the first academic supervised fine-tuning (SFT) and reinforcement learning (RL) experiments on 7B-parameter reasoning models dedicated to theoretical physics, using a hybrid dataset that combines synthetically generated and human-adapted problems. It introduces a reproducible pipeline for data generation and adaptation, and releases approximately 200 million tokens of reasoning traces along with over 2,500 synthetic QFT problems. Benchmarks and an analysis of reasoning chains before and after training show substantial performance gains on QFT tasks and promising generalization to other areas of theoretical physics.
📝 Abstract
Despite the growing application of Large Language Models (LLMs) to theoretical physics, there is little academic exploration of how domain-specific physics reasoning ability develops during training. To investigate this, we perform the first academic fine-tuning study of small (7B-parameter) reasoning models dedicated specifically to theoretical physics. Because open-source, verifiable training data for such capabilities is scarce, we develop a robust data generation pipeline that can both create synthetic problems and make existing human-authored problems suitable for model training. Selecting Quantum Field Theory (QFT) as our primary domain, we generate over 2,500 synthetic problems alongside a curated collection of human-adapted problems sourced from arXiv and standard pedagogical resources. We conduct both Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) experiments, benchmarking performance gains as well as generalization to other physics domains. We perform an extensive analysis of model chains of thought before and after fine-tuning to understand how reasoning errors evolve during RL and SFT. Finally, we publicly release our data pipeline, verifiable QFT training data, and $\sim$200M tokens of QFT reasoning traces.