🤖 AI Summary
STaR (Self-Taught Reasoner) lacks a formal theoretical foundation—specifically, it remains unclear why STaR consistently improves large language models’ reasoning capabilities despite the absence of high-quality chain-of-thought (CoT) annotations and even in the presence of erroneous reasoning steps.
Method: We formalize CoT as a Markov decision process and integrate reinforcement learning theory to establish a rigorous analytical framework, yielding: (i) pretraining model initialization conditions; (ii) a policy iteration improvement mechanism; (iii) convergence criteria for the optimal reasoning policy; and (iv) a noise-tolerance robustness theorem.
Contribution/Results: We provide the first formal proof that STaR monotonically improves weak initial policies under verifiable sufficient conditions; we derive necessary and sufficient conditions for convergence to the optimal reasoning policy; and we quantify the generalization error bound under noisy CoT samples. Together, these results establish a verifiable, theoretically grounded foundation for self-supervised reasoning.
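The policy-iteration mechanism behind STaR can be illustrated with a toy simulation. The sketch below is a hypothetical, heavily simplified model (not the paper's formalism): a "reasoning policy" is just a probability of choosing a good strategy over a bad one, each STaR round samples rationales, keeps only those whose final answer verifies, and re-fits the policy to the kept set. The strategy names and accuracy numbers are invented for illustration.

```python
import random

random.seed(0)

# Toy setup (assumed, not from the paper): two reasoning strategies.
# The "good" strategy reaches a correct answer 90% of the time, "bad" 20%.
ACCURACY = {"good": 0.9, "bad": 0.2}

def star_iteration(p_good, n_samples=2000):
    """One simplified STaR round: sample rationales from the current
    policy, keep those whose final answer checks out, then re-fit the
    policy by maximum likelihood on the filtered (kept) rationales."""
    kept = []
    for _ in range(n_samples):
        strategy = "good" if random.random() < p_good else "bad"
        answer_correct = random.random() < ACCURACY[strategy]
        if answer_correct:  # correctness filter keeps this rationale
            kept.append(strategy)
    # New policy = empirical frequency of "good" among kept rationales.
    return sum(s == "good" for s in kept) / len(kept)

p = 0.3  # weak initial policy: mostly bad reasoning
for t in range(5):
    p = star_iteration(p)
    print(f"round {t}: P(good strategy) = {p:.3f}")
```

Because correct answers are more likely under the good strategy, the filtered data over-represents it, so each re-fit shifts probability mass toward it; even from a weak initialization, P(good) climbs monotonically toward a fixed point near 1, mirroring (in miniature) the monotone-improvement claim above.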
📝 Abstract
The reasoning abilities of large language models (LLMs) have improved with chain-of-thought (CoT) prompting, allowing models to solve complex tasks stepwise. However, training CoT capabilities requires detailed reasoning data, which is often scarce. The self-taught reasoner (STaR) framework addresses this by using reinforcement learning to automatically generate reasoning steps, reducing reliance on human-labeled data. Although STaR and its variants have demonstrated empirical success, a theoretical foundation explaining these improvements is lacking. This work provides a theoretical framework for understanding the effectiveness of reinforcement learning on CoT reasoning and STaR. Our contributions are: (1) criteria for the quality of pre-trained models necessary to initiate effective reasoning improvement; (2) an analysis of policy improvement, showing why LLM reasoning improves iteratively with STaR; (3) conditions for convergence to an optimal reasoning policy; and (4) an examination of STaR's robustness, explaining how it can improve reasoning even when incorporating occasional incorrect steps. This framework aims to bridge empirical findings with theoretical insights, advancing reinforcement learning approaches for reasoning in LLMs.