📝 Abstract
A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to allocate more inference-time compute to search against a verifier or reward model. This process can then be used to refine the pretrained model or distill its reasoning patterns into more efficient models. In this paper, we study inference-time compute by viewing chain-of-thought (CoT) generation as a metastable Markov process: easy reasoning steps (e.g., algebraic manipulations) form densely connected clusters, while hard reasoning steps (e.g., applying a relevant theorem) create sparse, low-probability edges between clusters, leading to phase transitions at longer timescales. Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be used to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and moreover (2) a compressed metastable representation of the reasoning dynamics can be distilled into a smaller, more efficient model.
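The metastability picture above can be made concrete with a toy numerical sketch (not from the paper; all transition probabilities and the up-weighting factor below are made-up illustrative values). Two dense states form an "easy" cluster, the only exit is a single sparse edge with probability `eps`, and expected escape times are computed by solving the standard hitting-time system (I - Q)t = 1. Up-weighting the sparse edge, a stand-in for a search protocol that rewards inter-cluster transitions, shrinks the expected number of steps to leave the cluster:

```python
import numpy as np

# Toy metastable chain: states {0,1} form a dense "easy" cluster,
# states {2,3} the target cluster. The only route out of the first
# cluster is a sparse, low-probability edge (1 -> 2, prob. eps).
eps = 0.02
P = np.array([
    [0.50, 0.50, 0.00, 0.00],
    [0.49, 0.49, eps,  0.00],
    [0.00, 0.00, 0.50, 0.50],
    [0.00, 0.00, 0.50, 0.50],
])

def hitting_time(P, transient_states):
    """Expected steps to leave transient_states, via (I - Q) t = 1,
    where Q is P restricted to the transient states."""
    S = list(transient_states)
    Q = P[np.ix_(S, S)]
    return np.linalg.solve(np.eye(len(S)) - Q, np.ones(len(S)))

t_base = hitting_time(P, [0, 1])  # ~[101, 99]: escape takes O(1/eps) steps

# "Search" sketch: reward the sparse inter-cluster edge by up-weighting
# it (factor 10 here, purely illustrative) and renormalizing the row.
P_search = P.copy()
P_search[1, 2] *= 10
P_search[1] /= P_search[1].sum()

t_search = hitting_time(P_search, [0, 1])
print(t_base, t_search)  # escape time drops by roughly an order of magnitude
```

This mirrors the abstract's claim qualitatively: the long timescale of the chain is controlled by the sparse edge, so a protocol that biases generation toward such edges directly reduces the expected number of steps to reach a new cluster.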