📝 Abstract
A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to allocate more inference-time compute to search against a verifier or reward model. This process can then be used to refine the pretrained model or distill its reasoning patterns into more efficient models. In this paper, we study inference-time compute by viewing chain-of-thought (CoT) generation as a metastable Markov process: easy reasoning steps (e.g., algebraic manipulations) form densely connected clusters, while hard reasoning steps (e.g., applying a relevant theorem) create sparse, low-probability edges between clusters, leading to phase transitions at longer timescales. Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be used to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and moreover (2) a compressed metastable representation of the reasoning dynamics can be distilled into a smaller, more efficient model.
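The metastability picture above can be made concrete with a toy numerical sketch (not from the paper; all transition probabilities and the up-weighting factor below are made-up illustrative values). Two dense states form an "easy" cluster, the only exit is a single sparse edge with probability `eps`, and expected escape times are computed by solving the standard hitting-time system (I - Q)t = 1. Up-weighting the sparse edge, a stand-in for a search protocol that rewards inter-cluster transitions, shrinks the expected number of steps to leave the cluster:

```python
import numpy as np

# Toy metastable chain: states {0,1} form a dense "easy" cluster,
# states {2,3} the target cluster. The only route out of the first
# cluster is a sparse, low-probability edge (1 -> 2, prob. eps).
eps = 0.02
P = np.array([
    [0.50, 0.50, 0.00, 0.00],
    [0.49, 0.49, eps,  0.00],
    [0.00, 0.00, 0.50, 0.50],
    [0.00, 0.00, 0.50, 0.50],
])

def hitting_time(P, transient_states):
    """Expected steps to leave transient_states, via (I - Q) t = 1,
    where Q is P restricted to the transient states."""
    S = list(transient_states)
    Q = P[np.ix_(S, S)]
    return np.linalg.solve(np.eye(len(S)) - Q, np.ones(len(S)))

t_base = hitting_time(P, [0, 1])  # ~[101, 99]: escape takes O(1/eps) steps

# "Search" sketch: reward the sparse inter-cluster edge by up-weighting
# it (factor 10 here, purely illustrative) and renormalizing the row.
P_search = P.copy()
P_search[1, 2] *= 10
P_search[1] /= P_search[1].sum()

t_search = hitting_time(P_search, [0, 1])
print(t_base, t_search)  # escape time drops by roughly an order of magnitude
```

This mirrors the abstract's claim qualitatively: the long timescale of the chain is controlled by the sparse edge, so a protocol that biases generation toward such edges directly reduces the expected number of steps to reach a new cluster.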