Continuous Chain of Thought Enables Parallel Exploration and Reasoning

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Discrete chain-of-thought (CoT) reasoning is limited in expressivity and serially inefficient because it commits to a single sampled token at each step. Method: We propose continuous chain-of-thought (CoT2), which reasons with continuously valued tokens instead of samples from a finite vocabulary, allowing the model to track multiple reasoning traces in parallel. Theoretically, a one-layer Transformer equipped with CoT2 can provably solve the combinatorial subset sum problem given sufficient embedding dimension. The approach combines a supervision strategy that matches softmax outputs to the empirical token distributions of target traces, a K-token sampling scheme that composes K discrete tokens per step (reducing to standard CoT when K=1), continuous exploration over the probability simplex, and policy-optimization-driven self-improvement. Contribution/Results: Experiments show that CoT2 outperforms standard CoT on logical reasoning tasks that require search, and that policy optimization improves the model beyond its initial discrete or continuous supervision, validating continuous reasoning as an expressive and parallelizable alternative.

📝 Abstract
Current language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. While this discrete sampling has achieved remarkable success, conducting chain-of-thought with continuously-valued tokens (CoT2) offers a richer and more expressive alternative. Our work examines the benefits of CoT2 through logical reasoning tasks that inherently require search capabilities and provide optimization and exploration methods for CoT2. Theoretically, we show that CoT2 allows the model to track multiple traces in parallel and quantify its benefits for inference efficiency. Notably, a one-layer transformer equipped with CoT2 can provably solve the combinatorial "subset sum problem" given sufficient embedding dimension. These insights lead to a novel and effective supervision strategy where we match the softmax outputs to the empirical token distributions of a set of target traces. Complementing this, we introduce sampling strategies that unlock policy optimization and self-improvement for CoT2. Our first strategy samples and composes $K$ discrete tokens at each decoding step to control the level of parallelism, and reduces to standard CoT when $K=1$. Our second strategy relies on continuous exploration over the probability simplex. Experiments confirm that policy optimization with CoT2 indeed improves the performance of the model beyond its initial discrete or continuous supervision.
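The core idea in the abstract can be sketched in a few lines: instead of committing to one sampled token, a CoT2-style continuous token is the expectation of the token embeddings under the model's softmax distribution, letting the model carry several candidate traces forward at once. This is a hypothetical illustration (all names and values here are invented, not the authors' code):

```python
import math
import random

random.seed(0)
vocab_size, dim = 8, 4
# Toy embedding table and one step of model logits (illustrative values).
E = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
logits = [random.gauss(0, 1) for _ in range(vocab_size)]

# Softmax over the vocabulary (numerically stabilized).
m = max(logits)
exps = [math.exp(x - m) for x in logits]
Z = sum(exps)
p = [e / Z for e in exps]

# Continuous token: probability-weighted mixture of embedding rows,
# i.e. a point in the convex hull of the vocabulary embeddings.
z_continuous = [sum(p[v] * E[v][d] for v in range(vocab_size)) for d in range(dim)]

# Standard discrete CoT would instead commit to a single row, e.g. the argmax.
z_discrete = E[max(range(vocab_size), key=lambda v: p[v])]
```

Feeding `z_continuous` back as the next input token is what lets a superposition of traces propagate through the transformer rather than a single sampled path.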
Problem

Research questions and friction points this paper is trying to address.

Exploring benefits of continuous-valued tokens for reasoning tasks
Enabling parallel reasoning traces with CoT2 for efficiency
Developing supervision and sampling strategies for CoT2 optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous-valued tokens enhance reasoning expressiveness
Parallel trace tracking boosts inference efficiency
Novel sampling strategies enable policy optimization
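The first sampling strategy from the abstract can be sketched as follows. This is a minimal, hypothetical illustration (the helper name and setup are invented, not the authors' code): sample K discrete tokens from the softmax distribution and compose their embeddings into one continuous token, so that K controls the level of parallelism and K=1 recovers standard discrete CoT.

```python
import random

def compose_k_tokens(p, E, K, rng):
    """Sample K token ids from distribution p and average their embeddings."""
    dim = len(E[0])
    idx = rng.choices(range(len(p)), weights=p, k=K)  # K sampled token ids
    return [sum(E[i][d] for i in idx) / K for d in range(dim)]

rng = random.Random(0)
vocab_size, dim = 8, 4
E = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
p = [1.0 / vocab_size] * vocab_size  # uniform next-token distribution (toy)

tok_parallel = compose_k_tokens(p, E, K=4, rng=rng)  # parallel exploration
tok_discrete = compose_k_tokens(p, E, K=1, rng=rng)  # exactly one embedding row
```

Because the composed token is a stochastic function of the sampled ids, this kind of exploration is what makes policy-gradient-style optimization applicable on top of the supervised model.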
Halil Alperen Gozeten
University of Michigan - Ann Arbor
M. E. Ildiz
University of Michigan - Ann Arbor
Xuechen Zhang
University of Michigan - Ann Arbor
Hrayr Harutyunyan
Research Scientist, Google DeepMind
Large Language Models, Pre-training, Post-training, Learning Theory, Information Theory
A. Rawat
Google Research NYC
Samet Oymak
University of Michigan | Google Research
machine learning, decision making, statistics, optimization, language models