Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chain-of-thought (CoT) methods rely on post-hoc, explicit generation of sequential natural-language reasoning steps, which cannot be modeled during pretraining and are inherently difficult to parallelize. This work proposes ThoughtBubbles, a Transformer variant that implicitly and in parallel models multi-path reasoning in latent space via learnable residual-stream branching and pruning, enabling unsupervised, adaptive expansion of computation during pretraining. Its core innovation lies in modeling "thought bubbles" as dynamic residual operations, unifying reasoning behavior across training and inference while requiring only the standard language modeling loss. Experiments show that ThoughtBubbles reduces perplexity on OpenWebText and peS2o and achieves significantly stronger zero-shot performance than baselines, including non-adaptive parallel-computation methods, on HellaSwag, LAMBADA, and other reasoning-intensive benchmarks.

📝 Abstract
Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they cannot be applied during pretraining and are limited to serially-generated, natural-language verbalization as the only way to scale inference-time compute. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens that require a large amount of computation can form a "bubble" of cloned residuals in the middle of the network for additional thinking. Crucially, this behavior is learned during pretraining with only language modeling loss. Thoughtbubbles outperforms both standard decoder LMs and non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity, and in zero-shot evaluations such as HellaSwag and LAMBADA, after pretraining at scales from 150M to 772M parameters. The implicit nature of our method enables adaptive computation to be learned starting at pretraining time, paving the way to unify train- and test-time behavior for reasoning models.
Problem

Research questions and friction points this paper is trying to address.

Enabling parallel adaptive computation in latent space
Overcoming serial natural-language limitations in transformers
Unifying adaptive computation training and inference phases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel adaptive computation in latent space
Learning to fork or delete residual streams
Implicit adaptive computation during pretraining
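The paper does not include implementation details here, but the fork/delete idea can be illustrated with a toy sketch. Assuming each residual stream carries a learned scalar score (the function name, thresholds, and scoring scheme below are hypothetical, not from the paper), a mid-network layer could clone high-scoring streams into a "bubble" for extra computation and drop low-scoring ones:

```python
import numpy as np

def fork_and_prune(residuals, scores, fork_thresh=0.7, prune_thresh=0.3):
    """Toy illustration of adaptive residual-stream branching.

    residuals: list of per-token residual vectors (np.ndarray)
    scores:    hypothetical learned scalar in [0, 1] per stream
    Streams scoring >= fork_thresh are cloned (a "bubble" for extra
    computation); streams scoring <= prune_thresh are deleted.
    """
    out = []
    for r, s in zip(residuals, scores):
        if s >= fork_thresh:
            out.append(r)         # keep the original stream
            out.append(r.copy())  # fork a clone for additional thinking
        elif s > prune_thresh:
            out.append(r)         # keep unchanged
        # else: stream is pruned (deleted)
    return out

# Example: three token streams with high, medium, and low scores.
streams = [np.full(4, float(i)) for i in range(3)]
result = fork_and_prune(streams, [0.9, 0.5, 0.1])
# The first stream is forked into two, the second kept, the third pruned.
```

In the actual method the fork/delete decisions are differentiable and trained end-to-end with the language modeling loss, rather than hard thresholds as in this sketch.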