🤖 AI Summary
This work addresses the challenge of limited computational resources at inference time for large language models. We propose an implicit latent-space recursive reasoning architecture that dynamically extends reasoning depth at inference time, without increasing the number of generated tokens, by iteratively updating latent representations under a fixed output length. The method requires no explicit chain-of-thought annotations and no long-context dependencies; instead, a lightweight recurrent module unfolds multi-step reasoning directly in latent space, enabling efficient deployment with small context windows. The core contribution is an implicit depth-unfolding mechanism capable of modeling reasoning processes that are not easily expressed in words. Evaluated on multiple reasoning benchmarks, the model, trained at 3.5B parameters on 800B tokens, significantly outperforms same-scale baselines, with performance improving as test-time computation grows up to a load equivalent to a 50B-parameter model. These results empirically validate both the effectiveness and the scalability of latent-space deep reasoning.
📝 Abstract
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
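The recurrence described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the paper's implementation): a `prelude` embeds the input into latent space, a single shared `core` block is iterated `num_recurrences` times on the latent state, and a `coda` decodes the result. All function names and the toy arithmetic inside them are invented for illustration; the point is only that test-time compute scales with the iteration count, not with the number of generated tokens.

```python
def prelude(x):
    # Hypothetical embedding: map an input scalar to a 2-dim latent state.
    return [x, 0.0]

def core(state, x_embed):
    # Hypothetical shared recurrent block: mixes the current latent state
    # with the input embedding. The same weights are reused every iteration.
    a, b = state
    return [0.5 * a + 0.5 * x_embed[0], b + 0.01 * a]

def coda(state):
    # Hypothetical decoder: read a prediction out of the latent state.
    return state[0] + state[1]

def forward(x, num_recurrences):
    # Unroll the shared core block to arbitrary depth at test time.
    # More recurrences = more compute, with no extra tokens produced.
    x_embed = prelude(x)
    state = [0.0, 0.0]  # zero-initialized latent state
    for _ in range(num_recurrences):
        state = core(state, x_embed)
    return coda(state)

shallow = forward(1.0, num_recurrences=4)   # small test-time compute budget
deep = forward(1.0, num_recurrences=32)     # larger budget, same parameters
```

Because the core block's parameters are shared across iterations, the parameter count stays fixed while effective depth, and thus the compute budget, is chosen at inference time.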