🤖 AI Summary
This work addresses the challenge of limited computational resources at inference time for large language models. We propose an implicit latent-space recursive reasoning architecture that dynamically extends reasoning depth at inference time, without increasing the number of generated tokens, by iteratively updating latent representations under a fixed output length. The method requires no explicit chain-of-thought annotations and no long-context dependencies; instead, a lightweight recurrent module unfolds multi-step reasoning directly in latent space, enabling efficient deployment with small context windows. The core contribution is an implicit depth-unfolding mechanism capable of modeling reasoning processes that are not easily expressed in words. Evaluated on multiple reasoning benchmarks, the model, trained at 3.5B parameters on 800B tokens, significantly outperforms same-scale baselines, with performance improving as test-time computation grows up to a load equivalent to a 50B-parameter model. These results empirically validate both the effectiveness and the scalability of latent-space deep reasoning.
📝 Abstract
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
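The recurrence described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the paper's implementation): a `prelude` embeds the input into latent space, a single shared `core` block is iterated `num_recurrences` times on the latent state, and a `coda` decodes the result. All function names and the toy arithmetic inside them are invented for illustration; the point is only that test-time compute scales with the iteration count, not with the number of generated tokens.

```python
def prelude(x):
    # Hypothetical embedding: map an input scalar to a 2-dim latent state.
    return [x, 0.0]

def core(state, x_embed):
    # Hypothetical shared recurrent block: mixes the current latent state
    # with the input embedding. The same weights are reused every iteration.
    a, b = state
    return [0.5 * a + 0.5 * x_embed[0], b + 0.01 * a]

def coda(state):
    # Hypothetical decoder: read a prediction out of the latent state.
    return state[0] + state[1]

def forward(x, num_recurrences):
    # Unroll the shared core block to arbitrary depth at test time.
    # More recurrences = more compute, with no extra tokens produced.
    x_embed = prelude(x)
    state = [0.0, 0.0]  # zero-initialized latent state
    for _ in range(num_recurrences):
        state = core(state, x_embed)
    return coda(state)

shallow = forward(1.0, num_recurrences=4)   # small test-time compute budget
deep = forward(1.0, num_recurrences=32)     # larger budget, same parameters
```

Because the core block's parameters are shared across iterations, the parameter count stays fixed while effective depth, and thus the compute budget, is chosen at inference time.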