Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

📅 2025-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address limits on LLM reasoning that training-time scaling has not resolved, including catastrophic forgetting and the limited availability of novel training data, this paper proposes LatentSeek, a test-time method that leaves model parameters unchanged. LatentSeek performs Test-Time Instance-level Adaptation (TTIA) in the model's latent space: for each problem instance, it uses policy gradient to iteratively update latent representations, guided by self-generated reward signals. Unlike prior test-time scaling methods, which operate in token space, it applies the additional test-time computation in the latent space. Evaluated on GSM8K, MATH-500, and AIME2024 across multiple LLM architectures, LatentSeek consistently outperforms baselines such as Chain-of-Thought prompting and fine-tuning-based methods. It typically converges within a few iterations on problems of average complexity and continues to benefit from additional iterations.

📝 Abstract

Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
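
The iterative update described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear "decoder" `W`, the reward `toy_self_reward`, the stopping rule, and every function name below are hypothetical stand-ins chosen so the example runs with the Python standard library alone. It only shows the general shape of a REINFORCE-style loop in which the decoder stays frozen and the latent vectors are the only thing updated.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(latents, W):
    """Greedy-decode one token per latent vector with a frozen linear map W.

    Returns the token ids and the per-position token probabilities.
    """
    tokens, probs = [], []
    for z in latents:
        logits = [sum(w * x for w, x in zip(row, z)) for row in W]
        p = softmax(logits)
        tokens.append(max(range(len(p)), key=p.__getitem__))
        probs.append(p)
    return tokens, probs


def toy_self_reward(tokens, target=10):
    """Hypothetical stand-in for a self-generated reward, in [-1, 0].

    A real system would have the model score its own answer instead.
    """
    return -min(1.0, abs(sum(tokens) - target) / target)


def latent_policy_gradient(latents, W, reward_fn, lr=0.5, max_iters=20):
    """Update only the latents (W is never modified) by reward-weighted ascent."""
    latents = [list(z) for z in latents]  # copy so the caller's data is untouched
    tokens, probs = decode(latents, W)
    reward = reward_fn(tokens)
    best_tokens, best_reward = tokens, reward
    history = [reward]
    for _ in range(max_iters):
        if reward >= 0.0:  # toy stopping rule: the maximum reward was reached
            break
        for z, y, p in zip(latents, tokens, probs):
            # d/dz log softmax(W z)[y] = W^T (onehot(y) - p)
            grad = [
                sum(((1.0 if k == y else 0.0) - p[k]) * W[k][j] for k in range(len(W)))
                for j in range(len(z))
            ]
            for j in range(len(z)):
                z[j] += lr * reward * grad[j]
        tokens, probs = decode(latents, W)
        reward = reward_fn(tokens)
        history.append(reward)
        if reward > best_reward:
            best_tokens, best_reward = tokens, reward
    return best_tokens, best_reward, history


if __name__ == "__main__":
    random.seed(0)
    W = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # 6 tokens, dim 4
    z0 = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]  # 3 positions
    print(latent_policy_gradient(z0, W, toy_self_reward))
```

Because the reward here is never positive, each step lowers the probability of the currently decoded tokens in proportion to how poorly they scored; the loop keeps the best-scoring decode seen so far rather than assuming the last iterate is the best.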

Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM reasoning via test-time latent space adaptation

Addressing catastrophic forgetting and limited training data in LLMs

Using policy gradient to improve reasoning performance without parameter updates

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages latent space for test-time reasoning

Uses policy gradient for iterative latent updates

Self-generated reward signals guide adaptation
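
Taken together, the three points above suggest an update of the following generic form. This is the standard REINFORCE step written for latent variables, given as a sketch; the notation is assumed here rather than taken from the paper, whose exact objective may differ. In it, x is the problem instance, z^{(k)} the latent representations at iteration k, y^{(k)} the output decoded from them, π the frozen model's output distribution, R the self-generated reward, and η a step size.

```latex
\[
z^{(k+1)} \;=\; z^{(k)} \;+\; \eta \, R\big(x, y^{(k)}\big)\,
\nabla_{z} \log \pi\big(y^{(k)} \mid x, z\big)\Big|_{z = z^{(k)}}
\]
```

Only z changes between iterations; the model parameters that define π are held fixed.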

🔎 Similar Papers

A Role of Environmental Complexity on Representation Learning in Deep Reinforcement Learning Agents