LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes LaTER (Latent-Then-Explicit Reasoning), a two-stage paradigm that aims to cut the inference cost of Chain-of-Thought (CoT) reasoning while preserving or improving accuracy. LaTER first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. A training-free variant relies on three techniques: back-projection of final-layer hidden states into the input embedding space, reuse of the latent KV cache, and a switching rule driven by entropy and model-native stop-token probes. The authors also build Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations, and fine-tune with latent rollout and halting supervision. On Qwen3-14B, training-free LaTER reduces total token usage by 16%–32% on several benchmarks and improves AIME 2025 accuracy from 70.0% to 73.3%; the fine-tuned model reaches 80.0% on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens.
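
To make the back-projection idea concrete, here is a minimal pure-Python sketch. It rests on an assumption the summary does not confirm: that a hidden state is mapped back to input space as the probability-weighted average of input embeddings. The names `back_project`, `unembed`, and `embed` are hypothetical and the matrices are toy values, so this is an illustration of one plausible interface, not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def back_project(hidden, unembed, embed):
    """Map a final-layer hidden state to a vector in input embedding space.

    ASSUMPTION: the projection is a probability-weighted mixture of input
    embeddings. The source only says hidden states are projected back.

    hidden:  list of d floats (final-layer state)
    unembed: V rows of d floats (hypothetical output head)
    embed:   V rows of d floats (hypothetical input embedding table)
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    probs = softmax(logits)
    dim = len(embed[0])
    return [sum(p * row[j] for p, row in zip(probs, embed)) for j in range(dim)]

# Toy example: two-token vocabulary, two-dimensional embeddings.
embed = [[1.0, 0.0], [0.0, 1.0]]
unembed = [[1.0, 0.0], [0.0, 1.0]]
next_input = back_project([0.0, 0.0], unembed, embed)
```

With a zero hidden state both tokens are equally likely, so `next_input` is the midpoint of the two embeddings. In a real model this vector would be fed as the next input while the KV cache from earlier latent steps is kept.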

📝 Abstract

Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous states, yet replacing explicit derivations with latent computation can hurt tasks that require symbolic checking. We propose Latent-Then-Explicit Reasoning (LaTER), a two-stage paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In a training-free instantiation, LaTER projects final-layer hidden states back to the input embedding space, preserves the latent KV cache, and uses entropy and model-native stop-token probes to decide when to switch. We find that strong reasoning models already exhibit structured latent trajectories under this interface. On Qwen3-14B, training-free LaTER reduces total token usage by 16%-32% on several benchmarks while matching or improving accuracy on most of them; for example, it improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661. We further construct Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations. Fine-tuning with latent rollout and halting supervision yields additional gains: trained LaTER reaches 80.0% accuracy on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens. Our code, data, and model are available at https://github.com/TioeAre/LaTER.
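
The switching decision described in the abstract can be sketched as a small rule over the next-token distribution. Everything specific here is an assumption for illustration: the function name `should_switch`, the threshold values, the step budget, and the choice to treat low entropy as a sign that latent exploration has settled. The abstract states only that entropy and stop-token probes are used and that exploration is bounded.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_switch(probs, stop_id, step, max_steps=8,
                  entropy_threshold=0.5, stop_threshold=0.5):
    """Decide whether to leave latent exploration and start explicit CoT.

    ASSUMPTIONS (not from the source): all thresholds, the step budget,
    and the direction of the entropy test are hypothetical.
    """
    if step >= max_steps:                  # bounded exploration: hard budget
        return True
    if probs[stop_id] >= stop_threshold:   # stop-token probe fires
        return True
    return entropy(probs) <= entropy_threshold  # distribution has sharpened

# Toy four-token vocabulary; index 3 plays the role of the stop token.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
stop_heavy = [0.1, 0.1, 0.2, 0.6]

keep_exploring = not should_switch(uniform, stop_id=3, step=0)
switch_on_entropy = should_switch(peaked, stop_id=3, step=0)
switch_on_stop = should_switch(stop_heavy, stop_id=3, step=0)
switch_on_budget = should_switch(uniform, stop_id=3, step=8)
```

The step budget is checked first so that exploration stays bounded even when neither probe fires, which matches the "bounded exploration" wording in the abstract.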

Problem

Research questions and friction points this paper is trying to address.

Chain-of-thought reasoning

test-time efficiency

token reduction

latent reasoning

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Reasoning

Chain-of-Thought

Test-Time Efficiency