🤖 AI Summary
This work proposes LaTER (Latent-Then-Explicit Reasoning), a two-stage reasoning paradigm designed to reduce the computational cost of Chain-of-Thought (CoT) inference while preserving or improving accuracy. LaTER first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In its training-free form, it relies on the structured latent trajectories that strong reasoning models already exhibit: final-layer hidden states are projected back to the input embedding space, the latent KV cache is preserved, and entropy and model-native stop-token probes decide when to switch. The authors also introduce Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations, and fine-tune with latent rollout and halting supervision. On Qwen3-14B, training-free LaTER reduces total token usage by 16%–32% on several benchmarks and improves AIME 2025 accuracy from 70.0% to 73.3%; the fine-tuned model reaches 80.0%, 10.0 points above the standard CoT baseline, while using 33% fewer tokens.
📝 Abstract
Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous states, yet replacing explicit derivations with latent computation can hurt tasks that require symbolic checking. We propose Latent-Then-Explicit Reasoning (LaTER), a two-stage paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In a training-free instantiation, LaTER projects final-layer hidden states back to the input embedding space, preserves the latent KV cache, and uses entropy and model-native stop-token probes to decide when to switch. We find that strong reasoning models already exhibit structured latent trajectories under this interface. On Qwen3-14B, training-free LaTER reduces total token usage by 16%-32% on several benchmarks while matching or improving accuracy on most of them; for example, it improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661. We further construct Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations. Fine-tuning with latent rollout and halting supervision yields additional gains: trained LaTER reaches 80.0% accuracy on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens. Our code, data, and model are available at https://github.com/TioeAre/LaTER.
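The training-free switching rule described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not specify the back-projection formula, the direction of the entropy test, or any thresholds, so the probability-weighted embedding mix in `back_project`, the "switch when entropy is low" rule, and every default value in `should_switch` are assumptions made only for illustration. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def back_project(probs: Sequence[float],
                 embedding_table: Sequence[Sequence[float]]) -> List[float]:
    """ASSUMPTION: map a next-token distribution to the input embedding space
    as a probability-weighted mix of input embeddings. The paper's actual
    projection of final-layer hidden states may differ."""
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)]


def should_switch(probs: Sequence[float], stop_token_id: int, step: int, *,
                  max_latent_steps: int = 64,
                  entropy_threshold: float = 0.5,
                  stop_prob_threshold: float = 0.5) -> bool:
    """ASSUMPTION: leave the latent stage when the step budget is exhausted,
    when the stop-token probe fires, or when the distribution is confident
    (low entropy). Thresholds are illustrative, not taken from the paper."""
    if step >= max_latent_steps:
        return True  # bounded exploration: hard cap on latent steps
    if probs[stop_token_id] >= stop_prob_threshold:
        return True  # the model's own stop token is likely
    return entropy(probs) <= entropy_threshold


def token_reduction(baseline_tokens: int, reduced_tokens: int) -> float:
    """Fraction of tokens saved relative to the baseline."""
    return 1.0 - reduced_tokens / baseline_tokens


if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.1])
    print("switch now?", should_switch(probs, stop_token_id=2, step=3))
    # AIME 2025 token counts quoted in the abstract: 15,730 -> 10,661.
    print(f"token reduction: {token_reduction(15730, 10661):.1%}")
```

With the token counts quoted in the abstract, `token_reduction(15730, 10661)` evaluates to about 32.2%, which matches the upper end of the reported 16%-32% range.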