🤖 AI Summary
This work addresses the limitations of traditional chain-of-thought methods, which incur high computational costs in discrete token spaces and often collapse onto a single reasoning path, as well as existing implicit reasoning approaches that rely on fixed step counts and lack dynamic termination mechanisms. The authors reformulate implicit reasoning as a planning process with adaptive termination by decoupling reasoning from language generation. Reasoning occurs in a continuous latent space via deterministic state trajectories, while a separate decoder produces textual outputs on demand. This approach is the first to enable adaptive reasoning length, substantially enhancing reasoning diversity and scalability. Experimental results show that, although greedy accuracy is slightly lower than that of baselines, the model explores a broader solution space, providing a transparent and flexible foundation for inference-time search.
📝 Abstract
Chain-of-Thought (CoT) empowers Large Language Models (LLMs) to tackle complex problems, but remains constrained by high computational cost and reasoning-path collapse when grounded in discrete token spaces. Recent latent reasoning approaches attempt to improve efficiency by performing reasoning within continuous hidden states. However, these methods typically operate as opaque end-to-end mappings from explicit reasoning steps to latent states, and often require a pre-defined number of latent steps at inference time. In this work, we introduce PLaT (Planning with Latent Thoughts), a framework that reformulates latent reasoning as planning by fundamentally decoupling reasoning from verbalization. We model reasoning as a deterministic trajectory of latent planning states, while a separate Decoder grounds these thoughts into text when necessary. This decoupling allows the model to dynamically determine when to terminate reasoning rather than relying on a fixed hyperparameter. Empirical results on mathematical benchmarks reveal a distinct trade-off: while PLaT achieves lower greedy accuracy than baselines, it demonstrates superior scalability in terms of reasoning diversity. This indicates that PLaT learns a robust, broader solution space, offering a transparent and scalable foundation for inference-time search. Our code can be found at https://github.com/yunsaijc/PLaT.
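To make the decoupling concrete, the control flow the abstract describes can be sketched as a loop: a planner evolves a latent state deterministically, a termination check decides when to stop (instead of a fixed step count), and a separate decoder verbalizes the final thought on demand. This is a minimal illustrative sketch, not PLaT's actual implementation; every module here (the linear-tanh planner, the norm-based stopping rule, the string-producing decoder) is a stand-in assumption for the learned components.

```python
# Hypothetical sketch of decoupled latent reasoning with adaptive termination.
# All components are illustrative stand-ins, not the PLaT model itself.
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

# Stand-in "planner": a fixed linear map plus tanh gives a deterministic
# trajectory of latent planning states (a contraction for small weights).
W_plan = rng.normal(scale=0.1, size=(DIM, DIM))

def plan_step(state):
    return np.tanh(W_plan @ state)

# Stand-in termination decision: stop when the latent state has converged
# (a proxy for a learned stop signal, so no fixed step count is needed).
def should_stop(prev, curr, tol=1e-3):
    return np.linalg.norm(curr - prev) < tol

# Stand-in decoder: grounds the latent thought into text only when needed.
def decode(state):
    return f"answer decoded from latent state (norm={np.linalg.norm(state):.3f})"

def reason(x0, max_steps=100):
    """Run latent planning steps until the termination check fires."""
    state = x0
    for step in range(1, max_steps + 1):
        nxt = plan_step(state)
        if should_stop(state, nxt):
            return decode(nxt), step  # adaptive reasoning length
        state = nxt
    return decode(state), max_steps  # safety cap

text, steps = reason(rng.normal(size=DIM))
print(f"terminated after {steps} latent steps: {text}")
```

The point of the sketch is the separation of concerns: `plan_step` never touches tokens, and `decode` is only invoked once the trajectory settles, so reasoning length varies per input rather than being a hyperparameter.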