Fast Thinking for Large Language Models

📅 2025-09-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Explicit chain-of-thought (CoT) reasoning in large language models (LLMs) incurs high latency and excessive token consumption. Method: We propose a synergistic framework comprising Latent Codebooks and GainRouter. First, we compress diverse reasoning strategies into compact discrete priors via implicit CoT distillation and discrete policy codebook learning. Second, we represent implicit reasoning states using continuous thought vectors and design a lightweight GainRouter to dynamically decide whether explicit reasoning steps are necessary, thereby suppressing redundant computation. Third, codebook-constrained conditional generation enables efficient, adaptive switching between implicit and explicit reasoning paths. Contribution/Results: Our method achieves state-of-the-art or competitive accuracy across multiple reasoning benchmarks while reducing average inference latency by 32% and output token count by 41%. It is the first approach to jointly model discrete strategy priors and continuous thought representations, significantly enhancing both efficiency and controllability of LLM inference.

Technology Category

Application Category

📝 Abstract
Reasoning-oriented Large Language Models (LLMs) often rely on generating explicit tokens step by step, and their effectiveness typically hinges on large-scale supervised fine-tuning or reinforcement learning. While Chain-of-Thought (CoT) techniques substantially enhance performance on complex reasoning tasks, they remain inefficient, requiring long reasoning traces that increase latency and token usage. In this work, we introduce Latent Codebooks for Fast Thinking, a framework that uses concise CoT sketches only during training to learn a codebook of discrete strategy priors. At inference, the model conditions on a handful of continuous thinking vectors distilled from the codebook in a single pass, enabling strategy-level guidance without producing explicit reasoning tokens. To complement this design, we propose GainRouter, a lightweight routing mechanism that adaptively switches between fast codebook guided inference and slow explicit reasoning, thereby suppressing overthinking and reducing unnecessary token generation. Experiments across multiple reasoning benchmarks show that our approach achieves competitive or superior accuracy while substantially lowering inference cost, offering a practical path toward efficient and controllable reasoning in large language models.
Problem

Research questions and friction points this paper is trying to address.

Reducing reasoning latency and token usage
Learning discrete strategy priors without explicit tokens
Adaptively switching between fast and slow reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent codebooks enable single-pass strategy guidance
GainRouter adaptively switches reasoning modes
Concise training sketches distill continuous thinking vectors
🔎 Similar Papers
No similar papers found.