🤖 AI Summary
Large reasoning models (LRMs) suffer from performance degradation and computational waste due to overthinking. Existing routing strategies require parallel deployment of LLMs and LRMs, incurring high cost and low practicality. This paper proposes a training-free, intra-model inference control method: it dynamically analyzes the energy distribution across reasoning paths via singular value decomposition (SVD) and applies low-rank projections to selectively “forget” redundant reasoning steps, enabling adaptive switching between fast and slow thinking modes. To our knowledge, this is the first approach to achieve lightweight, dynamic regulation of inference depth within a single model—effectively mitigating overthinking. Evaluated on diverse structured reasoning tasks, it reduces computational overhead by 37% (average FLOPs) while preserving or even improving accuracy (+0.8%–2.1%). Our core contribution is a zero-training, low-overhead, plug-and-play mechanism for adaptive inference intensity control.
📝 Abstract
Large Reasoning Models (LRMs) excel in structured tasks by emulating deliberate human reasoning, but they often overthink, degrading performance and wasting resources. One possible baseline is to deploy both an LLM and an LRM, routing each input by predicting whether it requires reasoning and might cause overthinking. However, deploying multiple models can be costly or impractical. We propose a superposed deployment strategy with lightweight, training-free regulation that optimizes inference by switching a single model between fast and slow thinking modes. Instead of routing, we selectively unlearn from the LRM at inference time, scaling down computation while preserving reasoning ability. By analyzing the cumulative energy of singular values, we identify low-rank projections that calibrate reasoning depth appropriately.
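The energy-based rank selection described above can be sketched generically: compute the SVD of a weight matrix, find the smallest rank whose cumulative singular-value energy exceeds a threshold, and keep only that low-rank component. This is a minimal illustration of the general technique; the threshold, the choice of which matrices to project, and the dynamic per-input analysis are details of the paper not reproduced here.

```python
import numpy as np

def low_rank_projection(W, energy_threshold=0.9):
    """Project W onto its top singular directions that capture at least
    `energy_threshold` of the cumulative singular-value energy.

    Generic sketch of energy-based rank selection, not the paper's
    exact procedure or thresholds.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)        # cumulative energy ratio per rank
    r = int(np.searchsorted(energy, energy_threshold)) + 1  # smallest rank reaching threshold
    W_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]    # rank-r reconstruction
    return W_r, r
```

In this framing, a low energy threshold yields an aggressive projection (a "fast thinking" mode that forgets redundant directions), while a threshold near 1.0 retains the full reasoning capacity.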