The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of dynamically adjusting large language models' (LLMs) reasoning capabilities while balancing accuracy and computational cost, this paper proposes a training-free method that arithmetically merges model weights to systematically construct a tunable spectrum of reasoning capability in weight space. The authors demonstrate for the first time that weight merging, without additional parameters or training overhead, can simultaneously improve reasoning accuracy and token efficiency, achieving Pareto-optimal trade-offs. Extensive evaluation across multiple reasoning benchmarks, coupled with accuracy-efficiency curve analysis, confirms substantial performance gains and reduced token consumption on tasks including mathematical reasoning and commonsense question answering. This work establishes a quantifiable, continuously adjustable paradigm for LLM reasoning capability and provides a lightweight, deployment-oriented guideline for capability calibration.

📝 Abstract
The growing demand for large language models (LLMs) with tunable reasoning capabilities in many real-world applications highlights a critical need for methods that can efficiently produce a spectrum of models balancing reasoning depth and computational cost. Model merging has emerged as a promising, training-free technique to address this challenge by arithmetically combining the weights of a general-purpose model with a specialized reasoning model. While various merging techniques exist, their potential to create a spectrum of models with fine-grained control over reasoning abilities remains largely unexplored. This work presents a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks. We systematically vary merging strengths to construct accuracy-efficiency curves, providing the first comprehensive view of the tunable performance landscape. Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency, even when parent models have highly divergent weight spaces. Crucially, we identify instances of Pareto Improvement, where a merged model achieves both higher accuracy and lower token consumption than one of its parents. Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles to meet diverse application demands.
Problem

Research questions and friction points this paper is trying to address.

Creating LLMs with tunable reasoning depth and cost
Evaluating model merging techniques for reasoning control
Calibrating trade-off between reasoning accuracy and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model merging combines general and specialized model weights
Varying merging strength controls reasoning accuracy and efficiency
Achieves Pareto improvements in accuracy and token consumption
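The merging idea above can be sketched as simple linear interpolation in weight space. This is a minimal illustration, not the paper's exact method: the function name, the toy state dicts, and the choice of alpha values are all assumptions, and the study evaluates a range of merging techniques beyond plain interpolation.

```python
# Sketch: linear weight-space merging of a general-purpose model ("base")
# with a specialized reasoning model ("reasoner"). Real models would use
# tensor-valued state dicts; scalars are used here to keep the example
# self-contained.

def merge_weights(base_state, reasoner_state, alpha):
    """Interpolate two state dicts: alpha=0 gives base, alpha=1 gives reasoner."""
    return {
        name: (1.0 - alpha) * base_state[name] + alpha * reasoner_state[name]
        for name in base_state
    }

# Toy parameter dictionaries standing in for full model checkpoints.
base = {"layer.weight": 1.0, "layer.bias": 0.0}
reasoner = {"layer.weight": 3.0, "layer.bias": 2.0}

# Sweeping the merging strength traces a spectrum of intermediate models;
# evaluating each point for accuracy and token consumption yields the
# accuracy-efficiency curves described in the abstract.
spectrum = [merge_weights(base, reasoner, a) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

Each element of `spectrum` is one candidate model on the tunable curve; the midpoint (`alpha=0.5`) averages the two parents' weights.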
Xiaochong Lan (Tsinghua University)
Yu Zheng (Massachusetts Institute of Technology, Cambridge, MA, USA)
Shiteng Cao (Shenzhen International Graduate School, Tsinghua University, Shenzhen, China)
Yong Li (Tsinghua University, Beijing, China)