🤖 AI Summary
To address the challenge of dynamically adjusting large language models' (LLMs) reasoning capabilities while balancing accuracy and computational cost, this paper proposes a training-free weight-arithmetic fusion method that systematically constructs a tunable reasoning-capability spectrum across the weight space of multiple models. We demonstrate, for the first time, that weight fusion, without additional parameters or training overhead, can simultaneously improve reasoning accuracy and token efficiency, achieving Pareto-optimal trade-offs. Extensive evaluation across multiple reasoning benchmarks, coupled with accuracy-efficiency curve analysis, confirms substantial performance gains and reduced token consumption on tasks including mathematical reasoning and commonsense question answering. This work establishes a quantifiable, continuously adjustable paradigm for LLM reasoning capability and provides a lightweight, deployment-oriented guideline for capability calibration.
📝 Abstract
The growing demand for large language models (LLMs) with tunable reasoning capabilities in many real-world applications highlights a critical need for methods that can efficiently produce a spectrum of models balancing reasoning depth and computational cost. Model merging has emerged as a promising, training-free technique to address this challenge by arithmetically combining the weights of a general-purpose model with a specialized reasoning model. While various merging techniques exist, their potential to create a spectrum of models with fine-grained control over reasoning abilities remains largely unexplored. This work presents a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks. We systematically vary merging strengths to construct accuracy-efficiency curves, providing the first comprehensive view of the tunable performance landscape. Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency, even when parent models have highly divergent weight spaces. Crucially, we identify instances of Pareto Improvement, where a merged model achieves both higher accuracy and lower token consumption than one of its parents. Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles to meet diverse application demands.
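The simplest instance of the merging described above is linear interpolation between the two parents' weights, with the merging strength as the tuning knob. The sketch below is illustrative only and not the paper's exact method: the function name `merge_weights` and the scalar "state dicts" are hypothetical stand-ins (real models would use tensor state dicts, e.g. from PyTorch), and sweeping `alpha` corresponds to tracing one point per value on the accuracy-efficiency curve.

```python
# Minimal sketch of training-free weight merging via linear interpolation.
# `merge_weights` and `alpha` are illustrative names, not from the paper;
# scalars stand in for weight tensors to keep the example self-contained.

def merge_weights(base, reasoning, alpha):
    """Interpolate two state dicts: alpha=0 -> base model, alpha=1 -> reasoning model."""
    assert base.keys() == reasoning.keys(), "parent models must share an architecture"
    return {k: (1.0 - alpha) * base[k] + alpha * reasoning[k] for k in base}

# Toy "state dicts" with scalar parameters for demonstration.
base_model = {"layer.0.w": 1.0, "layer.0.b": 0.0}
reasoning_model = {"layer.0.w": 3.0, "layer.0.b": 2.0}

# Sweeping the merging strength traces out a spectrum of intermediate models;
# each merged model would then be evaluated for accuracy and token usage.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    merged = merge_weights(base_model, reasoning_model, alpha)
    print(alpha, merged)
```

In practice the same interpolation would be applied per-tensor over full model checkpoints; more elaborate merging techniques (e.g. task-vector arithmetic) replace the interpolation rule but keep the same single tunable strength.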