Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often exhibit “overthinking” on complex tasks, generating excessively long, inefficient reasoning chains with only marginal quality gains. Method: this paper presents the first systematic exploration of model merging for Long-to-Short (L2S) reasoning, evaluating three strategies (task-vector-based, SVD-based, and activation-informed merging) across 1.5B to 32B LLMs. Contribution/Results: model scale critically affects merging efficacy, and merged models inherently support self-critique, self-correction, and adaptive response-length control per task. Empirically, merging reduces average response length by up to 55% while maintaining or improving reasoning quality, and it shows strong efficiency, stability, and generalization across diverse multitask benchmarks. This work provides a scalable, low-overhead pathway to reconcile System 1-style speed with System 2-style depth in LLM inference.
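
To make the task-vector strategy concrete, the snippet below sketches plain task-arithmetic interpolation between a quick-thinking base model and a long-CoT reasoning model. The model paths and the coefficient alpha are placeholders rather than the paper's reported configuration; this is only a minimal illustration of the idea.

```python
# Rough sketch of task-vector (task-arithmetic) merging for Long-to-Short reasoning.
# All model paths and the coefficient ALPHA are placeholders, not the paper's
# reported configuration.
import torch
from transformers import AutoModelForCausalLM

BASE_ID = "path/to/quick-thinking-base"           # System 1-style instruct model (placeholder)
REASONER_ID = "path/to/long-cot-reasoning-model"  # System 2-style long-CoT model (placeholder)
ALPHA = 0.5                                       # merge strength; tuned per model scale in practice

base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
reasoner = AutoModelForCausalLM.from_pretrained(REASONER_ID, torch_dtype=torch.bfloat16)

base_state = base.state_dict()
reasoner_state = reasoner.state_dict()
merged_state = {}
for name, w_base in base_state.items():
    if w_base.is_floating_point():
        # task vector: what the long-CoT model learned on top of the quick-thinking model
        tau = reasoner_state[name] - w_base
        merged_state[name] = w_base + ALPHA * tau
    else:
        merged_state[name] = w_base  # leave integer buffers untouched

base.load_state_dict(merged_state)
base.save_pretrained("merged-l2s-model")
```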

📝 Abstract
The transition from System 1 to System 2 reasoning in large language models (LLMs) has marked significant advancements in handling complex tasks through deliberate, iterative thinking. However, this progress often comes at the cost of efficiency, as models tend to overthink, generating redundant reasoning steps without proportional improvements in output quality. Long-to-Short (L2S) reasoning has emerged as a promising solution to this challenge, aiming to balance reasoning depth with practical efficiency. While existing approaches, such as supervised fine-tuning (SFT), reinforcement learning (RL), and prompt engineering, have shown potential, they are either computationally expensive or unstable. Model merging, on the other hand, offers a cost-effective and robust alternative by integrating the quick-thinking capabilities of System 1 models with the methodical reasoning of System 2 models. In this work, we present a comprehensive empirical study on model merging for L2S reasoning, exploring diverse methodologies, including task-vector-based, SVD-based, and activation-informed merging. Our experiments reveal that model merging can reduce average response length by up to 55% while preserving or even improving baseline performance. We also identify a strong correlation between model scale and merging efficacy with extensive evaluations on 1.5B/7B/14B/32B models. Furthermore, we investigate the merged model's ability to self-critique and self-correct, as well as its adaptive response length based on task complexity. Our findings highlight model merging as a highly efficient and effective paradigm for L2S reasoning, offering a practical solution to the overthinking problem while maintaining the robustness of System 2 reasoning. This work can be found on GitHub: https://github.com/hahahawu/Long-to-Short-via-Model-Merging.
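
For the SVD-based variant named in the abstract, one minimal sketch (under assumptions) is to replace the task vector, i.e. the reasoning model's weights minus the base model's, with a truncated-SVD low-rank approximation before adding it back. The rank, the scaling factor, and the handling of non-matrix parameters below are illustrative choices, not the paper's exact recipe.

```python
# Rough sketch of an SVD-based merge: the task vector (reasoning model minus base model)
# is replaced by a truncated-SVD low-rank approximation before being added back.
# Rank, scaling, and the fallback for non-matrix parameters are illustrative assumptions.
import torch

def lowrank_task_vector(w_base: torch.Tensor, w_expert: torch.Tensor, rank: int = 64) -> torch.Tensor:
    """Rank-`rank` approximation of (w_expert - w_base) for a 2-D weight matrix."""
    tau = (w_expert - w_base).float()
    U, S, Vh = torch.linalg.svd(tau, full_matrices=False)
    r = min(rank, S.numel())
    return (U[:, :r] * S[:r]) @ Vh[:r, :]

def svd_merge(base_state: dict, expert_state: dict, alpha: float = 0.5, rank: int = 64) -> dict:
    """Merge two state dicts that share the same architecture."""
    merged = {}
    for name, w_base in base_state.items():
        w_expert = expert_state[name]
        if w_base.is_floating_point() and w_base.ndim == 2:
            # keep only the dominant singular directions of the task vector
            delta = lowrank_task_vector(w_base, w_expert, rank).to(w_base.dtype)
            merged[name] = w_base + alpha * delta
        elif w_base.is_floating_point():
            # 1-D parameters (norms, biases): plain interpolation as a fallback
            merged[name] = w_base + alpha * (w_expert - w_base)
        else:
            merged[name] = w_base
    return merged
```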
Problem

Research questions and friction points this paper is trying to address.

Balancing reasoning depth with efficiency in LLMs
Reducing redundant reasoning steps without quality loss
Integrating System 1 and System 2 model capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model merging balances System 1 speed with System 2 reasoning depth (a sketch of the activation-informed variant follows this list)
Reduces average response length by up to 55% without performance loss
Merging efficacy correlates strongly with model scale, evaluated on 1.5B/7B/14B/32B models
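
As referenced in the list above, here is a rough sketch of the activation-informed variant, under the assumption that per-channel activation statistics from a small calibration set gate how strongly the task vector is applied to each layer. This is one plausible reading of the idea, not the paper's exact algorithm.

```python
# Rough sketch (an assumption, not the paper's exact algorithm) of activation-informed
# merging for a single linear layer: calibration activations decide, per input channel,
# how strongly the task vector is applied, so channels the quick-thinking base model
# relies on most are disturbed the least.
import torch

def channel_importance(calib_acts: torch.Tensor) -> torch.Tensor:
    """calib_acts: (num_tokens, in_features) inputs to the layer on calibration prompts."""
    rms = calib_acts.float().pow(2).mean(dim=0).sqrt()   # per-input-channel activation RMS
    return rms / (rms.max() + 1e-8)                      # normalize to [0, 1]

def activation_informed_merge(w_base: torch.Tensor, w_expert: torch.Tensor,
                              calib_acts: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Merge one weight matrix of shape (out_features, in_features)."""
    tau = w_expert - w_base
    importance = channel_importance(calib_acts)          # (in_features,)
    # scale the task vector down on input channels that matter most to the base model
    scale = alpha * (1.0 - importance)
    return w_base + tau * scale.unsqueeze(0).to(tau.dtype)
```

In practice the calibration activations would be gathered with forward hooks over a handful of prompts; the point here is only to show how activation statistics can modulate merge strength.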
👥 Authors
Han Wu (Huawei Noah’s Ark Lab)
Yuxuan Yao (City University of Hong Kong)
Shuqi Liu (Huawei Noah’s Ark Lab)
Zehua Liu (Huawei Noah’s Ark Lab)
Xiaojin Fu (Huawei Noah’s Ark Lab)
Xiongwei Han (AI&OR Principal Researcher, Huawei Noah’s Ark Lab)
Xing Li (Huawei Noah’s Ark Lab)
Hui-Ling Zhen (Huawei, Hong Kong)
Tao Zhong (Huawei Noah’s Ark Lab)
Mingxuan Yuan (Huawei Noah’s Ark Lab)