Fusing LLM Capabilities with Routing Data

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches predominantly rely on a single large language model (LLM) and struggle to achieve both high performance and cost efficiency on complex tasks. Method: We propose FusionFactory, the first systematic LLM fusion framework, which identifies the relative strengths of diverse LLMs across heterogeneous tasks via model routing analysis and introduces a three-level fusion mechanism operating at the query, thought, and model levels. Leveraging routing-driven data mining, abstract thought-template extraction, reasoning-augmented output generation, and knowledge distillation, it constructs high-quality multi-model response-score training data. Contribution/Results: We release FusionBench, a comprehensive multi-domain benchmark for rigorous evaluation. Experiments show that FusionFactory consistently outperforms the best individual LLM across all 14 benchmark tasks, achieving significant performance gains while reducing token consumption, validating the effectiveness and scalability of systematic LLM fusion.
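The routing-analysis idea in the summary can be illustrated with a minimal sketch (Python; the log schema and all names here are illustrative assumptions, not the paper's implementation): given routing logs of per-model judge scores, a query-level router picks the model with the best average score on the query's task.

```python
from collections import defaultdict

def route(task: str, routing_logs: list[dict]) -> str:
    """Pick the model with the highest mean judge score on this task.

    `routing_logs` rows look like {"task": ..., "model": ..., "score": ...};
    this schema is a hypothetical stand-in for real routing data.
    """
    scores = defaultdict(list)
    for rec in routing_logs:
        if rec["task"] == task:
            scores[rec["model"]].append(rec["score"])
    if not scores:
        raise ValueError(f"no routing data for task {task!r}")
    return max(scores, key=lambda m: sum(scores[m]) / len(scores[m]))

logs = [
    {"task": "math", "model": "llm-a-8b", "score": 0.9},
    {"task": "math", "model": "llm-b-70b", "score": 0.7},
    {"task": "code", "model": "llm-b-70b", "score": 0.8},
]
print(route("math", logs))  # → llm-a-8b
```

In practice the per-task judge scores would be mined from the hosting platform's routing data rather than hard-coded.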

📝 Abstract
The rapid advancement of large language models (LLMs) has created a vibrant ecosystem of diverse architectures, each with unique strengths due to differences in design, training data, and objectives. However, most applications still rely on a single backend model, limiting coverage of capabilities and leading to inefficiencies in performance and token cost when tackling complex tasks. We highlight an underexploited opportunity: LLM routing data, produced when hosting platforms route diverse queries to different models, which can reveal comparative strengths across tasks. To address this, we propose FusionBench, a comprehensive routing benchmark covering 14 tasks across five domains with 20 open-source LLMs (8B to 671B parameters), capturing 103M tokens and summarizing reusable thought templates from top models. Building on this, we introduce FusionFactory, a systematic fusion framework with three levels: (1) query-level fusion, tailoring routers for each query using both direct responses and reasoning-augmented outputs; (2) thought-level fusion, leveraging abstract templates derived from top-performing LLMs' answers to similar queries; and (3) model-level fusion, transferring capabilities between models via distillation, using top responses or highest judge scores as training data. Experiments show FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with optimal fusion configurations varying by benchmark, demonstrating the value of systematic LLM fusion in harnessing complementary strengths and improving overall performance.
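Thought-level fusion, as described in the abstract, prepends a reusable thought template, summarized from top-performing LLMs' answers to similar queries, to the incoming prompt. A minimal sketch under that reading (the template text and function name are hypothetical):

```python
def augment_with_template(query: str, task: str, templates: dict[str, str]) -> str:
    """Prepend the task's thought template to the query, if one exists."""
    template = templates.get(task)
    if template is None:
        return query
    return f"{template}\n\nQuestion: {query}"

# Toy template; the paper summarizes such templates from top models' answers.
templates = {
    "math": "Restate the problem, solve it step by step, then verify the result.",
}
prompt = augment_with_template("What is 17 * 24?", "math", templates)
print(prompt.splitlines()[0])
```

A query for a task with no stored template passes through unchanged, which keeps the augmentation safe to apply uniformly.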
Problem

Research questions and friction points this paper is trying to address.

Leveraging LLM routing data to optimize model selection for diverse queries
Creating a benchmark for systematic fusion of multiple LLM capabilities
Improving performance by combining query, thought, and model-level fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

FusionBench: a routing benchmark covering 14 tasks across five domains with 20 open-source LLMs (8B to 671B parameters)
FusionFactory framework with three fusion levels
Query, thought, and model-level fusion strategies
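Of the three strategies above, model-level fusion trains a student model on the best routed responses. A hedged sketch of how such a distillation set might be assembled (field names are assumptions, not the released data schema): keep, per query, the response with the highest judge score, then emit prompt/completion pairs.

```python
def build_distillation_set(routing_logs: list[dict], min_score: float = 0.5) -> list[dict]:
    """Keep each query's highest-scoring response as a training pair."""
    best: dict[str, dict] = {}
    for rec in routing_logs:
        q = rec["query"]
        if q not in best or rec["score"] > best[q]["score"]:
            best[q] = rec
    return [
        {"prompt": r["query"], "completion": r["response"]}
        for r in best.values()
        if r["score"] >= min_score
    ]

logs = [
    {"query": "2+2?", "model": "a", "response": "4", "score": 0.95},
    {"query": "2+2?", "model": "b", "response": "5", "score": 0.10},
    {"query": "capital of France?", "model": "b", "response": "Paris", "score": 0.9},
]
print(len(build_distillation_set(logs)))  # → 2
```

The `min_score` threshold drops queries where even the best model scored poorly, so low-quality answers never enter the fine-tuning data.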