Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Training large language models (LLMs) incurs prohibitive computational costs, and while model souping offers a training-free performance boost, conventional uniform averaging ignores heterogeneity in model capabilities across task categories. Method: We propose Soup of Category Experts (SoCE), which first uses multi-dimensional benchmark results (spanning multilingual understanding, tool use, mathematical reasoning, and more) to identify domain-specialized expert models; it then exploits the low inter-category correlation of model performance to assign non-uniform, category-aware weights when averaging the experts' parameters. Contribution/Results: SoCE requires no additional training or fine-tuning, yet achieves state-of-the-art performance on the Berkeley Function Calling Leaderboard. It improves cross-domain robustness and overall capability, establishing an efficient model-souping recipe that leverages specialized expertise without extra training overhead.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping, the practice of averaging weights from multiple models of the same architecture, has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math, and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.
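The non-uniform weighted averaging at the core of souping can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: models are represented as dicts of flat parameter lists (a real version would operate on framework state dicts), and the 0.5/0.3/0.2 weights are made-up stand-ins for the optimized weights SoCE would fit.

```python
# Minimal sketch of non-uniform model souping: a weighted average of the
# parameters of several same-architecture models. All values below are
# illustrative, not from the paper.

def soup(models, weights):
    """Combine models (dicts of name -> list of floats) with per-model weights."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    souped = {}
    for name in models[0]:
        n_params = len(models[0][name])
        souped[name] = [
            sum(w * m[name][i] for m, w in zip(models, weights))
            for i in range(n_params)
        ]
    return souped

# Three same-architecture "models", each an expert for one category.
m_multilingual = {"layer.w": [1.0, 0.0]}
m_tools        = {"layer.w": [0.0, 1.0]}
m_math         = {"layer.w": [0.5, 0.5]}

souped = soup([m_multilingual, m_tools, m_math], [0.5, 0.3, 0.2])
print(souped["layer.w"])  # [0.6, 0.4]
```

Because souping happens purely in parameter space, the combined model costs nothing extra at inference time, which is what makes the approach training-free.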
Problem

Research questions and friction points this paper is trying to address.

Reducing resource-intensive LLM training requirements
Optimizing model souping through non-uniform weighted averaging
Improving performance across multilingual, tool calling, and math domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weighted averaging of expert models
Non-uniform averaging based on categories
Optimized weights for performance enhancement
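The expert-selection step behind these ideas can be sketched with a toy example. The score matrix, category names, and the simple `pearson` helper below are illustrative assumptions, not the paper's data or exact procedure: the point is that when category score profiles are weakly correlated across models, each category can nominate its own top-scoring model as an expert.

```python
# Hypothetical sketch of SoCE's expert identification: check that benchmark
# categories are weakly correlated across candidate models, then pick the
# best-scoring model per category as that category's "expert".

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Per-category benchmark scores for three candidate models (illustrative).
scores = {
    "multilingual": [0.80, 0.55, 0.60],
    "tool_use":     [0.50, 0.85, 0.55],
    "math":         [0.60, 0.50, 0.90],
}

# Weak (here even negative) correlation between category score profiles
# suggests the categories probe distinct specialties.
r = pearson(scores["multilingual"], scores["tool_use"])
print(f"multilingual vs tool_use correlation: {r:.2f}")

# Each weakly correlated category contributes its top-scoring model.
experts = {cat: vals.index(max(vals)) for cat, vals in scores.items()}
print(experts)  # {'multilingual': 0, 'tool_use': 1, 'math': 2}
```

The selected experts would then be merged with the non-uniform weighted averaging described above, rather than a uniform mean over all candidates.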