BTS: Harmonizing Specialized Experts into a Generalist LLM

📅 2025-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
How can multi-domain expert language models be efficiently integrated while preserving both generalization and domain specialization? This paper proposes the Branch-Train-Stitch (BTS) algorithm: first, domain-specific experts (e.g., for programming or mathematics) are pretrained independently; then, all experts and the seed backbone are frozen, and lightweight, learnable “stitch” layers are introduced to dynamically route over expert representations in a plug-and-play fashion. Because the constituent models are never altered, experts can be removed or added with only a small amount of additional training, ensuring modularity and cross-domain generalization. After fine-tuning the stitch layers on a small mixed-domain datamix, BTS significantly outperforms existing model-merging approaches across diverse downstream tasks while fully retaining each expert’s original domain performance, making it the first method to jointly enhance both general-purpose and specialized capabilities.

📝 Abstract
We present Branch-Train-Stitch (BTS), an efficient and flexible training algorithm for combining independently trained large language model (LLM) experts into a single, capable generalist model. Following Li et al., we start with a single seed language model which is branched into domain-specific (e.g., coding or math) experts with continual pretraining. BTS combines experts into a generalist model using lightweight stitch layers, which are inserted between frozen experts and the seed LLM, and trained on a small datamix of the expert domains. Stitch layers enable the seed LLM to integrate representations from any number of experts during the forward pass, allowing it to generalize to new domains, despite remaining frozen. Because BTS does not alter the constituent LLMs, BTS provides a modular and flexible approach: experts can be easily removed and new experts can be added with only a small amount of training. Compared to alternative model merging approaches, BTS yields the best generalist performance on a variety of downstream tasks, retaining the specialized capabilities of each of the experts.
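The core mechanism above — lightweight trainable layers that let a frozen seed LLM mix in representations from any number of frozen experts — can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the paper's implementation: the class name `StitchLayer`, the shared linear projection, and the learned per-expert router are all hypothetical design choices standing in for whatever the authors use.

```python
import torch
import torch.nn as nn

class StitchLayer(nn.Module):
    """Hypothetical stitch layer: projects frozen expert hidden states
    into the seed model's hidden space and mixes them with a learned
    router. Names and dimensions are illustrative, not from the paper."""

    def __init__(self, seed_dim: int, expert_dim: int):
        super().__init__()
        self.proj = nn.Linear(expert_dim, seed_dim)  # shared expert-to-seed projection
        self.router = nn.Linear(seed_dim, 1)         # scores each projected expert state

    def forward(self, seed_h: torch.Tensor, expert_hs: list) -> torch.Tensor:
        # seed_h: (batch, seq, seed_dim); expert_hs: list of (batch, seq, expert_dim)
        if not expert_hs:
            # No experts attached: the frozen seed passes through unchanged,
            # which is what makes expert removal possible without retraining.
            return seed_h
        projected = torch.stack([self.proj(h) for h in expert_hs])   # (E, B, S, seed_dim)
        scores = self.router(projected).squeeze(-1)                  # (E, B, S)
        weights = torch.softmax(scores, dim=0).unsqueeze(-1)         # (E, B, S, 1)
        # Residual update: seed representation plus a routed mix of experts.
        return seed_h + (weights * projected).sum(dim=0)
```

In training, only parameters of such stitch layers would receive gradients (`requires_grad_(False)` on the seed and expert models), matching the paper's setup where experts and seed stay frozen and only the small stitch layers are tuned on the mixed-domain datamix.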
Problem

Research questions and friction points this paper is trying to address.

Multi-specialty Language Models
Expertise Retention
Supermodel Integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Branch-Train-Stitch
Stitch layer
domain-specific expertise