AI Summary
This work addresses the challenge of fusing knowledge across individually fine-tuned language models when their original training data is unavailable. We propose a data-agnostic, parameter-space merging framework that optimizes merging weights via a prediction-consistency objective, enabling interpolation of heterogeneous models without access to any training samples. Unlike Fisher-weighted averaging or naive ensembling, our approach both preserves the performance of the individual models and yields collaborative gains, without requiring multi-task joint training. Experiments show that the fused model significantly outperforms both the individual models and conventional ensemble baselines on cross-domain and out-of-domain test sets, while being more efficient to obtain. To the best of our knowledge, this is the first method to achieve efficient and robust model fusion under fully dataless conditions, i.e., without any exposure to task-specific or domain-specific training data.
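Since the summary describes the merging objective only at a high level, the following is a minimal, illustrative sketch of prediction-consistency merging, not the paper's exact algorithm. It learns one interpolation coefficient per model so that the merged model's outputs stay close to each individual model's outputs on a batch of unlabeled probe inputs. The names `models`, `probe_inputs`, `n_steps`, and `lr` are illustrative assumptions, and the sketch assumes all state-dict entries are floating-point tensors and that each model maps a tensor of inputs to a tensor of predictions.

```python
# Hedged sketch: learn per-model merging coefficients by minimizing the squared
# difference between the merged model's predictions and each individual model's
# predictions on unlabeled probe inputs. Not the paper's exact algorithm.
import copy
import torch


def merge_by_prediction_consistency(models, probe_inputs, n_steps=200, lr=1e-2):
    """Return a merged state_dict interpolating `models` with learned coefficients."""
    # One scalar coefficient per model, softmax-normalized so they sum to 1.
    logits = torch.zeros(len(models), requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)

    # Cache each individual model's predictions (the consistency targets).
    with torch.no_grad():
        targets = [m(probe_inputs) for m in models]
        state_dicts = [m.state_dict() for m in models]

    # A module of the right architecture used only as a template for forward passes.
    template = copy.deepcopy(models[0])

    def merged_state(alphas):
        # Convex combination of all models in parameter space.
        return {
            name: sum(a * sd[name] for a, sd in zip(alphas, state_dicts))
            for name in state_dicts[0]
        }

    for _ in range(n_steps):
        alphas = torch.softmax(logits, dim=0)
        # Functional forward pass with the merged parameters.
        preds = torch.func.functional_call(template, merged_state(alphas), (probe_inputs,))
        # Prediction-consistency objective: stay close to every individual model.
        loss = sum(torch.mean((preds - t) ** 2) for t in targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        return merged_state(torch.softmax(logits, dim=0))
```

The coefficients here are shared across all layers purely for brevity; per-layer or per-parameter weights fit the same objective and only change the shape of `logits`.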
Abstract
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that both performs well across all the data set domains and can generalize to out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning that can preserve or sometimes improve over the performance of the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, thus making it applicable to a wider set of scenarios.
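For contrast with the dataless setting above, here is a hedged sketch of the Fisher-weighted averaging baseline mentioned in the abstract: each parameter is averaged across models with a per-parameter weight given by an empirical diagonal Fisher estimate (mean squared gradients). Note that, unlike the proposed method, this baseline requires some labeled examples per model to estimate the Fisher. The names `examples`, `loss_fn`, and `eps` are illustrative assumptions.

```python
# Hedged sketch of a Fisher-weighted averaging baseline (empirical diagonal
# Fisher approximated by mean squared gradients), not a reference implementation.
import torch


def diagonal_fisher(model, examples, loss_fn):
    """Estimate diagonal Fisher information as the mean of squared gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in examples:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(examples), 1) for n, f in fisher.items()}


def fisher_weighted_average(models, fishers, eps=1e-8):
    """Merge parameters with per-parameter Fisher weights (higher Fisher = more say)."""
    merged = {}
    for name in fishers[0]:
        num = sum(f[name] * m.state_dict()[name] for f, m in zip(fishers, models))
        den = sum(f[name] for f in fishers) + eps
        merged[name] = num / den
    return merged
```

The key practical difference is the data requirement: the Fisher weights must be estimated from per-model examples, whereas the prediction-consistency merging described above only needs the models themselves (plus any unlabeled probe inputs used to compare predictions).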