AI Summary
This work addresses the challenge of fusing knowledge across individually fine-tuned language models when their original training data is unavailable. We propose a data-agnostic, parameter-space merging framework that optimizes merging weights via a prediction-consistency objective, enabling interpolation of heterogeneous models without access to any training samples. Unlike Fisher-weighted averaging or naive ensembling, our approach both preserves the performance of the individual models and yields collaborative gains, without requiring multi-task joint training. Experiments show that the fused model significantly outperforms both the individual models and conventional ensemble baselines on cross-domain and out-of-domain test sets, while being more efficient to obtain. To the best of our knowledge, this is the first method to achieve efficient and robust model fusion under fully dataless conditions, i.e., without any exposure to task-specific or domain-specific training data.
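Since the summary describes the merging objective only at a high level, the following is a minimal, illustrative sketch of prediction-consistency merging, not the paper's exact algorithm. It learns one interpolation coefficient per model so that the merged model's outputs stay close to each individual model's outputs on a batch of unlabeled probe inputs. The names `models`, `probe_inputs`, `n_steps`, and `lr` are illustrative assumptions, and the sketch assumes all state-dict entries are floating-point tensors and that each model maps a tensor of inputs to a tensor of predictions.

```python
# Hedged sketch: learn per-model merging coefficients by minimizing the squared
# difference between the merged model's predictions and each individual model's
# predictions on unlabeled probe inputs. Not the paper's exact algorithm.
import copy
import torch


def merge_by_prediction_consistency(models, probe_inputs, n_steps=200, lr=1e-2):
    """Return a merged state_dict interpolating `models` with learned coefficients."""
    # One scalar coefficient per model, softmax-normalized so they sum to 1.
    logits = torch.zeros(len(models), requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)

    # Cache each individual model's predictions (the consistency targets).
    with torch.no_grad():
        targets = [m(probe_inputs) for m in models]
        state_dicts = [m.state_dict() for m in models]

    # A module of the right architecture used only as a template for forward passes.
    template = copy.deepcopy(models[0])

    def merged_state(alphas):
        # Convex combination of all models in parameter space.
        return {
            name: sum(a * sd[name] for a, sd in zip(alphas, state_dicts))
            for name in state_dicts[0]
        }

    for _ in range(n_steps):
        alphas = torch.softmax(logits, dim=0)
        # Functional forward pass with the merged parameters.
        preds = torch.func.functional_call(template, merged_state(alphas), (probe_inputs,))
        # Prediction-consistency objective: stay close to every individual model.
        loss = sum(torch.mean((preds - t) ** 2) for t in targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        return merged_state(torch.softmax(logits, dim=0))
```

The coefficients here are shared across all layers purely for brevity; per-layer or per-parameter weights fit the same objective and only change the shape of `logits`.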
Abstract
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that both performs well across all the data set domains and can generalize to out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning that can preserve or sometimes improve over the performance of the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, thus making it applicable to a wider set of scenarios.
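For contrast with the dataless setting above, here is a hedged sketch of the Fisher-weighted averaging baseline mentioned in the abstract: each parameter is averaged across models with a per-parameter weight given by an empirical diagonal Fisher estimate (mean squared gradients). Note that, unlike the proposed method, this baseline requires some labeled examples per model to estimate the Fisher. The names `examples`, `loss_fn`, and `eps` are illustrative assumptions.

```python
# Hedged sketch of a Fisher-weighted averaging baseline (empirical diagonal
# Fisher approximated by mean squared gradients), not a reference implementation.
import torch


def diagonal_fisher(model, examples, loss_fn):
    """Estimate diagonal Fisher information as the mean of squared gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in examples:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(examples), 1) for n, f in fisher.items()}


def fisher_weighted_average(models, fishers, eps=1e-8):
    """Merge parameters with per-parameter Fisher weights (higher Fisher = more say)."""
    merged = {}
    for name in fishers[0]:
        num = sum(f[name] * m.state_dict()[name] for f, m in zip(fishers, models))
        den = sum(f[name] for f in fishers) + eps
        merged[name] = num / den
    return merged
```

The key practical difference is the data requirement: the Fisher weights must be estimated from per-model examples, whereas the prediction-consistency merging described above only needs the models themselves (plus any unlabeled probe inputs used to compare predictions).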