🤖 AI Summary
To address the cost and inaccuracy of initializing a target-domain model from an ensemble of pretrained experts, this paper proposes a gradient-free method that learns expert mixture weights with a two-point Simultaneous Perturbation Stochastic Approximation (SPSA) update requiring only two forward passes per step. Departing from conventional heuristic weighting schemes—such as data-size-based blending or proxy-metric-based selection—the method initializes the target model as a convex combination of fixed experts and optimizes the mixture coefficients directly on the target loss, without any backpropagation. Evaluated across three datasets and three model architectures, it achieves substantial improvements: up to +8.5% over data-size weighting and +9.1% over proxy-metric selection, while matching (within 1.4%) or outperforming full-gradient-based mixture optimization. The approach significantly reduces computational overhead while improving cross-domain generalization.
📝 Abstract
In many deployed systems (multilingual ASR, cross-hospital imaging, region-specific perception), multiple pretrained specialist models coexist. Yet new target domains often require domain expansion: a generalized model that performs well beyond any single specialist's domain. Given such a new target domain, prior works seek a single strong initialization for the target model by first blending the expert models' parameters. However, heuristic blending -- using coefficients based on data size or proxy metrics -- often yields lower target-domain test accuracy, and learning the coefficients on the target loss typically requires computationally expensive full backpropagation through the network. We propose GLUE, Gradient-free Learning To Unify Experts, which initializes the target model as a convex combination of fixed experts and learns the mixture coefficients via a gradient-free two-point (SPSA) update that requires only two forward passes per step. Across experiments on three datasets and three network architectures, GLUE produces a single prior that can be fine-tuned effectively to outperform baselines. GLUE improves test accuracy by up to 8.5% over data-size weighting and by up to 9.1% over proxy-metric selection, and it either outperforms backpropagation-based full-gradient mixing or matches its performance within 1.4%.
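The core mechanism -- a convex combination of fixed expert parameters whose coefficients are learned with a two-point SPSA estimator -- can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`mix_experts`, `spsa_learn_mixture`), the softmax parameterization of the simplex, and the step sizes are all assumptions; the paper's actual update schedule and constraint handling may differ.

```python
import numpy as np

def softmax(z):
    """Map unconstrained logits to the probability simplex."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mix_experts(experts, alpha):
    """Convex combination of (flattened) expert parameter vectors."""
    return sum(a * w for a, w in zip(alpha, experts))

def spsa_learn_mixture(experts, loss_fn, steps=200, lr=0.1, c=0.05, seed=0):
    """Learn mixture coefficients with two-point SPSA.

    loss_fn(theta) -> scalar target-domain loss; each call is one
    forward pass, so every SPSA step costs exactly two forward passes
    and no backpropagation.
    """
    rng = np.random.default_rng(seed)
    z = np.zeros(len(experts))  # mixture logits
    for _ in range(steps):
        # Rademacher (+/-1) simultaneous perturbation of all logits.
        delta = rng.choice([-1.0, 1.0], size=z.shape)
        loss_plus = loss_fn(mix_experts(experts, softmax(z + c * delta)))
        loss_minus = loss_fn(mix_experts(experts, softmax(z - c * delta)))
        # Two-point gradient estimate; 1/delta_i == delta_i for +/-1 entries.
        g_hat = (loss_plus - loss_minus) / (2 * c) * delta
        z -= lr * g_hat
    return softmax(z)
```

As a toy usage, with two "experts" and a quadratic loss whose minimizer lies closer to the first expert, the learned coefficients concentrate on that expert while staying on the simplex:

```python
experts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
target = np.array([0.8, 0.2])
alpha = spsa_learn_mixture(experts, lambda th: float(((th - target) ** 2).sum()))
# alpha sums to 1 and weights the first expert more heavily
```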