🤖 AI Summary
This work addresses the I/O overhead of loading expert weights during large language model (LLM) merging, which at LLM scale is often the limiting resource. It casts model merging as a budget-constrained expert access-set selection problem and introduces MergePipe, an execution layer that indexes parameter blocks, builds deterministic access plans, and records replayable manifests so that merges run efficiently and reproducibly within an explicit I/O budget. For fixed-coefficient additive operators, the authors bound the omitted-update error by the norm of the omitted deltas. On Qwen and Llama merging workloads, MergePipe reduces expert-read I/O by up to an order of magnitude and achieves speedups of up to 11×; representative budget sweeps show parameter deviation on the order of 10⁻³ from full-read merges and no monotonic degradation on downstream benchmarks.
📝 Abstract
Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale the limiting resource is often the set of expert weights that must be read. We introduce MergePipe, a budget-aware execution layer that casts LLM merging as an \emph{expert access-set} problem: given a merge operator and a checkpoint family in a shared weight coordinate system, choose which expert delta blocks to access under an explicit I/O budget. MergePipe indexes parameter blocks, builds deterministic access plans, and executes the induced budgeted merge with replayable manifests. The plan is budget-sound by construction and recovers the full-read merge at full budget; for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of omitted deltas. Across Qwen and Llama merging workloads, MergePipe reduces expert-read I/O by up to an order of magnitude and achieves up to $11\times$ speedups. Representative budget sweeps show $O(10^{-3})$ parameter deviation from full-read merges and no monotonic degradation on downstream benchmarks.
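The access-set idea can be illustrated with a small sketch. This is not MergePipe's implementation: the function names, the greedy largest-norm-first selection rule, and the index layout (byte size and delta norm per block) are all assumptions made for illustration. The sketch shows a deterministic plan that never exceeds the byte budget, a fixed-coefficient additive merge over the selected blocks only, and a triangle-inequality bound on the deviation from the full-read merge.

```python
import numpy as np

def plan_access(index, budget_bytes):
    """Pick (expert, block) keys to read without exceeding the byte budget.

    Hypothetical rule: largest delta norm first, ties broken by key,
    so the same index and budget always give the same plan.
    """
    order = sorted(index, key=lambda k: (-index[k]["norm"], k))
    plan, used = [], 0
    for key in order:
        size = index[key]["bytes"]
        if used + size <= budget_bytes:
            plan.append(key)
            used += size
    return plan, used

def budgeted_merge(base, deltas, coeffs, plan):
    """Fixed-coefficient additive merge that touches only planned delta blocks."""
    merged = {block: w.copy() for block, w in base.items()}
    for expert, block in plan:
        merged[block] += coeffs[expert] * deltas[(expert, block)]
    return merged

def omitted_bound(index, coeffs, plan):
    """Triangle-inequality bound: sum of |coeff| * ||delta|| over omitted blocks."""
    taken = set(plan)
    return sum(abs(coeffs[e]) * index[(e, b)]["norm"]
               for (e, b) in index if (e, b) not in taken)

def deviation(a, b):
    """Global L2 distance between two block-wise parameter dictionaries."""
    return float(np.sqrt(sum(np.sum((a[k] - b[k]) ** 2) for k in a)))

# Toy data (illustrative only): two experts, three blocks of four weights each.
rng = np.random.default_rng(0)
blocks = ["layer0", "layer1", "layer2"]
experts = ["expert_a", "expert_b"]
base = {b: rng.standard_normal(4) for b in blocks}
deltas = {(e, b): 0.01 * rng.standard_normal(4) for e in experts for b in blocks}
coeffs = {"expert_a": 0.5, "expert_b": 0.5}

# The index holds only metadata, so planning needs no weight reads.
index = {k: {"bytes": d.nbytes, "norm": float(np.linalg.norm(d))}
         for k, d in deltas.items()}

total_bytes = sum(meta["bytes"] for meta in index.values())
full_plan, _ = plan_access(index, total_bytes)      # full budget: every block
full = budgeted_merge(base, deltas, coeffs, full_plan)

budget = total_bytes // 2                           # half the expert bytes
plan, used = plan_access(index, budget)
merged = budgeted_merge(base, deltas, coeffs, plan)

err = deviation(full, merged)
bound = omitted_bound(index, coeffs, plan)
print(f"read {used}/{total_bytes} bytes, deviation {err:.3e}, bound {bound:.3e}")
```

Storing the plan (the ordered list of keys) alongside the budget plays the role of a replayable manifest in this toy setting: rerunning `budgeted_merge` with the same plan reproduces the same merged weights.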