Weighted-Reward Preference Optimization for Implicit Model Fusion

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This paper addresses core challenges in fusing heterogeneous open-source large language models (LLMs): existing methods depend on vocabulary alignment and the merging of distribution matrices, procedures that are complex and prone to introducing noise and errors. To this end, it proposes WRPO (Weighted-Reward Preference Optimization), an implicit fusion method that requires neither vocabulary alignment nor matrix fusion. Its contributions are threefold: (i) the first preference-optimization-based paradigm for implicit capability transfer; (ii) a progressive adaptation strategy that mitigates distributional shift between the source and target models; and (iii) integration of multi-stage policy distillation with weighted reward modeling. With LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard, outperforming existing knowledge fusion methods and fine-tuning baselines.

📝 Abstract
While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs. Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard. Our code is available at https://github.com/SLIT-AI/WRPO.
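
The abstract describes the objective only in words, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes a DPO-style implicit reward, beta * log(pi_theta(y|x) / pi_ref(y|x)), and a preferred-side reward that blends a source-LLM response with a target-LLM response through a fusion weight alpha that ramps linearly from 0 during training. The function names, the linear schedule, and the default values of `alpha_max` and `beta` are hypothetical choices made for illustration.

```python
import math


def alpha_schedule(step: int, total_steps: int, alpha_max: float = 0.1) -> float:
    """Linearly ramp the fusion weight from 0 to alpha_max.

    Assumption: the schedule shape and alpha_max are illustrative only.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return alpha_max * progress


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO-style implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_reward_loss(
    src_chosen: tuple[float, float],
    tgt_chosen: tuple[float, float],
    rejected: tuple[float, float],
    alpha: float,
    beta: float = 0.01,
) -> float:
    """Preference loss with a weighted reward on the preferred side.

    Each argument is a (policy log-prob, reference log-prob) pair for one
    response: the preferred source-LLM response, the preferred target-LLM
    response, and the dispreferred response.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    r_src = implicit_reward(*src_chosen, beta)
    r_tgt = implicit_reward(*tgt_chosen, beta)
    r_rej = implicit_reward(*rejected, beta)
    # alpha = 0 relies only on the target LLM's preferred response;
    # raising alpha shifts weight toward the source LLMs' response.
    margin = alpha * r_src + (1.0 - alpha) * r_tgt - r_rej
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)
```

At `alpha = 0` the loss reduces to an ordinary pairwise preference loss on the target model's own preferred response, which matches the abstract's description of starting from the target LLM and gradually shifting reliance to the source LLMs. In a real training loop the log-probabilities would be summed token log-likelihoods from the policy and a frozen reference model, and `alpha_schedule(step, total_steps)` would supply `alpha` at each step.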
Problem

Research questions and friction points this paper is trying to address.

Fuses heterogeneous LLMs without vocabulary alignment.
Optimizes preference transfer between source and target LLMs.
Introduces progressive adaptation to reduce distributional deviations.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weighted-Reward Preference Optimization
Progressive adaptation strategy
Eliminates vocabulary alignment