Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration

πŸ“… 2025-06-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing heterogeneous large language model (LLM) fusion methods rely on limited real-world data and fixed cross-domain sampling ratios, leading to incomplete knowledge coverage and imbalanced cross-domain capability. This paper proposes a purely synthetic-data-driven fusion framework. First, it introduces a novel unsupervised multi-domain exploration mechanism based on a hierarchical knowledge tree for automatic domain discovery and organization. Second, it formulates domain expansion and sampling allocation as a hierarchical multi-armed bandit problem, integrating the DynaBranches dynamic policy with a sliding-window binomial likelihood ratio test (SWBLRT) for online capability tracking and adaptive sampling. Third, it designs an Introspection-Rebirth mechanism enabling collaborative synthetic data generation across multiple models. Experiments demonstrate substantial improvements in data efficiency, consistent outperformance over state-of-the-art methods across multiple benchmarks, near-elimination of cross-domain capability bias, and broad compatibility with diverse target LLMs.
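The sliding-window binomial likelihood ratio test mentioned above can be sketched as a standard windowed LRT on pass/fail feedback. This is a minimal illustration, not the paper's implementation: the function name `swblrt`, the fixed window of 0/1 outcomes, and the 5% chi-square(1) critical value (β‰ˆ3.84) are all assumptions made for the example.

```python
import math

def swblrt(window, p0, threshold=3.84):
    """Sliding-window binomial likelihood ratio test (illustrative sketch).

    window: recent pass/fail feedback as 1/0 values (assumed non-empty).
    p0: reference success rate under the null hypothesis.
    Returns True if the windowed success rate differs significantly from p0
    (2*log-LR compared against the chi-square(1) 5% critical value, ~3.84).
    """
    n = len(window)
    k = sum(window)
    p_hat = k / n  # MLE of the success rate within the window

    def log_lik(p):
        # Clamp p into (0, 1) to guard against log(0)
        p = min(max(p, 1e-12), 1 - 1e-12)
        return k * math.log(p) + (n - k) * math.log(1 - p)

    stat = 2.0 * (log_lik(p_hat) - log_lik(p0))
    return stat > threshold
```

For instance, 18 successes in a 20-sample window rejects a reference rate of 0.5, while an even 10/10 split does not.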

πŸ“ Abstract
Heterogeneous Large Language Model (LLM) fusion integrates the strengths of multiple source LLMs with different architectures into a target LLM with low computational overhead. While promising, existing methods suffer from two major limitations: 1) reliance on real data from limited domains for knowledge fusion, preventing the target LLM from fully acquiring knowledge across diverse domains, and 2) fixed data allocation proportions across domains, failing to dynamically adjust according to the target LLM's varying capabilities across domains, leading to a capability imbalance. To overcome these limitations, we propose Bohdi, a synthetic-data-only heterogeneous LLM fusion framework. Through the organization of knowledge domains into a hierarchical tree structure, Bohdi enables automatic domain exploration and multi-domain data generation through multi-model collaboration, thereby comprehensively extracting knowledge from source LLMs. By formalizing domain expansion and data sampling proportion allocation on the knowledge tree as a Hierarchical Multi-Armed Bandit problem, Bohdi leverages the designed DynaBranches mechanism to adaptively adjust sampling proportions based on the target LLM's performance feedback across domains. Integrated with our proposed Introspection-Rebirth (IR) mechanism, DynaBranches dynamically tracks capability shifts during the target LLM's updates via Sliding Window Binomial Likelihood Ratio Testing (SWBLRT), further enhancing its online adaptation capability. Comparative experimental results on a comprehensive suite of benchmarks demonstrate that Bohdi significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates the imbalance in the target LLM's capabilities. Our code is available at https://github.com/gjq100/Bohdi.git.
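As a rough illustration of casting domain selection on a knowledge tree as a hierarchical multi-armed bandit, the sketch below descends the tree level by level, treating each node's children as arms scored by UCB1. The `DomainNode`/`ucb_select` names and the plain UCB1 rule are assumptions for this example; the paper's DynaBranches mechanism is more elaborate (it also expands the tree and integrates SWBLRT-based tracking).

```python
import math

class DomainNode:
    """A node in a hypothetical knowledge tree; each child acts as a bandit arm."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.pulls = 0        # how often this domain was sampled
        self.reward_sum = 0.0 # accumulated performance feedback

def ucb_select(root, c=1.4):
    """Descend the tree, at each level picking the child with the highest
    UCB1 score; returns the chosen leaf domain."""
    node = root
    while node.children:
        total = sum(ch.pulls for ch in node.children) + 1
        def score(ch):
            if ch.pulls == 0:
                return float("inf")  # explore unvisited domains first
            mean = ch.reward_sum / ch.pulls
            return mean + c * math.sqrt(math.log(total) / ch.pulls)
        node = max(node.children, key=score)
    return node

def update(leaf, reward):
    """Record performance feedback (e.g., a pass/fail score) for a leaf."""
    leaf.pulls += 1
    leaf.reward_sum += reward
```

In use, one would repeatedly call `ucb_select` to pick a domain for synthetic data generation, evaluate the target LLM on the generated samples, and feed the result back via `update`, so that weaker domains receive proportionally more sampling.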
Problem

Research questions and friction points this paper is trying to address.

Overcomes reliance on limited real data for LLM fusion
Dynamically adjusts data allocation across knowledge domains
Enhances target LLM's capability balance and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic-data-only heterogeneous LLM fusion framework
Hierarchical Multi-Armed Bandit for adaptive data sampling
DynaBranches mechanism with performance feedback tracking
Junqi Gao
Shanghai AI Lab, Harbin Institute of Technology
Deep Learning Β· Generative Models Β· Continual Learning
Zhichang Guo
School of Mathematics, Harbin Institute of Technology
Dazhi Zhang
School of Mathematics, Harbin Institute of Technology
Dong Li
Shanghai Artificial Intelligence Laboratory, School of Mathematics, Harbin Institute of Technology
Runze Liu
Tsinghua Shenzhen International Graduate School, Tsinghua University
Pengfei Li
School of Mathematics, Harbin Institute of Technology
Kai Tian
Department of Electronic Engineering, Tsinghua University
Biqing Qi
Shanghai Artificial Intelligence Laboratory