InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address a key limitation of existing logit-level fusion methods for heterogeneous open-source large language models—their neglect of lexical semantic dependencies—this paper proposes a structure-aware logits fusion framework. Methodologically, it (1) constructs a logits co-activation graph to model joint activation patterns across vocabulary dimensions; (2) introduces Graph-on-Logits Distillation (GLD), a novel distillation loss operating on the logits graph; and (3) pioneers the integration of Gromov–Wasserstein (GW) distance into logits fusion, accompanied by an *O*(*n* log *n*) sorting-based closed-form approximation that relaxes the standard independence assumption across logits dimensions. The framework combines top-*k* outer-product aggregation with optimal-transport optimization. Evaluated on 11 reasoning, programming, and mathematical benchmarks, it achieves state-of-the-art performance across all tasks—e.g., +35.6 on Multistep Arithmetic and +37.06 on Causal Judgement over SFT—demonstrating substantial gains in multi-step and relational reasoning.
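The co-activation graph described above—keep the top-*k* logits at each output position, take their outer products, and aggregate across the sequence—can be sketched as below. The function name, the choice to keep raw logit values as node weights, and the uniform aggregation are all illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def coactivation_graph(logits, k=4):
    """Sketch of a global co-activation graph over vocabulary channels.

    logits: (T, V) array of logits for a length-T sequence over vocab size V.
    Returns a symmetric (V, V) adjacency matrix: nodes are vocabulary
    channels, and edge (i, j) accumulates joint activation mass of channels
    i and j across positions, restricted to each position's top-k logits.
    (Illustrative sketch; the paper's exact weighting may differ.)
    """
    T, V = logits.shape
    A = np.zeros((V, V))
    for t in range(T):
        idx = np.argpartition(logits[t], -k)[-k:]  # indices of top-k channels
        v = np.zeros(V)
        v[idx] = logits[t, idx]                    # sparse top-k logit vector
        A += np.outer(v, v)                        # aggregate outer products
    return A
```

Since each term is an outer product of a vector with itself, the resulting adjacency matrix is symmetric by construction, matching the undirected-graph reading of "edges quantify joint activations".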

📝 Abstract
Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose **InfiGFusion**, the first structure-aware fusion framework with a novel *Graph-on-Logits Distillation* (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov–Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
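The abstract's sorting-based closed-form approximation is not spelled out on this page, but a closely related classical fact gives the flavor: for points on the real line with uniform weights, the optimal Gromov–Wasserstein coupling is attained by either the monotone (sorted) or the anti-monotone matching, so the coupling itself can be found by sorting in $O(n \log n)$. The sketch below illustrates that generic idea only; it is not the paper's estimator, and the naive objective evaluation here is $O(n^2)$ for clarity.

```python
import numpy as np

def gw_1d_sorted(x, y):
    """Sorting-based surrogate for 1D Gromov-Wasserstein, uniform weights.

    Sorting both point sets yields the candidate optimal couplings
    (monotone and anti-monotone) in O(n log n); we evaluate the GW
    objective for both and take the smaller. The pairwise-distance
    evaluation below is naive O(n^2), kept simple for illustration.
    """
    xs, ys = np.sort(x), np.sort(y)

    def cost(a, b):
        # GW objective under the identity matching of a_i with b_i:
        # mean over pairs (i, j) of (|a_i - a_j| - |b_i - b_j|)^2
        da = np.abs(a[:, None] - a[None, :])
        db = np.abs(b[:, None] - b[None, :])
        return float(((da - db) ** 2).mean())

    return min(cost(xs, ys), cost(xs, ys[::-1]))
```

Because GW compares intra-space distances rather than the points themselves, the surrogate is invariant to translating either point set, which is easy to check numerically.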
Problem

Research questions and friction points this paper is trying to address.

Fusing heterogeneous models to unify complementary strengths
Modeling semantic dependencies in logit-based fusion methods
Reducing computational cost of Gromov-Wasserstein distance for scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-on-Logits Distillation for semantic dependencies
Efficient Gromov-Wasserstein approximation for scalability
Global co-activation graph for model fusion