Probabilistic Token Alignment for Large Language Model Fusion

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language model (LLM) fusion approaches rely on manually predefined vocabulary alignment, limiting adaptability to diverse contextual settings and hindering fusion efficacy. To address this, we propose a probabilistic token alignment framework grounded in optimal transport, which formulates alignment as a soft mapping problem between token distributions—enabling automatic, interpretable, and architecture-agnostic alignment across heterogeneous models. Our method integrates distribution-aware learning with probabilistic mapping modeling, eliminating manual intervention while achieving fine-grained, semantics-preserving token-level matching and supporting end-to-end parameter fusion. Extensive evaluations across multiple benchmarks demonstrate that the fused models consistently outperform baselines in reasoning, commonsense understanding, and linguistic comprehension, validating the method’s effectiveness, robustness, and generalizability. The implementation is publicly available.

📝 Abstract
Training large language models (LLMs) from scratch can yield models with unique functionalities and strengths, but it is costly and often leads to redundant capabilities. A more cost-effective alternative is to fuse existing pre-trained LLMs with different architectures into a more powerful model. However, a key challenge in existing model fusion is its dependence on manually predefined vocabulary alignment, which may not generalize well across diverse contexts, leading to performance degradation on several evaluations. To solve this, we draw inspiration from distribution learning and propose a probabilistic token alignment method as a general, soft mapping for alignment, named PTA-LLM. Our approach innovatively reformulates token alignment into a classic mathematical problem: optimal transport, seamlessly leveraging distribution-aware learning to facilitate more coherent model fusion. Apart from its inherent generality, PTA-LLM exhibits interpretability from a distributional perspective, offering insights into the essence of token alignment. Empirical results demonstrate that probabilistic token alignment enhances the target model's performance across multiple capabilities. Our code is available at https://runjia.tech/neurips_pta-llm/.
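The core idea of casting token alignment as entropy-regularized optimal transport can be sketched with a few Sinkhorn iterations. This is a minimal illustration of the general technique, not the paper's implementation: the embeddings are random toy data, and the function name, cost choice (cosine distance), and hyperparameters are assumptions for the example.

```python
import numpy as np

def sinkhorn_alignment(cost, reg=0.1, n_iters=100):
    """Soft token-alignment matrix via entropy-regularized optimal
    transport (Sinkhorn iterations).

    cost : (n, m) pairwise cost between source and target tokens.
    Returns a transport plan of shape (n, m) whose rows/columns
    approximately match uniform marginals over the two vocabularies.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass over source tokens
    b = np.full(m, 1.0 / m)   # uniform mass over target tokens
    K = np.exp(-cost / reg)   # Gibbs kernel from the cost matrix
    u = np.ones(n)
    for _ in range(n_iters):  # alternate marginal-matching scalings
        v = b / (K.T @ u)
        u = a / (K @ v)
    return np.diag(u) @ K @ np.diag(v)

# Toy example: cost from cosine distance between token embeddings
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))   # 4 source-vocabulary embeddings (toy)
tgt = rng.normal(size=(5, 8))   # 5 target-vocabulary embeddings (toy)
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
cost = 1.0 - src @ tgt.T        # cosine distance

plan = sinkhorn_alignment(cost)
# Each source token's unit of mass spreads softly over target tokens,
# rather than being forced into a single hard, predefined match.
```

Unlike a manually predefined one-to-one vocabulary mapping, the resulting plan distributes each source token's probability mass across several target tokens, which is the "soft mapping" the abstract refers to.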
Problem

Research questions and friction points this paper is trying to address.

Fusing existing pre-trained LLMs with different architectures effectively
Overcoming manual vocabulary alignment limitations in model fusion
Achieving coherent token alignment across diverse language contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic token alignment for model fusion
Reformulates alignment as optimal transport problem
Uses distribution-aware learning for coherent fusion