🤖 AI Summary
Existing approaches struggle to balance the high cost of general-purpose large models against the limited generalization of specialized small models. To address this, the authors propose FusionRoute, a framework in which a lightweight token-level router dynamically selects the best-suited expert model at each decoding step, while a learnable complementary generator additively refines the selected expert's logits. This design moves beyond routing mechanisms that rely solely on fixed expert outputs, expanding the effective decoding policy space and enabling approximation of the optimal policy under weaker assumptions. Experiments on the Llama-3 and Gemma-2 model families show that FusionRoute consistently outperforms existing collaboration, merging, and fine-tuning methods across mathematical reasoning, code generation, and instruction following, while remaining competitive with domain-specific experts.
📝 Abstract
Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. Conversely, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both the Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms sequence- and token-level collaboration methods, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
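The per-step fusion the abstract describes (pick one expert, then additively correct its logits) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the argmax routing rule, and the shapes of the router and complementary-generator outputs are all assumptions for the sketch.

```python
# Hypothetical sketch of FusionRoute-style token-level fusion.
# Assumes each of K experts exposes next-token logits over a shared
# vocabulary of size V, and a lightweight router produces (a) a score
# per expert and (b) a complementary logit vector of size V.
import numpy as np

def fused_next_token_logits(expert_logits, router_scores, complementary_logits):
    """expert_logits: (K, V); router_scores: (K,); complementary_logits: (V,)."""
    k = int(np.argmax(router_scores))            # select the top-scoring expert
    return expert_logits[k] + complementary_logits  # additive logit refinement

def softmax(x):
    z = x - x.max()                              # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# Toy usage: 2 experts over a 5-token vocabulary (all values illustrative).
rng = np.random.default_rng(0)
expert_logits = rng.normal(size=(2, 5))
router_scores = np.array([0.2, 0.8])             # router prefers expert 1
complementary = rng.normal(scale=0.1, size=5)    # small learned correction
probs = softmax(fused_next_token_logits(expert_logits, router_scores, complementary))
next_token = int(np.argmax(probs))               # greedy decode of one step
```

In practice the router and complementary generator would be trained jointly, and the theoretical claim in the abstract is precisely that the additive term lets this policy class reach distributions no fixed expert (and hence no expert-only router) can produce.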