MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Mixture-of-Experts (MoE) models lack a unified, modular framework for systematic construction and analysis. Method: This paper introduces MixtureKit, an open-source, modular framework enabling non-intrusive integration of three MoE paradigms into arbitrary pre-trained or fine-tuned models (e.g., Hugging Face Transformers): classical MoE, fine-grained routing via BTX (Branch-Train-Mix), and expert freezing with learnable stitching via BTS (Branch-Train-Stitch). MixtureKit implements the BTX and BTS architectures with support for cross-model and cross-layer dynamic expert routing, differentiable routing mechanisms, token-level expert-assignment visualization, and attention-weight-based contribution analysis, complemented by an interactive web interface. Results: On Arabic–Latin code-mixing tasks, BTX achieves gains of 2.3–4.7 BLEU/accuracy points over same-scale dense baselines. The framework is publicly released to advance MoE research and deployment across domains.

📝 Abstract
We introduce MixtureKit, a modular open-source framework for constructing, training, and analyzing Mixture-of-Experts (MoE) models from arbitrary pre-trained or fine-tuned models. MixtureKit currently supports three complementary methods: (i) Traditional MoE, which uses a single router per transformer block to select experts, (ii) BTX (Branch-Train-Mix), which introduces separate routers for each specified sub-layer, enabling fine-grained token routing, and (iii) BTS (Branch-Train-Stitch), which keeps experts fully intact and introduces trainable stitch layers for controlled information exchange between hub and experts. MixtureKit automatically modifies the model configuration, patches decoder and causal LM classes, and saves a unified checkpoint ready for inference or fine-tuning. We further provide a visualization interface to inspect per-token routing decisions, expert weight distributions, and layer-wise contributions. Experiments with multilingual code-switched data (e.g., Arabic–Latin) show that a BTX-based model trained using MixtureKit can outperform baseline dense models on multiple benchmarks. We release MixtureKit as a practical foundation for research and development of MoE-based systems across diverse domains.
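The "Traditional MoE" variant described in the abstract uses a single router per transformer block to select experts. A minimal PyTorch sketch of such a top-k routed feed-forward layer is shown below; the class name, layer shapes, and the top-2 default are illustrative assumptions, not MixtureKit's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """One router per block selects among expert FFNs (Traditional MoE sketch)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        logits = self.router(x)                     # (B, S, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)        # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[..., slot] == e          # tokens whose slot went to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

The per-token softmax over only the selected experts keeps the routing differentiable with respect to the router weights, which is what makes end-to-end training of the router possible.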
Problem

Research questions and friction points this paper is trying to address.

Develops a framework for constructing and training Mixture-of-Experts models from existing models
Provides methods for fine-grained token routing and controlled information exchange between experts
Offers visualization tools to analyze routing decisions and expert contributions in models
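The per-token routing inspection mentioned above can be illustrated with a small sketch that maps each token to its top-1 expert given raw router logits; the function name and the toy tensors are hypothetical, not part of MixtureKit's interface.

```python
import torch

def token_expert_assignments(router_logits, tokens):
    """Pair each token with its argmax expert index from raw router logits."""
    top1 = router_logits.argmax(dim=-1)  # (seq,) top-1 expert per token
    return list(zip(tokens, top1.tolist()))

# Toy example: 3 tokens routed over 3 experts.
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.3, 1.5,  0.0],
                       [0.0, 0.2,  3.1]])
print(token_expert_assignments(logits, ["the", "cat", "sat"]))
# [('the', 0), ('cat', 1), ('sat', 2)]
```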
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular framework for composing, training, and visualizing MoE models
Supports three methods: Traditional MoE, BTX, and BTS routing
Automatically modifies configurations and provides unified checkpoints for inference
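BTS, as described in the abstract, keeps the experts fully intact (frozen) and learns only stitch layers that exchange information between the hub and expert streams. A minimal sketch of one trainable stitch projection follows, assuming a hub stream and an expert stream of possibly different hidden widths; all names are hypothetical and do not reflect MixtureKit's actual classes.

```python
import torch
import torch.nn as nn

class StitchLayer(nn.Module):
    """Trainable projection injecting a frozen expert's hidden state into the hub."""

    def __init__(self, d_expert: int, d_hub: int):
        super().__init__()
        self.proj = nn.Linear(d_expert, d_hub)
        # Zero-initialized gate: at the start of training the hub stream
        # passes through unchanged, so stitching cannot hurt the base model.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hub_h, expert_h):
        return hub_h + torch.tanh(self.gate) * self.proj(expert_h)

# Experts stay frozen; only stitch layers (and the hub) receive gradients, e.g.:
# for p in expert_model.parameters():
#     p.requires_grad_(False)
```

The zero-initialized gate is one common way to make such adapters start as an identity map; whether MixtureKit uses this particular initialization is an assumption.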
Ahmad Chamma
MBZUAI
Omar El Herraoui
MBZUAI
Guokan Shang
MBZUAI-IFM Paris Lab