From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Balancing inference efficiency and accuracy remains challenging for pre-trained large language models (LLMs). To address this, we propose DynaMoE: a dynamic sparse Mixture-of-Experts (MoE) framework that transforms a dense LLM via lightweight post-training into a token-adaptive MoE architecture. DynaMoE introduces a token-level difficulty-aware routing mechanism that dynamically assigns each token to expert subnetworks of varying capacity based on predicted computational difficulty. Crucially, only a single, lightweight fine-tuning phase (∼10B tokens) is required to generate multiple model variants with tunable accuracy–latency trade-offs. Our key innovations include the first token-level difficulty prediction router, a hierarchical expert sizing design, and a low-overhead adaptation strategy. Experiments show that DynaMoE achieves downstream task accuracy comparable to Flextron at merely 1/9 the fine-tuning cost, while significantly enhancing controllability and fine-grained adjustability of the throughput–accuracy Pareto frontier.

📝 Abstract
Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with sensitivity control to customize the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of tokens and directs them to the appropriate sub-networks or experts, enabling larger experts to handle more complex tokens and smaller experts to process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants of the existing trained LLM with a single fine-tuning step, utilizing only $10B$ tokens, a minimal cost compared to the base model's training. Each variant offers distinct trade-offs between accuracy and performance. Compared to the baseline post-training optimization framework, Flextron, our method achieves similar aggregated accuracy across downstream tasks, despite using only $\frac{1}{9}$th of their fine-tuning cost.
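The routing idea in the abstract can be sketched in a few lines. The toy example below is an illustrative assumption, not the paper's implementation: the real router is a learned difficulty predictor and the experts are trained subnetworks, while here the difficulty score is a simple sigmoid of the token's mean activation and the hierarchically sized experts are slices of one weight matrix. All names (`predict_difficulty`, `route`, the thresholds) are hypothetical.

```python
# Illustrative sketch of token-difficulty-driven routing (DynaMoE-style).
# Assumptions: a toy difficulty score stands in for the learned router,
# and "experts" of increasing capacity are slices of one FFN matrix.
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

W_full = rng.normal(size=(d_model, d_model))
experts = {
    "small": W_full[:, : d_model // 4],   # low-capacity expert
    "medium": W_full[:, : d_model // 2],  # mid-capacity expert
    "large": W_full,                      # full-capacity expert
}

def predict_difficulty(token_vec):
    """Toy difficulty score in (0, 1); the paper uses a learned predictor."""
    return float(1.0 / (1.0 + np.exp(-token_vec.mean())))

def route(token_vec, thresholds=(0.4, 0.6)):
    """Send easy tokens to small experts, hard tokens to larger ones."""
    d = predict_difficulty(token_vec)
    if d < thresholds[0]:
        name = "small"
    elif d < thresholds[1]:
        name = "medium"
    else:
        name = "large"
    return name, token_vec @ experts[name]

# Route a batch of random tokens; each token may take a different path,
# which is what makes the model's compute per token dynamic.
tokens = rng.normal(size=(5, d_model))
assignments = [route(t)[0] for t in tokens]
print(assignments)
```

The thresholds act as the "sensitivity control" mentioned in the abstract: shifting them toward 1.0 pushes more tokens through small experts (higher throughput, lower accuracy), and toward 0.0 does the opposite.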
Problem

Research questions and friction points this paper is trying to address.

Optimizes efficiency-accuracy trade-offs in LLMs
Adapts pre-trained LLMs to token-difficulty-driven models
Reduces fine-tuning cost for dynamic model adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-difficulty-aware router
Mixture-of-Experts adaptation
Minimal fine-tuning cost