Mixture of Lookup Experts

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
MoE models reduce computational cost during inference via sparse expert activation but require all expert parameters to reside permanently in VRAM, leading to high memory footprint and deployment constraints. This paper proposes Mixture of Lookup Experts (MoLE), a novel architecture wherein experts are pre-trained as feed-forward networks (FFNs) and subsequently reparameterized into static, input-ID-driven lookup tables (LUTs), enabling zero-FLOP expert output retrieval. MoLE fully decouples training from inference, supporting fine-grained expert offloading and storage-level indexed access. Consequently, it substantially alleviates VRAM bottlenecks and offloading latency. Under identical FLOPs and memory budgets, MoLE achieves inference throughput comparable to dense models and superior to offloading-based MoE variants, while maintaining accuracy on par with standard MoE.

📝 Abstract
Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input IDs, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve the expert's computation results based on input IDs and load them into VRAM, so the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE.
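The key trick described in the abstract can be sketched in a few lines: because each expert's input is just the embedding-layer output for a token, and that output is fully determined by the token ID, the expert can be precomputed once per vocabulary entry and replaced by a table lookup. The sketch below is a minimal illustration of that reparameterization; the shapes, the two-layer ReLU FFN form, and the single-expert setup are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, d_ff = 100, 16, 32  # toy sizes (assumption)

# Embedding table and one expert FFN, as they exist during training.
embedding = rng.standard_normal((vocab_size, d_model))
w1 = rng.standard_normal((d_model, d_ff))
w2 = rng.standard_normal((d_ff, d_model))

def expert_ffn(x):
    """Train-time view: the expert runs on the embedding-layer output."""
    return np.maximum(x @ w1, 0.0) @ w2  # two-layer ReLU FFN (assumption)

# Re-parameterization before inference: since the expert's input depends
# only on the token ID (one embedding row), precompute its output for
# every ID in the vocabulary. The weights w1/w2 are no longer needed.
lut = expert_ffn(embedding)  # shape: (vocab_size, d_model)

# Inference: zero-FLOP retrieval by input ID instead of running the FFN.
token_ids = np.array([3, 41, 7])
retrieved = lut[token_ids]

# The lookup reproduces the on-the-fly expert computation exactly.
assert np.allclose(retrieved, expert_ffn(embedding[token_ids]))
```

In the full architecture the LUT rows would live on a storage device and only the rows for the current input IDs would be fetched into VRAM, which is why the per-token communication cost is negligible.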
Problem

Research questions and friction points this paper is trying to address.

Reduces VRAM usage in Mixture-of-Experts models
Minimizes inference latency by avoiding expert computations
Enables efficient offloading of experts to storage devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

MoLE re-parameterizes experts as lookup tables
Reduces VRAM usage and communication overhead
Achieves fast inference speeds comparable to dense models