OLMoE: Open Mixture-of-Experts Language Models

📅 2024-09-03
🏛️ arXiv.org
📈 Citations: 32
Influential: 5
📄 PDF
🤖 AI Summary
This work addresses the challenge of building efficient, scalable language models by proposing OLMoE-1B-7B, a fully open sparse Mixture-of-Experts (MoE) model: it has 7 billion total parameters but activates only 1 billion per input token, is pretrained on 5 trillion tokens, and comes with an instruction-tuned variant, OLMoE-1B-7B-Instruct. Methodologically, the work emphasizes end-to-end openness for MoE models—releasing weights, training data, code, and training logs—and presents systematic experiments on MoE training together with a routing analysis showing high expert specialization. Empirically, OLMoE-1B-7B outperforms all available models with similar active parameter counts and even surpasses larger models such as Llama2-13B-Chat and DeepSeekMoE-16B, supporting the "small activation, large capacity, full openness" paradigm.
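To make the "small activation, large capacity" arithmetic concrete, the sketch below estimates total versus active expert parameters for a generic top-k MoE. The layer count, widths, and expert counts are illustrative placeholders chosen to land near the 7B-total / 1B-active regime, not OLMoE's published configuration.

```python
# Back-of-the-envelope: how a sparse MoE reaches ~7B total parameters while
# activating only ~1B per token. All figures are illustrative placeholders,
# not the published OLMoE config.
n_layers  = 16      # transformer blocks
d_model   = 2048    # model width
d_expert  = 1024    # hidden width of each expert FFN
n_experts = 64      # experts per MoE layer
top_k     = 8       # experts activated per token

per_expert = 3 * d_model * d_expert           # gated FFN: up, gate, down projections
total_expert_params  = n_layers * n_experts * per_expert
active_expert_params = n_layers * top_k * per_expert

print(f"expert params, total:  {total_expert_params / 1e9:.2f}B")   # ~6.44B
print(f"expert params, active: {active_expert_params / 1e9:.2f}B")  # ~0.81B
# Attention and embedding weights (shared by every token) add the remainder.
```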

📝 Abstract
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
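The routing analysis mentioned in the abstract concerns how each token is assigned to a small subset of experts. Below is a minimal, self-contained sketch of top-k token-choice routing in PyTorch; it illustrates the general mechanism rather than OLMoE's exact implementation, and all sizes are placeholder values.

```python
# Minimal sketch of top-k ("token choice") routing in a sparse MoE layer.
# Illustrative only: placeholder sizes, not OLMoE's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=16, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # per-token routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (n_tokens, d_model); each token picks its k highest-scoring experts.
        probs = F.softmax(self.router(x), dim=-1)          # (n_tokens, n_experts)
        weights, idx = probs.topk(self.k, dim=-1)          # (n_tokens, k)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route 5 tokens of width 16 through the layer.
layer = TopKMoELayer()
y = layer(torch.randn(5, 16))
print(y.shape)  # torch.Size([5, 16])
```

Only the k selected expert FFNs run for a given token, which is why compute scales with active rather than total parameters; the paper's analysis of this routing is what reveals the high expert specialization noted above.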
Problem

Research questions and friction points this paper is trying to address.

Dense language models activate every parameter for every token, making strong models costly to train and serve.
Sparse MoE models promise better performance per unit of compute, but most competitive MoE models are closed, with weights, data, code, and logs rarely all released.
Can a model that activates only 1B parameters per token match or surpass much larger dense models?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Mixture-of-Experts (MoE) architecture: 7B total parameters, 1B active per token
Pretrained on 5 trillion tokens, with an instruction-tuned variant (OLMoE-1B-7B-Instruct)
Fully open release: model weights, training data, code, and training logs
👥 Authors

Niklas Muennighoff
Stanford University
large language models · artificial intelligence · machine learning

Luca Soldaini
Allen Institute for AI
Large Language Models · Open Source AI · Information Retrieval

Dirk Groeneveld
Allen Institute for Artificial Intelligence
natural language processing · neural networks · deep learning

Kyle Lo
Allen Institute for AI
natural language processing · machine learning · human computer interaction · statistics

Jacob Daniel Morrison
Allen Institute for AI

Sewon Min
UC Berkeley EECS & Allen Institute for AI
Natural Language Processing · Machine Learning

Weijia Shi
University of Washington
Natural Language Processing · Machine Learning

Pete Walsh
Allen Institute for AI

Oyvind Tafjord
Allen Institute for AI

Nathan Lambert
Research Scientist, Allen AI
Reinforcement Learning · Machine Learning · Robotics · Responsible AI

Yuling Gu
Allen Institute for AI

Shane Arora
Allen Institute for AI

Akshita Bhagia
Allen Institute for AI

Dustin Schwenk
Allen Institute for AI

David Wadden
Google DeepMind
Natural Language Processing · Machine Learning

Alexander Wettig
Princeton University
Natural Language Processing

Binyuan Hui
Qwen Team, Alibaba Group
Large Language Models · CodeLLMs · Reasoning · Agent

Tim Dettmers
Allen Institute for AI; Carnegie Mellon University
Deep Learning · Natural Language Processing

Douwe Kiela
Contextual AI, Stanford University
Natural Language Processing · Machine Learning · Artificial Intelligence

Ali Farhadi
University of Washington

Noah A. Smith
University of Washington; Allen Institute for Artificial Intelligence
natural language processing · machine learning · computational social science · computer music

Pang Wei Koh
University of Washington; Allen Institute for AI
Machine learning · Natural language processing · Computational biology

Amanpreet Singh
Contextual AI

Hanna Hajishirzi
University of Washington