No Need to Talk: Asynchronous Mixture of Language Models

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high inter-node communication overhead, reliance on full-corpus clustering for routing, and excessive parameter activation during inference in Mixture-of-Experts (MoE) language models, this paper proposes SMALLTALK LM: a sparse mixture architecture enabled by near-asynchronous distributed training. Its core innovation is a lightweight prefix-aware router that selects an expert solely from a short prefix of the input sequence, eliminating the need for global corpus clustering or auxiliary metadata. Experts are trained asynchronously and specialize in distinct regions of the data distribution. For the same total training FLOPs, SMALLTALK LM achieves significantly lower perplexity than dense baselines. At inference, its computational cost matches that of a dense model while activating only a fraction of the mixture's total parameters, and it outperforms same-scale dense baselines on 75% of downstream tasks, demonstrating both training efficiency and practical deployability.
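The routing scheme described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the names (`embed_prefix`, `centroids`, `route`) and the toy mean-of-embeddings router are assumptions; the key property shown is that a single expert is selected from a short prefix, so only that expert's parameters are active at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM, N_EXPERTS, PREFIX_LEN = 100, 16, 4, 8

# Toy "lightweight router": average random token embeddings over the prefix.
token_emb = rng.normal(size=(VOCAB, DIM))

def embed_prefix(tokens):
    # Only the first PREFIX_LEN tokens are seen by the router.
    return token_emb[tokens[:PREFIX_LEN]].mean(axis=0)

# One centroid per expert in the router's embedding space (illustrative).
centroids = rng.normal(size=(N_EXPERTS, DIM))

def route(tokens):
    """Return the index of the single expert chosen for this sequence."""
    z = embed_prefix(tokens)
    return int(np.argmax(centroids @ z))  # nearest expert by dot product

tokens = rng.integers(0, VOCAB, size=32)
expert_id = route(tokens)  # the full sequence is then processed by this one expert
```

Because routing depends only on the prefix, the decision needs no full-corpus clustering at inference time, and the cost of a forward pass is that of one expert rather than the whole mixture.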

📝 Abstract
We introduce SMALLTALK LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a fraction of the parameters from the overall mixture model. Unlike prior works on asynchronous LLM training, our routing method does not rely on full corpus clustering or access to metadata, making it more suitable for real-world applications. Our experiments on language modeling demonstrate that SMALLTALK LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost. Finally, in our downstream evaluations we outperform the dense baseline on 75% of the tasks.
Problem

Research questions and friction points this paper is trying to address.

Training a single large model across nodes requires high-bandwidth inter-node communication
Prior asynchronous LLM training routes via full-corpus clustering or metadata, which is impractical in real-world settings
Dense models activate all parameters at inference, inflating compute cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Near-asynchronous training of a mixture of language models, each specializing in part of the data distribution
Lightweight prefix-based router that directs each sequence to a single expert
Routing requires no full-corpus clustering or access to metadata