AI Summary
This work addresses the hardware inefficiency and training instability of existing sparse language models employing dynamic hard routing (e.g., Mixture of Experts), as well as the lack of contextual awareness in conventional embedding tables. The authors propose the Large Lookup Layer (L³), which extends the embedding table paradigm to the decoder layer for the first time, utilizing a static, token-based routing mechanism that aggregates embeddings in a learnable, context-aware manner. L³ integrates an information-theoretic embedding assignment algorithm, enhancing model expressivity while preserving inference efficiency. It further enables rapid training and zero-overhead CPU offloading. At a scale of 2.6B activated parameters, L³ significantly outperforms both MoE and dense baselines of comparable sparsity on language modeling and downstream tasks.
Abstract
Modern sparse language models typically achieve sparsity through Mixture-of-Experts (MoE) layers, which dynamically route tokens to dense MLP "experts." However, dynamic hard routing has a number of drawbacks, such as potentially poor hardware efficiency and the need for auxiliary losses to stabilize training. In contrast, the tokenizer embedding table, which is natively sparse, largely avoids these issues by selecting a single embedding per token, at the cost of lacking contextual information. In this work, we introduce the Large Lookup Layer (L$^3$), which unlocks a new axis of sparsity by generalizing embedding tables to model decoder layers. L$^3$ layers use static token-based routing to aggregate a set of learned embeddings per token in a context-dependent way, allowing the model to efficiently balance memory and compute by caching information in embeddings. L$^3$ has two main components: (1) a systems-friendly architecture that allows for fast training and CPU-offloaded inference with no overhead, and (2) an information-theoretic embedding allocation algorithm that effectively balances speed and quality. We empirically test L$^3$ by training transformers with up to 2.6B active parameters and find that L$^3$ strongly outperforms both dense models and iso-sparse MoEs in both language modeling and downstream tasks.
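To make the core idea concrete, here is a minimal NumPy sketch of a lookup layer in the spirit described above: each token id statically indexes a fixed bank of learned embeddings (no learned router, so no auxiliary load-balancing loss), and the current hidden state produces context-dependent weights that mix the bank. All names and dimensions (`embed_bank`, `gate_proj`, `k`, `d_model`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only -- not from the paper.
vocab_size, k, d_model = 100, 4, 8

# Static routing: each token id owns a fixed bank of k embeddings.
# The lookup is a pure index, so the bank can live in CPU memory
# and be fetched per token without a routing network.
embed_bank = rng.standard_normal((vocab_size, k, d_model))

# A small projection turns the hidden state into mixing weights,
# making the aggregation context-dependent.
gate_proj = rng.standard_normal((d_model, k))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lookup_layer(token_ids, hidden):
    """Token ids select embedding banks (static, sparse);
    the hidden state mixes each bank (contextual)."""
    banks = embed_bank[token_ids]                   # (seq, k, d_model)
    weights = softmax(hidden @ gate_proj)           # (seq, k)
    return np.einsum("sk,skd->sd", weights, banks)  # (seq, d_model)

tokens = np.array([3, 17, 3])                # token 3 appears twice
hidden = rng.standard_normal((3, d_model))   # different contexts
out = lookup_layer(tokens, hidden)
print(out.shape)  # (3, 8)
```

Note that the same token (id 3) yields different outputs in different contexts, even though its embedding bank is fixed; that is the context-dependent aggregation the abstract refers to, sketched here under assumed shapes.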