Scaling Embeddings Outperforms Scaling Experts in Language Models

📅 2026-01-29
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the diminishing returns and system bottlenecks encountered by Mixture-of-Experts (MoE) architectures in sparse scaling of language models. It proposes embedding-layer expansion as an orthogonal alternative to expert expansion, leveraging large-scale embedding growth, strategic parameter budget allocation, and co-design of model width and depth. Combined with speculative decoding and tailored system-level optimizations, this approach substantially enhances inference efficiency. The resulting 68.5B-parameter LongCat-Flash-Lite model—activating approximately 3B parameters per forward pass—outperforms comparable MoE models on agent-based tasks and code generation, achieving a superior Pareto frontier in the trade-off between performance and efficiency.

📝 Abstract
While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy, ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B-parameter model with ~3B activated parameters, trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.
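The core accounting behind embedding scaling can be illustrated with a short sketch: an embedding table contributes vocab_size × embed_dim weights to the total parameter count, yet a forward pass only reads one row per token, so total parameters grow without a proportional increase in activated compute. All configuration numbers below are illustrative assumptions, not the paper's actual LongCat-Flash-Lite settings.

```python
def embedding_params(vocab_size: int, embed_dim: int) -> int:
    # Total weights stored in the embedding table.
    return vocab_size * embed_dim

def activated_embedding_params(seq_len: int, embed_dim: int) -> int:
    # Weights actually touched in one forward pass:
    # a single table row is gathered per input token.
    return seq_len * embed_dim

# Illustrative (assumed) configuration:
vocab, dim, seq = 200_000, 8_192, 4_096

total = embedding_params(vocab, dim)        # 1.64B stored
active = activated_embedding_params(seq, dim)  # 33.55M touched
print(f"total embedding params: {total / 1e9:.2f}B")
print(f"activated per forward:  {active / 1e6:.2f}M")
```

The gap between the two numbers is the sparsity the paper exploits: scaling the table up raises model capacity while the per-token lookup cost stays a single-row gather.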
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
sparsity scaling
diminishing returns
system bottlenecks
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

embedding scaling
sparsity
Mixture-of-Experts
Pareto frontier
speculative decoding
Authors

Hong Liu, Meituan LongCat Team
Jiaqi Zhang, Meituan LongCat Team
Chao Wang, Ke Holdings Inc. (AI / VR / Cloud / Data Security)
Xing Hu, Meituan LongCat Team
Linkun Lyu, Meituan LongCat Team
Jiaqi Sun, Carnegie Mellon University (causality, graph representation learning)
Xurui Yang, Meituan LongCat Team
Bo Wang, Meituan LongCat Team
Fengcun Li, Meituan LongCat Team
Yulei Qian, Meituan LongCat Team
Lingtong Si, Meituan LongCat Team
Yerui Sun, Meituan LongCat Team
Rumei Li, Meituan LongCat Team
Peng Pei, Meituan LongCat Team
Yuchen Xie, Meituan LongCat Team
Xunliang Cai, Meituan LongCat Team