JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance scaling limitations of conventional large language models, which are constrained by the tight coupling between computation and model capacity. While sparse architectures like Mixture of Experts (MoE) decouple these factors, they incur substantial memory and hardware overhead. The authors propose treating token indexing as a fourth scaling dimension—orthogonal to width, depth, and data scale—by introducing modulation vectors retrieved from an auxiliary embedding table into Transformer layers. These vectors modulate the main network via lightweight element-wise operations, significantly enhancing model capacity with negligible increase in FLOPs. The proposed Joint-Token (JTok) and Mixture of JTok (JTok-M) mechanisms exhibit predictable power-law scaling behavior. Experiments demonstrate consistent gains across model sizes from 650M to 61B parameters, improving MMLU, ARC, and CEval scores by 4.1, 8.3, and 8.9 points, respectively. At equivalent performance, JTok-M reduces compute requirements by 35% compared to standard MoE while maintaining minimal runtime overhead.

📝 Abstract
LLMs have traditionally scaled along dense dimensions, where performance is coupled with near-linear increases in computational cost. While MoE decouples capacity from compute, it introduces large memory overhead and hardware efficiency challenges. To overcome these limitations, we propose token-indexed parameters as a novel, orthogonal scaling axis that decouples model capacity from FLOPs. Specifically, we introduce Joint-Token (JTok) and Mixture of Joint-Token (JTok-M), which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight, element-wise operations, incurring negligible FLOPs overhead. Extensive experiments on both dense and MoE backbones, spanning from 650M (190M + 460M embedding) to 61B (17B + 44B embedding) total parameters, demonstrate that our approach consistently reduces validation loss and significantly improves downstream task performance (e.g., +4.1 on MMLU, +8.3 on ARC, +8.9 on CEval). Rigorous isoFLOPs analysis further confirms that JTok-M fundamentally shifts the quality-compute Pareto frontier, achieving comparable model quality with 35% less compute than vanilla MoE architectures, and we validate that token-indexed parameters exhibit predictable power-law scaling behavior. Moreover, our efficient implementation ensures that the overhead introduced by JTok and JTok-M remains marginal.
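The core idea in the abstract (per-token modulation vectors looked up from an auxiliary embedding table and applied element-wise to backbone activations) can be illustrated with a minimal NumPy sketch. All names, shapes, and the `1 + mod` gating form below are illustrative assumptions for exposition, not the authors' actual JTok implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 100, 16                # toy sizes (assumed for illustration)
token_ids = np.array([3, 17, 42, 99])        # input token indices for 4 positions
hidden = rng.standard_normal((4, d_model))   # backbone activations at some layer

# Auxiliary embedding table: adds capacity proportional to its size,
# but retrieval is a pure lookup (gather), not a matrix multiply.
mod_table = rng.standard_normal((vocab_size, d_model))

# Retrieve per-token modulation vectors and apply them element-wise.
mod = mod_table[token_ids]                   # (4, d_model) gather, negligible FLOPs
out = hidden * (1.0 + mod)                   # lightweight element-wise modulation

assert out.shape == hidden.shape
```

This is why capacity scales with the table size while compute barely moves: growing `vocab_size` (or adding tables per layer) adds parameters, but the per-token cost stays one gather plus one element-wise product, independent of the table's size.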
Problem

Research questions and friction points this paper is trying to address.

scaling law
token embedding
model capacity
computational efficiency
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

token-indexed parameters
scaling law
Joint-Token
compute-efficient modulation
orthogonal scaling axis
Yebin Yang
School of AI, Shanghai Jiao Tong University
Huaijin Wu
Shanghai Jiao Tong University
Machine Learning, AI for Drug Design
Fu Guo
Hi Lab, Xiaohongshu Inc.
Lin Yao
Hi Lab, Xiaohongshu Inc.
Xiaohan Qin
School of AI, Shanghai Jiao Tong University
Jingzhi Wang
School of AI, Shanghai Jiao Tong University
Debing Zhang
Xiaohongshu
Machine Learning, Computer Vision, Deep Learning
Junchi Yan
FIAPR & ICML Board Member, SJTU (2018-), SII (2024-), AWS (2019-2022), IBM (2011-2018)
Computational Intelligence, AI4Science, Machine Learning, Autonomous Driving