🤖 AI Summary
This work addresses the performance scaling limitations of conventional large language models, whose computational cost is tightly coupled to model capacity. While sparse architectures such as Mixture of Experts (MoE) decouple these factors, they incur substantial memory and hardware overhead. The authors instead treat token indexing as a fourth scaling dimension (orthogonal to width, depth, and data scale): Transformer layers retrieve modulation vectors from an auxiliary embedding table and apply them to the main network via lightweight element-wise operations, significantly enhancing model capacity with a negligible increase in FLOPs. The proposed Joint-Token (JTok) and Mixture of JTok (JTok-M) mechanisms exhibit predictable power-law scaling behavior. Experiments demonstrate consistent gains across model sizes from 650M to 61B parameters, improving MMLU, ARC, and CEval scores by 4.1, 8.3, and 8.9 points, respectively. At equivalent quality, JTok-M reduces compute requirements by 35% compared to standard MoE while keeping runtime overhead minimal.
📝 Abstract
LLMs have traditionally scaled along dense dimensions, where performance gains are coupled with near-linear increases in computational cost. While MoE decouples capacity from compute, it introduces large memory overhead and hardware-efficiency challenges. To overcome these limitations, we propose token-indexed parameters as a novel, orthogonal scaling axis that decouples model capacity from FLOPs. Specifically, we introduce Joint-Token (JTok) and Mixture of Joint-Token (JTok-M), which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight, element-wise operations, incurring negligible FLOPs overhead. Extensive experiments on both dense and MoE backbones, spanning 650M (190M backbone + 460M embedding) to 61B (17B backbone + 44B embedding) total parameters, demonstrate that our approach consistently reduces validation loss and significantly improves downstream task performance (e.g., +4.1 on MMLU, +8.3 on ARC, +8.9 on CEval). Rigorous isoFLOPs analysis further confirms that JTok-M fundamentally shifts the quality-compute Pareto frontier, achieving comparable model quality with 35% less compute relative to vanilla MoE architectures, and we validate that token-indexed parameters exhibit predictable power-law scaling behavior. Moreover, our efficient implementation keeps the overhead introduced by JTok and JTok-M marginal.
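To make the core idea concrete, the following is a minimal sketch of a token-indexed modulation layer. The class name, initialization scheme, and gating form are illustrative assumptions, not the paper's actual implementation; the sketch only demonstrates the mechanism the abstract describes: each token id indexes a row of an auxiliary embedding table, and that row modulates the hidden states element-wise, so parameter count grows with the table while the added compute per token stays at one multiply per activation.

```python
import numpy as np

class TokenIndexedModulation:
    """Illustrative sketch (names and init are assumptions, not the paper's code).

    A token id selects one row of an auxiliary embedding table; that row
    gates the layer's hidden states element-wise. Capacity scales with the
    table size, while the FLOPs added per token are only O(d_model).
    """

    def __init__(self, vocab_size: int, d_model: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One modulation vector per token id, initialized near 1.0 so the
        # layer starts close to an identity mapping.
        self.table = 1.0 + 0.01 * rng.standard_normal((vocab_size, d_model))

    def __call__(self, hidden: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
        # hidden: (seq_len, d_model); token_ids: (seq_len,)
        # Gather is a memory lookup, not a matmul; the only arithmetic is
        # one element-wise multiply per activation.
        return hidden * self.table[token_ids]

# Usage: modulate a 4-token sequence in an 8-dimensional toy model.
mod = TokenIndexedModulation(vocab_size=100, d_model=8)
h = np.ones((4, 8))
out = mod(h, np.array([3, 17, 3, 42]))
```

Note that positions 0 and 2 carry the same token id, so they receive identical modulation vectors: the extra parameters are indexed purely by token identity, which is what makes this axis orthogonal to width and depth.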