🤖 AI Summary
This work addresses the limitations of existing industrial-scale recommendation models when scaled to extreme sizes, which suffer from suboptimal architectures, low hardware utilization, vanishing gradients, and incomplete MoE sparsification. To overcome these challenges, the authors propose TokenMixer-Large, an architecture that combines a mixing-and-reverting operation, inter-layer residual connections, a sparse per-token Mixture-of-Experts (MoE), and an auxiliary loss to stabilize gradient propagation and remove parameter-scaling bottlenecks in deep networks. The model scales to 7 billion parameters in online deployment and 15 billion in offline experiments, and has been deployed across multiple ByteDance scenarios, with reported gains of +1.66% in e-commerce orders, +2.98% in per-capita preview payment GMV, +2.0% in advertising ADSS, and +1.4% in live-streaming revenue.
📝 Abstract
While scaling laws for recommendation models have gained significant traction, existing architectures such as Wukong, HiFormer, and DHEN often struggle with sub-optimal designs and hardware under-utilization, limiting their practical scalability. Our previous TokenMixer architecture (introduced in the RankMixer paper) addressed effectiveness and efficiency by replacing self-attention with a lightweight token-mixing operator; however, it faced critical bottlenecks in deeper configurations, including sub-optimal residual paths, vanishing gradients, incomplete MoE sparsification, and constrained scalability. In this paper, we propose TokenMixer-Large, a systematically evolved architecture designed for extreme-scale recommendation. By introducing a mixing-and-reverting operation, inter-layer residuals, and an auxiliary loss, we ensure stable gradient propagation even as model depth increases. Furthermore, we incorporate a Sparse Per-token MoE to enable efficient parameter expansion. TokenMixer-Large scales to 7 billion parameters on online traffic and 15 billion in offline experiments. Currently deployed in multiple scenarios at ByteDance, TokenMixer-Large has achieved significant offline and online performance gains: +1.66% in orders and +2.98% in per-capita preview payment GMV for e-commerce, +2.0% in ADSS for advertising, and +1.4% in revenue for live streaming.
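The abstract does not spell out how the Sparse Per-token MoE routes tokens, so the following is only a minimal sketch of generic top-k sparse expert routing applied to one token at a time, not the paper's implementation. The function names (`sparse_moe_token`, `make_matrix`), the softmax router, the single-linear-map experts, and all sizes are assumptions chosen for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def sparse_moe_token(x, router, experts, top_k):
    """Send one token vector to its top_k experts and mix their outputs.

    Assumption: the router is a linear map followed by softmax, and each
    expert is a single linear map. Only the chosen experts are evaluated,
    which is what makes the layer sparse.
    """
    gate_probs = softmax(matvec(router, x))  # one probability per expert
    ranked = sorted(range(len(experts)), key=lambda i: gate_probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(gate_probs[i] for i in chosen)  # renormalize over chosen experts
    out = [0.0] * len(experts[0])
    for i in chosen:
        y = matvec(experts[i], x)
        weight = gate_probs[i] / norm
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, chosen


def make_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, num_experts, top_k = 4, 8, 2
router = make_matrix(num_experts, dim, rng)
experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(3)]
outputs = [sparse_moe_token(t, router, experts, top_k) for t in tokens]
```

In this sketch each token activates only `top_k` of the `num_experts` experts, so total parameters grow with the expert count while per-token compute grows only with `top_k`. That is the general sense in which sparse MoE allows efficient parameter expansion.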