TokenMixer-Large: Scaling Up Large Ranking Models in Industrial Recommenders

📅 2026-02-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing industrial recommendation models at extreme scale: suboptimal architectures, low hardware utilization, vanishing gradients, and incomplete MoE sparsification. It proposes TokenMixer-Large, an architecture that combines a mixing-and-reverting operation, inter-layer residual connections, an auxiliary loss, and a Sparse Per-token Mixture-of-Experts (MoE) to stabilize gradient propagation in deep networks and to expand parameters efficiently. The model scales to 7 billion parameters on online traffic and 15 billion in offline experiments, and is deployed in multiple ByteDance scenarios, with reported gains of +1.66% in e-commerce orders, +2.98% in per-capita preview payment GMV, +2.0% in advertising ADSS, and +1.4% in live-streaming revenue.

📝 Abstract

While scaling laws for recommendation models have gained significant traction, existing architectures such as Wukong, HiFormer, and DHEN often struggle with sub-optimal designs and hardware under-utilization, limiting their practical scalability. Our previous TokenMixer architecture (introduced in the RankMixer paper) addressed effectiveness and efficiency by replacing self-attention with a lightweight token-mixing operator; however, it faced critical bottlenecks in deeper configurations, including sub-optimal residual paths, vanishing gradients, incomplete MoE sparsification, and constrained scalability. In this paper, we propose TokenMixer-Large, a systematically evolved architecture designed for extreme-scale recommendation. By introducing a mixing-and-reverting operation, inter-layer residuals, and an auxiliary loss, we ensure stable gradient propagation even as model depth increases. Furthermore, we incorporate a Sparse Per-token MoE to enable efficient parameter expansion. TokenMixer-Large successfully scales to 7 billion parameters on online traffic and 15 billion parameters in offline experiments. Currently deployed in multiple scenarios at ByteDance, TokenMixer-Large has achieved significant offline and online performance gains: +1.66% in orders and +2.98% in per-capita preview payment GMV for e-commerce, +2.0% in ADSS for advertising, and +1.4% in revenue for live streaming.
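
The abstract names a mixing-and-reverting operation without spelling it out. As a rough illustration only, the sketch below assumes the mixing step is a parameter-free head shuffle (each token is split into `num_heads` slices, and slice h of every token is concatenated into mixed token h) and that reverting is its exact inverse, restoring the original token layout so a residual connection can line up with the input. The function names `mix_tokens` and `revert_tokens` and the plain-list representation are hypothetical and not taken from the paper.

```python
def mix_tokens(tokens, num_heads):
    """Assumed mixing step: regroup slices across tokens.

    tokens: list of T tokens, each a list of D numbers (D divisible by num_heads).
    Returns num_heads mixed tokens, each of length T * (D // num_heads).
    """
    dim = len(tokens[0])
    if dim % num_heads != 0:
        raise ValueError("token dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    mixed = []
    for h in range(num_heads):
        new_token = []
        for token in tokens:
            # Take slice h of every original token, in token order.
            new_token.extend(token[h * head_dim:(h + 1) * head_dim])
        mixed.append(new_token)
    return mixed


def revert_tokens(mixed, num_tokens):
    """Assumed reverting step: the exact inverse of mix_tokens.

    mixed: list of H mixed tokens, each of length num_tokens * head_dim.
    Returns num_tokens tokens in their original layout.
    """
    length = len(mixed[0])
    if length % num_tokens != 0:
        raise ValueError("mixed token length must be divisible by num_tokens")
    head_dim = length // num_tokens
    tokens = []
    for t in range(num_tokens):
        token = []
        for mixed_token in mixed:
            # Slice t of each mixed token came from original token t.
            token.extend(mixed_token[t * head_dim:(t + 1) * head_dim])
        tokens.append(token)
    return tokens
```

Under these assumptions, `revert_tokens(mix_tokens(x, H), T)` returns `x` unchanged for any `T`-token input, which is the property that would let a residual path add a block's output back onto tokens in their original positions.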

Problem

Research questions and friction points this paper is trying to address.

large ranking models

industrial recommenders

scalability

vanishing gradients

MoE sparsification

Innovation

Methods, ideas, or system contributions that make the work stand out.

TokenMixer-Large

Sparse Per-token MoE

mixing-and-reverting
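
To make the Sparse Per-token MoE entry above more concrete, here is a minimal sketch assuming that every token position owns its own router and its own pool of experts, and that only the top-k experts per token are evaluated. The routing rule (dot-product logits, softmax over the selected experts), the function name `sparse_per_token_moe`, and the data layout are assumptions for illustration; the paper's actual gating and auxiliary loss are not reproduced here.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_per_token_moe(tokens, routers, experts, top_k):
    """Hypothetical per-token sparse MoE layer.

    tokens:  list of T vectors, each a list of D floats.
    routers: routers[t] is a list of E weight vectors (length D) for token t.
    experts: experts[t][e] is a callable mapping a vector to a vector.
    top_k:   number of experts evaluated per token.
    """
    outputs = []
    for t, x in enumerate(tokens):
        # Router logits for this token position only (no sharing across tokens).
        logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in routers[t]]
        # Keep the top_k experts; the rest are never evaluated (the sparsity).
        order = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        chosen = order[:top_k]
        gates = softmax([logits[e] for e in chosen])
        y = [0.0] * len(x)
        for gate, e in zip(gates, chosen):
            expert_out = experts[t][e](x)
            y = [y_i + gate * o_i for y_i, o_i in zip(y, expert_out)]
        outputs.append(y)
    return outputs
```

With `top_k` smaller than the number of experts, compute per token stays roughly constant while the total parameter count grows with the expert pool, which is the general motivation for sparse MoE layers.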