🤖 AI Summary
Industrial recommendation systems face critical challenges, including poor scalability of ranking models, low GPU Model FLOPs Utilization (MFU) (only 4.5%), and difficulty balancing low latency with high queries-per-second (QPS) requirements. To address these, we propose RankMixer—a hardware-aware, efficient ranking architecture that replaces self-attention with a multi-head token-mixing module, employs per-token feed-forward networks to model feature subspaces and cross-subspace interactions, and integrates a dynamic-routing Sparse Mixture-of-Experts (Sparse-MoE) for billion-parameter scaling. This design significantly improves GPU parallel efficiency and computational density. Evaluated on trillion-scale production data, RankMixer achieves 45% MFU—10× higher than the baseline—while scaling model parameters by 100× without increasing inference latency. After full deployment, it yields +0.2% in user active days and +0.5% in average session duration.
📝 Abstract
Recent progress on large language models (LLMs) has spurred interest in scaling up recommendation systems, yet two practical obstacles remain. First, training and serving costs in industrial recommenders must respect strict latency bounds and high QPS demands. Second, most human-designed feature-crossing modules in ranking models were inherited from the CPU era and fail to exploit modern GPUs, resulting in low Model FLOPs Utilization (MFU) and poor scalability. We introduce RankMixer, a hardware-aware model design tailored toward a unified and scalable feature-interaction architecture. RankMixer retains the transformer's high parallelism while replacing quadratic self-attention with a multi-head token-mixing module for higher efficiency. In addition, RankMixer models both distinct feature subspaces and cross-feature-space interactions through per-token FFNs. We further extend it to one billion parameters with a Sparse-MoE variant for higher ROI, adopting a dynamic routing strategy to address the insufficient and imbalanced training of experts. Experiments show RankMixer's superior scaling ability on a trillion-scale production dataset. By replacing previously diverse handcrafted low-MFU modules with RankMixer, we boost model MFU from 4.5% to 45% and scale ranking-model parameters by 100x while maintaining roughly the same inference latency. We verify RankMixer's universality with online A/B tests across three core application scenarios (recommendation, advertisement, and search). Finally, we launch the 1B-dense-parameter RankMixer for full-traffic serving without increasing serving cost, improving user active days by 0.2% and total in-app usage duration by 0.5%.
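The two core components the abstract names—multi-head token mixing and per-token FFNs—can be sketched in a few lines. The sketch below is a minimal numpy illustration of the general idea under assumed shapes, not the paper's implementation: token mixing splits each feature-subspace token into heads and regroups the same head across all tokens (exchanging information without quadratic self-attention), and per-token FFNs give each mixed token its own MLP parameters. All sizes and helper names are hypothetical.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the paper): T tokens
# (feature-subspace embeddings) of dimension D, split into H heads.
rng = np.random.default_rng(0)
T, D, H = 4, 8, 4            # D must be divisible by H
x = rng.normal(size=(T, D))  # token matrix: one row per feature subspace

def token_mix(x, H):
    """Multi-head token mixing: split each token into H heads, then
    regroup head h from every token into a new mixed token. Cross-token
    information exchange via reshape/transpose, with no attention matrix."""
    T, D = x.shape
    heads = x.reshape(T, H, D // H)      # (T, H, d_head)
    mixed = heads.transpose(1, 0, 2)     # (H, T, d_head)
    return mixed.reshape(H, T * (D // H))  # H mixed tokens

def per_token_ffn(tokens, weights):
    """Per-token FFNs: each mixed token is processed by its own 2-layer
    MLP, so distinct feature subspaces get distinct parameters."""
    outs = []
    for t, (w1, w2) in zip(tokens, weights):
        outs.append(np.maximum(t @ w1, 0.0) @ w2)  # ReLU MLP per token
    return np.stack(outs)

mixed = token_mix(x, H)      # (H, T*D/H) == (4, 8) mixed tokens
d_tok = mixed.shape[1]
weights = [(rng.normal(size=(d_tok, 16)), rng.normal(size=(16, d_tok)))
           for _ in range(H)]
out = per_token_ffn(mixed, weights)  # (4, 8), one output per mixed token
```

Because both steps are plain reshapes and batched matrix multiplies, they map onto dense GPU kernels with high arithmetic intensity, which is consistent with the MFU gains the abstract reports.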