SORT: A Systematically Optimized Ranking Transformer for Industrial-scale Recommenders

📅 2026-03-04
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the performance bottlenecks of Transformer-based models in industrial-scale recommendation systems, which stem from high feature sparsity and low label density. To overcome these challenges, the authors propose a systematic optimization framework incorporating request-centric sampling, localized attention mechanisms, query pruning, and generative pretraining, alongside module-level enhancements to tokenization, multi-head attention (MHA), and feed-forward networks (FFN). The proposed approach substantially improves training stability and model capacity while enabling efficient hardware utilization. Online A/B experiments demonstrate significant gains in key business metrics—orders increased by 6.35%, buyers by 5.97%, and GMV by 5.47%—alongside a 44.67% reduction in inference latency and a 121.33% improvement in throughput.
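The paper does not ship code, so the snippet below is only a minimal PyTorch sketch of the "localized attention" idea the summary names: each position attends to keys within a fixed window of itself, trimming the quadratic cost of full attention over long behavior sequences. The function name `local_attention`, the tensor shapes, and the `window` parameter are illustrative assumptions, not SORT's published design.

```python
import torch
import torch.nn.functional as F

def local_attention(q, k, v, window: int):
    # q, k, v: (batch, seq_len, dim). Each query attends only to keys
    # within `window` positions of itself, so effective cost scales
    # with seq_len * window rather than seq_len ** 2.
    b, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (b, n, n)
    idx = torch.arange(n, device=q.device)
    # True where |i - j| > window, i.e. outside the local neighborhood.
    outside = (idx[None, :] - idx[:, None]).abs() > window
    scores = scores.masked_fill(outside, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                 # (b, n, d)
```

A production kernel would avoid materializing the full (n, n) score matrix; the dense mask here only keeps the sketch short.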

📝 Abstract
While Transformers have achieved remarkable success in LLMs through superior scalability, their application in industrial-scale ranking models remains nascent, hindered by the challenges of high feature sparsity and low label density. In this paper, we propose SORT (Systematically Optimized Ranking Transformer), a scalable model designed to bridge the gap between Transformers and industrial-scale ranking models. We address the high feature sparsity and low label density challenges through a series of optimizations, including request-centric sample organization, local attention, query pruning, and generative pre-training. Furthermore, we introduce a suite of refinements to the tokenization, multi-head attention (MHA), and feed-forward network (FFN) modules, which collectively stabilize the training process and enlarge the model capacity. To maximize hardware efficiency, we optimize our training system to elevate the model FLOPs utilization (MFU) to 22%. Extensive experiments demonstrate that SORT outperforms strong baselines and exhibits excellent scalability across data size, model size, and sequence length, while remaining flexible in integrating diverse features. Finally, online A/B testing in large-scale e-commerce scenarios confirms that SORT achieves significant gains in key business metrics, including orders (+6.35%), buyers (+5.97%), and GMV (+5.47%), while simultaneously halving latency (-44.67%) and doubling throughput (+121.33%).
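The abstract also credits query pruning for part of the latency win. As a rough, hypothetical illustration of that idea (none of these names or shapes come from the paper), attention can be restricted to the query rows that actually need scores, so FLOPs fall roughly in proportion to the fraction of rows pruned:

```python
import torch
import torch.nn.functional as F

def pruned_attention(q, k, v, keep: torch.Tensor):
    # q: (n_q, d); k, v: (n_kv, d); keep: (n_q,) bool.
    # Attention runs only for query rows flagged in `keep` (e.g.
    # candidate items that actually need a score); the rest stay
    # zero, saving compute proportional to the prune rate.
    d = q.shape[-1]
    q_kept = q[keep]                                   # (n_keep, d)
    scores = q_kept @ k.T / d ** 0.5                   # (n_keep, n_kv)
    out = torch.zeros_like(q)
    out[keep] = F.softmax(scores, dim=-1) @ v          # scatter back
    return out
```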
Problem

Research questions and friction points this paper is trying to address.

feature sparsity · label density · industrial-scale recommenders · ranking models · Transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically Optimized Ranking Transformer · local attention · query pruning · generative pre-training · model FLOPs utilization (MFU; see the sketch below)
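For reference, model FLOPs utilization (MFU), the quantity behind the abstract's 22% figure, is conventionally defined as achieved model FLOPs per second divided by the hardware's theoretical peak. A hypothetical helper; the numbers in the example are made up, not the paper's:

```python
def mfu(achieved_tokens_per_s: float,
        flops_per_token: float,
        peak_flops_per_s: float) -> float:
    # Model FLOPs utilization: the fraction of the accelerator's
    # peak throughput that a training step actually realizes.
    return achieved_tokens_per_s * flops_per_token / peak_flops_per_s

# Example: 1.2e5 tokens/s at 6e9 FLOPs/token on a 3.3e15 FLOP/s
# accelerator gives 1.2e5 * 6e9 / 3.3e15 ≈ 0.22, i.e. about 22% MFU.
```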
Authors

Chunqi Wang (Alibaba Group)
Bingchao Wu (Alibaba International Digital Commercial Group)
Taotian Pang (Alibaba International Digital Commercial Group)
Jiahao Wang (Xi'an Jiaotong University; Alibaba Cloud) · Computer Vision, Generative Model, AIGC
Jie Yang (Alibaba International Digital Commercial Group)
Jia Liu (Alibaba International Digital Commercial Group)
Hao Zhang (Alibaba DAMO Academy, NTU) · Vision and Language, Natural Language Processing
Hai Zhu (Alibaba International Digital Commercial Group)
Lei Shen (Alibaba International Digital Commercial Group)
Shizhun Wang (Alibaba International Digital Commercial Group)
Bing Wang (Jilin University | Alibaba Group) · Large Language Models, AI Safety, Misinformation
Xiaoyi Zeng (Alibaba International Digital Commercial Group)