MixFormer: Co-Scaling Up Dense and Sequence in Industrial Recommenders

πŸ“… 2026-02-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses a limitation of industrial recommendation systems: decoupling the sequential-modeling and feature-interaction modules hinders synergistic scaling under constrained computational budgets. To overcome this, we propose MixFormer, a unified Transformer architecture that jointly models user behavioral sequences and high-order feature interactions within a single backbone, integrating both components into a shared parameterized framework for the first time. A user-item decoupling strategy further enhances inference efficiency and deployment feasibility. Extensive evaluations on large-scale industrial datasets and online A/B tests on Douyin and Douyin Lite demonstrate consistent and significant improvements across key metrics, including recommendation accuracy, user active days, and session duration.

πŸ“ Abstract
As industrial recommender systems enter a scaling-driven regime, Transformer architectures have become increasingly attractive for scaling models towards larger capacity and longer sequences. However, existing Transformer-based recommendation models remain structurally fragmented, where sequence modeling and feature interaction are implemented as separate modules with independent parameterization. Such designs introduce a fundamental co-scaling challenge, as model capacity must be suboptimally allocated between dense feature interaction and sequence modeling under a limited computational budget. In this work, we propose MixFormer, a unified Transformer-style architecture tailored for recommender systems, which jointly models sequential behaviors and feature interactions within a single backbone. Through a unified parameterization, MixFormer enables effective co-scaling across both dense capacity and sequence length, mitigating the trade-off observed in decoupled designs. Moreover, the integrated architecture facilitates deep interaction between sequential and non-sequential representations, allowing high-order feature semantics to directly inform sequence aggregation and enhancing overall expressiveness. To ensure industrial practicality, we further introduce a user-item decoupling strategy for efficiency optimizations that significantly reduce redundant computation and inference latency. Extensive experiments on large-scale industrial datasets demonstrate that MixFormer consistently exhibits superior accuracy and efficiency. Furthermore, large-scale online A/B tests on two production recommender systems, Douyin and Douyin Lite, show consistent improvements in user engagement metrics, including active days and in-app usage duration.
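The abstract's core architectural idea (a single backbone whose attention spans both non-sequential feature representations and the behavior sequence) can be sketched as follows. This is a hedged illustration, not the authors' implementation: all names, shapes, and the toy single-head attention are assumptions made for clarity; the actual MixFormer backbone, its unified parameterization, and the user-item decoupling strategy are specified in the paper itself.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_k, seed=0):
    # toy single-head self-attention over the joint token sequence;
    # one shared set of projections serves both token types
    rng = np.random.default_rng(seed)
    d = tokens.shape[-1]
    wq, wk, wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d_k))
    return attn @ v

# Illustrative inputs: 8 non-sequential feature tokens and 32 behavior
# tokens, both already projected to a shared model dimension (64 here).
feature_tokens = np.random.default_rng(1).standard_normal((8, 64))
behavior_tokens = np.random.default_rng(2).standard_normal((32, 64))

# Unified parameterization: one attention stack sees both token types in a
# single sequence, so feature semantics can directly inform sequence
# aggregation instead of living in a separate, independently sized module.
joint = np.concatenate([feature_tokens, behavior_tokens], axis=0)  # (40, 64)
out = self_attention(joint, d_k=64)
print(out.shape)
```

Under this framing, "co-scaling" amounts to growing one shared backbone (depth, width, and sequence length together) rather than splitting a fixed budget between a feature-interaction tower and a separate sequence encoder.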
Problem

Research questions and friction points this paper is trying to address.

co-scaling
Transformer-based recommendation
sequence modeling
feature interaction
industrial recommender systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

MixFormer
co-scaling
unified architecture
feature interaction
sequence modeling
Xu Huang
ByteDance, Shanghai, China
Hao Zhang
ByteDance, Shanghai, China
Zhifang Fan
Alibaba
Natural Language Processing · Information Retrieval · Recommender System
Yunwen Huang
ByteDance, Beijing, China
Zhuoxing Wei
ByteDance, Beijing, China
Zheng Chai
ByteDance
Machine Learning · Sequential Modeling · Recommendation · Anomaly Detection/Recognition
Jinan Ni
ByteDance, Shanghai, China
Yuchao Zheng
ByteDance, Hangzhou, China
Qiwei Chen
ByteDance, Shanghai, China