AI Summary
This work addresses a key limitation of industrial recommendation systems: decoupling the sequential-modeling and feature-interaction modules hinders their synergistic scaling under constrained computational budgets. To overcome this, we propose MixFormer, a unified Transformer architecture that jointly models user behavioral sequences and high-order feature interactions within a single backbone, integrating both components into a shared parameterized framework for the first time. By introducing a user-item decoupling strategy, MixFormer significantly improves inference efficiency and deployment feasibility. Extensive evaluations on large-scale industrial datasets and online A/B tests on Douyin and Douyin Lite demonstrate consistent, significant improvements across key metrics, including recommendation accuracy, user active days, and session duration.
Abstract
As industrial recommender systems enter a scaling-driven regime, Transformer architectures have become increasingly attractive for scaling models toward larger capacity and longer sequences. However, existing Transformer-based recommendation models remain structurally fragmented: sequence modeling and feature interaction are implemented as separate modules with independent parameterization. Such designs introduce a fundamental co-scaling challenge, as model capacity must be suboptimally allocated between dense feature interaction and sequence modeling under a limited computational budget. In this work, we propose MixFormer, a unified Transformer-style architecture tailored for recommender systems, which jointly models sequential behaviors and feature interactions within a single backbone. Through its unified parameterization, MixFormer enables effective co-scaling across both dense capacity and sequence length, mitigating the trade-off observed in decoupled designs. Moreover, the integrated architecture facilitates deep interaction between sequential and non-sequential representations, allowing high-order feature semantics to directly inform sequence aggregation and enhancing overall expressiveness. To ensure industrial practicality, we further introduce a user-item decoupling strategy whose efficiency optimizations significantly reduce redundant computation and inference latency. Extensive experiments on large-scale industrial datasets demonstrate that MixFormer consistently achieves superior accuracy and efficiency. Furthermore, large-scale online A/B tests on two production recommender systems, Douyin and Douyin Lite, show consistent improvements in user engagement metrics, including active days and in-app usage duration.
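To make the "single backbone" idea concrete, here is a minimal, hypothetical PyTorch sketch (not the paper's actual implementation; all names, dimensions, and the first-token readout are illustrative assumptions). It shows the core notion: embedded non-sequential feature-field tokens and behavior-sequence tokens are concatenated into one token stream and processed by a single shared Transformer encoder, so feature semantics and sequential context attend to each other under one parameterization.

```python
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    """Hypothetical sketch of a unified Transformer backbone that mixes
    behavior-sequence tokens and non-sequential feature tokens."""

    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        # One shared encoder: no separate feature-interaction module.
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, seq_tokens, feat_tokens):
        # Concatenate feature-field tokens and behavior tokens into a
        # single stream; self-attention lets high-order feature semantics
        # directly inform sequence aggregation (and vice versa).
        tokens = torch.cat([feat_tokens, seq_tokens], dim=1)
        h = self.encoder(tokens)
        # Read the score off the first feature token (an arbitrary
        # design choice for this sketch).
        return self.head(h[:, 0]).squeeze(-1)

model = UnifiedBackbone()
seq = torch.randn(2, 20, 64)    # 20 embedded behavior tokens per user
feats = torch.randn(2, 5, 64)   # 5 embedded feature-field tokens
scores = model(seq, feats)
print(scores.shape)  # torch.Size([2]) -- one score per user
```

Because all tokens share one parameterized encoder, scaling the backbone (depth, width, or sequence length) benefits both sequence modeling and feature interaction at once, which is the co-scaling property the abstract describes.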