🤖 AI Summary
Recommendation systems have long been constrained by multi-stage cascaded architectures, resulting in fragmented computation, misaligned optimization objectives, and difficulty incorporating state-of-the-art AI advances. This paper proposes OneRec—the first end-to-end generative architecture tailored for industrial recommendation, unifying recall, ranking, and generation into a single trainable framework for full-pipeline joint optimization. Key contributions include: (1) establishing a recommendation-specific end-to-end generative paradigm; (2) the first successful deployment of reinforcement learning for optimization in production-scale recommendation; (3) discovery and empirical validation of scaling laws for recommendation models; and (4) FLOPs-aware model scaling coupled with deep GPU optimization, achieving 23.7% MFU in training and 28.8% in inference—comparable to large language models. Experiments demonstrate an operating expense that is only 10.6% of conventional pipelines, support for 25% of the total QPS of the Kuaishou/Kuaishou Lite APPs, 0.54%/1.24% increases in average App Stay Time, and significant growth in 7-day user Lifetime.
📝 Abstract
Recommender systems have been widely used in various large-scale user-oriented platforms for many years. However, compared to the rapid developments in the AI community, recommendation systems have not achieved a comparable breakthrough in recent years. For instance, they still rely on a multi-stage cascaded architecture rather than an end-to-end approach, leading to computational fragmentation and optimization inconsistencies, and hindering the effective application of key breakthrough technologies from the AI community in recommendation scenarios. To address these issues, we propose OneRec, which reshapes the recommendation system through an end-to-end generative approach and achieves promising results. Firstly, we have increased the computational FLOPs of the current recommendation model by $10\times$ and have identified the scaling laws for recommendations within certain boundaries. Secondly, reinforcement learning techniques, previously difficult to apply for optimizing recommendations, show significant potential in this framework. Lastly, through infrastructure optimizations, we have achieved 23.7% and 28.8% Model FLOPs Utilization (MFU) on flagship GPUs during training and inference, respectively, aligning closely with the LLM community. This architecture significantly reduces communication and storage overhead, resulting in an operating expense that is only 10.6% of that of traditional recommendation pipelines. Deployed in the Kuaishou and Kuaishou Lite APPs, it handles 25% of total queries per second, enhancing overall App Stay Time by 0.54% and 1.24%, respectively. Additionally, we have observed significant increases in metrics such as 7-day Lifetime, which is a crucial indicator of recommendation experience. We also provide practical lessons and insights derived from developing, optimizing, and maintaining a production-scale recommendation system with significant real-world impact.
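The MFU figures above follow the standard definition used in the LLM community: the FLOPs the model actually executes per second, divided by the hardware's theoretical peak throughput. A minimal sketch of that calculation is below; the numbers are illustrative placeholders, not Kuaishou's actual measurements.

```python
def mfu(model_flops_per_step: float, steps_per_second: float,
        peak_hardware_flops_per_second: float) -> float:
    """Model FLOPs Utilization: achieved FLOPs throughput over hardware peak.

    model_flops_per_step: FLOPs the model performs in one training/inference step.
    steps_per_second: measured step throughput.
    peak_hardware_flops_per_second: the GPU's theoretical peak (datasheet value).
    """
    achieved_flops_per_second = model_flops_per_step * steps_per_second
    return achieved_flops_per_second / peak_hardware_flops_per_second

# Hypothetical example: 6e12 FLOPs per step at 40 steps/s on a GPU with a
# 1e15 FLOP/s peak gives 2.4e14 / 1e15 = 24% MFU.
print(f"{mfu(6e12, 40, 1e15):.1%}")  # → 24.0%
```

Because MFU normalizes by model FLOPs rather than wall-clock speed alone, it lets a recommendation system's hardware efficiency be compared directly against LLM training runs, which is the comparison the abstract draws.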