Compute Only Once: UG-Separation for Efficient Large Recommendation Models

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference cost in large-scale recommender systems caused by dense interaction architectures, where deep coupling between user and item representations prevents reuse of user-side computations. To overcome this, we propose the User-Group Separation (UG-Sep) framework, which explicitly decouples user and item representations via an information-flow masking mechanism, enabling partial user computation to be reused across samples. We further introduce an adaptive information compensation strategy and W8A16 quantization to maintain model expressiveness while significantly improving efficiency. Our approach is the first to enable user-side computation reuse in dense-interaction recommendation models and extends the KV caching concept to non-sequential architectures. Evaluated across multiple ByteDance production scenarios, UG-Sep reduces inference latency by up to 20% without degrading online user experience or business metrics.

📝 Abstract
Driven by scaling laws, recommender systems increasingly rely on large-scale models to capture complex feature interactions and user behaviors, but this trend also leads to prohibitive training and inference costs. While long-sequence models (e.g., LONGER) can reuse user-side computation through KV caching, such reuse is difficult in dense feature interaction architectures (e.g., RankMixer), where user and group (candidate item) features are deeply entangled across layers. In this work, we propose User-Group Separation (UG-Sep), a novel framework that enables reusable user-side computation in dense interaction models for the first time. UG-Sep introduces a masking mechanism that explicitly disentangles user-side and item-side information flows within token-mixing layers, ensuring that a subset of tokens preserves purely user-side representations across layers. This design enables the corresponding token computations to be reused across multiple samples, significantly reducing redundant inference cost. To compensate for the potential expressiveness loss induced by masking, we further propose an Information Compensation strategy that adaptively reconstructs suppressed user-item interactions. Moreover, as UG-Sep substantially reduces user-side FLOPs and exposes memory-bound components, we incorporate W8A16 (8-bit weight, 16-bit activation) weight-only quantization to alleviate memory bandwidth bottlenecks and achieve additional acceleration. We conduct extensive offline evaluations and large-scale online A/B experiments at ByteDance, demonstrating that UG-Sep reduces inference latency by up to 20 percent without degrading online user experience or commercial metrics across multiple business scenarios, including feed recommendation and advertising systems.
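The core idea of the masking mechanism can be illustrated with a minimal sketch. The paper does not publish its exact formulation, so the function below is a hypothetical NumPy illustration: in a token-mixing layer, the mask blocks item-to-user information flow, so the user-side output tokens depend only on user-side inputs and can therefore be computed once and reused across many candidate items.

```python
import numpy as np

def masked_token_mixing(tokens, W, n_user):
    """Hypothetical sketch of UG-Sep-style information-flow masking.

    tokens: (T, d) matrix of token representations; the first n_user
            rows are user-side tokens, the rest are item-side tokens.
    W:      (T, T) token-mixing weight matrix.

    The mask zeroes the entries that would let item tokens flow into
    user-side outputs, so those outputs stay purely user-dependent.
    """
    T = tokens.shape[0]
    mask = np.ones((T, T))
    mask[:n_user, n_user:] = 0.0  # user outputs cannot read item tokens
    return (W * mask) @ tokens

rng = np.random.default_rng(0)
T, d, n_user = 6, 4, 3
tokens = rng.normal(size=(T, d))
W = rng.normal(size=(T, T))

out = masked_token_mixing(tokens, W, n_user)

# Reuse property: swapping the item tokens for a different candidate
# leaves the user-side outputs unchanged, so they can be cached.
tokens2 = tokens.copy()
tokens2[n_user:] = rng.normal(size=(T - n_user, d))
out2 = masked_token_mixing(tokens2, W, n_user)
assert np.allclose(out[:n_user], out2[:n_user])
```

Because the masked mixing makes the first `n_user` output rows a function of user tokens alone, that slice plays the same role as a KV cache in sequence models: computed once per user request, shared across all scored candidates.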
Problem

Research questions and friction points this paper is trying to address.

large recommendation models
dense feature interaction
computation reuse
inference cost
user-item entanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

User-Group Separation
KV caching reuse
dense interaction models
information compensation
weight-only quantization
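The W8A16 component named above admits a compact illustration. The snippet below is a NumPy sketch under common assumptions (per-output-channel symmetric quantization, dequantize-then-matmul), not the paper's actual kernel: weights are stored in int8 to cut memory traffic roughly 4x versus fp32, while activations stay in 16-bit floating point.

```python
import numpy as np

def quantize_w8(W):
    """Per-output-channel symmetric int8 quantization of a weight matrix."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def linear_w8a16(x, q, scale):
    """W8A16 linear layer sketch: int8 weights, fp16 activations."""
    # Dequantize on the fly; in a real kernel this happens inside the matmul.
    W_deq = q.astype(np.float16) * scale.astype(np.float16)
    return x.astype(np.float16) @ W_deq.T

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16)).astype(np.float32)  # (out_dim, in_dim)
x = rng.normal(size=(2, 16)).astype(np.float32)  # (batch, in_dim)

q, scale = quantize_w8(W)
y_ref = x @ W.T
y_q = linear_w8a16(x, q, scale)
err = np.abs(y_ref - y_q.astype(np.float32)).max()
```

Weight-only quantization helps precisely in the regime the abstract describes: once UG-Sep removes redundant user-side FLOPs, the remaining layers are memory-bandwidth-bound, so shrinking weight storage yields additional speedup with negligible numeric error.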
👥 Authors
Hui Lu (Department of Computer Science and Engineering, University of Texas at Arlington)
Zheng Chai (ByteDance)
Shipeng Bai (ByteDance AML)
Hao Zhang (ByteDance)
Zhifang Fan (Alibaba)
Kunmin Bai (ByteDance AML)
Yingwen Wu (ByteDance)
Bingzheng Wei (ByteDance)
Xiang Sun (Wuhan University)
Ziyan Gong (ByteDance AML)
Tianyi Liu (ByteDance AML)
Hua Chen (ByteDance AML)
Deping Xie (ByteDance AML)
Zhongkai Chen (ByteDance)
Zhiliang Guo (ByteDance)
Qiwei Chen (ByteDance)
Yuchao Zheng (ByteDance AML)