Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin

📅 2025-11-08
🤖 AI Summary
To address the latency and computational cost bottlenecks in modeling user behavior sequences of tens of thousands of items for short-video recommendation, this paper proposes an end-to-end long-sequence modeling framework. Methodologically, it introduces Stacked Target-to-History Cross-Attention (STCA), reducing time complexity from *O*(*L*²) to *O*(*L*); Request-Level Batching (RLB), which shares user representations across samples to amortize encoding overhead; and a “short-training, long-inference” length extrapolation strategy to enhance generalization. Deployed at scale on Douyin, the framework enables real-time inference over 10k-length sequences under stringent latency constraints. It achieves significant improvements in user engagement metrics and marks the first scalable industrial deployment of ultra-long sequence modeling in a large-scale recommender system—demonstrating both engineering feasibility and practical performance limits of long-sequence modeling.
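The "short-training, long-inference" idea relies on the encoder having no parameters tied to sequence length, so weights fit on short windows apply unchanged to 10k histories. A minimal numpy sketch of that property, with illustrative lengths and a mean-pooling stand-in for the real encoder (not the paper's implementation):

```python
import numpy as np

TRAIN_LEN, INFER_LEN, D = 2_000, 10_000, 8  # illustrative lengths/dims

def encode(history, w):
    # A length-agnostic encoder (mean pooling + projection): no parameter
    # depends on the number of history items, so the same weights can be
    # trained on short windows and applied to much longer histories.
    return history.mean(axis=0) @ w

rng = np.random.default_rng(0)
w = rng.normal(size=(D, D))                 # stand-in for trained weights
full_history = rng.normal(size=(INFER_LEN, D))

train_repr = encode(full_history[-TRAIN_LEN:], w)  # short-training view
infer_repr = encode(full_history, w)               # long-inference view
print(train_repr.shape, infer_repr.shape)
```

Any attention variant whose weight shapes are independent of `L` (such as the cross-attention described below) has the same extrapolation property.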

📝 Abstract
Short-video recommenders such as Douyin must exploit extremely long user histories without breaking latency or cost budgets. We present an end-to-end system that scales long-sequence modeling to 10k-length histories in production. First, we introduce Stacked Target-to-History Cross Attention (STCA), which replaces history self-attention with stacked cross-attention from the target to the history, reducing complexity from quadratic to linear in sequence length and enabling efficient end-to-end training. Second, we propose Request Level Batching (RLB), a user-centric batching scheme that aggregates multiple targets for the same user/request to share the user-side encoding, substantially lowering sequence-related storage, communication, and compute without changing the learning objective. Third, we design a length-extrapolative training strategy -- train on shorter windows, infer on much longer ones -- so the model generalizes to 10k histories without additional training cost. Across offline and online experiments, we observe predictable, monotonic gains as we scale history length and model capacity, mirroring the scaling law behavior observed in large language models. Deployed at full traffic on Douyin, our system delivers significant improvements on key engagement metrics while meeting production latency, demonstrating a practical path to scaling end-to-end long-sequence recommendation to the 10k regime.
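The key cost argument is that a query of length 1 (the target) attending over a history of length L costs O(L·d) per layer, versus O(L²·d) for history self-attention. A minimal numpy sketch of stacked target-to-history cross-attention under that reading; layer count, dimensions, and the residual connection are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, history, wq, wk, wv):
    # query: (d,), history: (L, d). The single target query attends over
    # the history: every matmul here is O(L*d) or O(d^2), linear in L.
    q = query @ wq                         # (d,)
    k = history @ wk                       # (L, d)
    v = history @ wv                       # (L, d)
    scores = (k @ q) / np.sqrt(len(q))     # (L,)
    return softmax(scores) @ v             # (d,)

def stca(target, history, layers):
    # Stacked cross-attention: the target representation is refined layer
    # by layer against the (fixed) history; the history never attends to
    # itself, which is what removes the quadratic term.
    h = target
    for wq, wk, wv in layers:
        h = h + cross_attention(h, history, wq, wk, wv)  # residual (assumed)
    return h

rng = np.random.default_rng(0)
d, L, n_layers = 16, 10_000, 3
layers = [tuple(rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))
          for _ in range(n_layers)]
out = stca(rng.normal(size=d), rng.normal(size=(L, d)), layers)
print(out.shape)
```

Because no operation materializes an L×L matrix, a 10k-item history stays cheap in both compute and memory.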
Problem

Research questions and friction points this paper is trying to address.

Modeling 10k-length user histories efficiently for short-video recommendation
Reducing attention complexity from quadratic to linear in sequence length
Meeting production latency constraints while scaling model capacity and improving engagement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stacked cross-attention reduces complexity to linear
User-centric batching shares encoding across targets
Length-extrapolative training enables long inference without additional training cost
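Request-Level Batching amortizes the expensive user-side encoding over all candidate targets in the same request, without changing the scores each (user, target) pair would receive. A hedged numpy sketch of the idea; `encode_user` is a trivial stand-in for the real long-sequence encoder:

```python
import numpy as np

def encode_user(history):
    # Stand-in for the expensive user-side encoder over a 10k-item
    # history; in RLB this runs once per request, not once per sample.
    return history.mean(axis=0)

def score(user_repr, target):
    return float(user_repr @ target)

def score_request(history, targets):
    # Request-Level Batching: encode the user once, then score every
    # candidate target in the request against the shared representation.
    u = encode_user(history)
    return [score(u, t) for t in targets]

rng = np.random.default_rng(1)
history = rng.normal(size=(10_000, 16))
targets = rng.normal(size=(500, 16))     # e.g. 500 candidates per request
scores = score_request(history, targets)

# Same scores as re-encoding the history for every sample, minus the cost.
naive = [score(encode_user(history), t) for t in targets]
assert np.allclose(scores, naive)
```

The equivalence check at the end is the point: batching at the request level changes where the encoding cost is paid, not the learning objective.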
Authors
Lin Guan, ByteDance
Jia-Qi Yang, ByteDance (machine learning, data mining, recommender systems)
Zhishan Zhao, ByteDance
Beichuan Zhang, Professor of Computer Science, the University of Arizona (Computer Networks)
Bo Sun, ByteDance
Xuanyuan Luo, ByteDance
Jinan Ni, ByteDance
Xiaowen Li, ByteDance
Yuhang Qi, ByteDance
Zhifang Fan, Alibaba (Natural Language Processing, Information Retrieval, Recommender System)
Hangyu Wang, Shanghai Jiao Tong University (Information Retrieval, Recommender System)
Qiwei Chen, ByteDance
Yi Cheng, ByteDance
Feng Zhang, ByteDance
Xiao Yang, ByteDance