Unleashing the Potential of Sparse Attention on Long-term Behaviors for CTR Prediction

📅 2026-01-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational complexity of standard self-attention in modeling long user behavior sequences, which hinders deployment in industrial recommender systems, and the failure of existing sparse attention methods to capture personalized and temporally dynamic user behaviors. To this end, we propose SparseCTR, a novel model featuring a personalized chunking strategy and a three-branch sparse self-attention mechanism that jointly captures global interests, interest transitions, and short-term interests. SparseCTR further incorporates a learnable composite relative time encoding to effectively model temporal and periodic relationships among user actions. Notably, it is the first CTR prediction method to exhibit a scaling law across three orders of magnitude in FLOPs. Experiments demonstrate that SparseCTR outperforms state-of-the-art approaches in offline metrics and achieves a 1.72% increase in click-through rate (CTR) and a 1.41% gain in cost per mille (CPM) in online A/B tests, while significantly reducing computational overhead.

📝 Abstract
In recent years, the success of large language models (LLMs) has driven the exploration of scaling laws in recommender systems. However, models that demonstrate scaling laws are challenging to deploy in industrial settings for modeling long sequences of user behaviors, due to the high computational complexity of the standard self-attention mechanism. Although various sparse self-attention mechanisms have been proposed in other fields, they are not fully suited to recommendation scenarios. This is because user behaviors exhibit personalization and temporal characteristics: different users have distinct behavior patterns, these patterns change over time, and the resulting data differ significantly in distribution from data in other fields. To address these challenges, we propose SparseCTR, an efficient and effective model specifically designed for long-term user behaviors. Specifically, we first segment behavior sequences into chunks in a personalized manner, avoiding the separation of continuous behaviors and enabling parallel processing of sequences. Based on these chunks, we propose a three-branch sparse self-attention mechanism to jointly identify users' global interests, interest transitions, and short-term interests. Furthermore, we design a composite relative temporal encoding via learnable, head-specific bias coefficients, better capturing sequential and periodic relationships among user behaviors. Extensive experimental results show that SparseCTR not only improves efficiency but also outperforms state-of-the-art methods. More importantly, it exhibits a clear scaling-law phenomenon, maintaining performance improvements across three orders of magnitude in FLOPs. In online A/B testing, SparseCTR increased CTR by 1.72% and CPM by 1.41%. Our source code is available at https://github.com/laiweijiang/SparseCTR.
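The abstract's core idea — segmenting a behavior sequence into chunks and restricting attention to within each chunk — can be sketched as below. This is a minimal illustration, not the paper's published algorithm: the boundary rule (split at large time gaps), the threshold, and the function names are all assumptions, and only the "short-term interest" (within-chunk) branch of the three-branch mechanism is shown.

```python
import numpy as np

def personalized_chunks(timestamps, gap_threshold):
    # Assumed chunking rule: start a new chunk whenever the time gap
    # between consecutive behaviors exceeds a per-user threshold.
    # (A stand-in for the paper's personalized chunking strategy.)
    gaps = np.diff(timestamps)
    boundaries = np.where(gaps > gap_threshold)[0] + 1
    return np.split(np.arange(len(timestamps)), boundaries)

def chunked_sparse_attention(q, k, v, chunks):
    # Attend only within each chunk, so the cost drops from O(n^2)
    # to roughly the sum of squared chunk lengths.
    out = np.zeros_like(v)
    for idx in chunks:
        scores = q[idx] @ k[idx].T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        out[idx] = weights @ v[idx]
    return out
```

In the full model, this within-chunk branch would be combined with branches over chunk-level summaries (global interests and interest transitions), and the attention scores would carry the learnable relative-time biases the abstract describes.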
Problem

Research questions and friction points this paper is trying to address.

sparse attention
CTR prediction
long-term user behavior
personalization
temporal dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Attention
Long-term User Behavior
Scaling Law
Personalized Chunking
Relative Temporal Encoding
Weijiang Lai
Institute of Software, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
Beihong Jin
Institute of Software, Chinese Academy of Sciences
Pervasive Computing, Distributed Computing
Di Zhang
Meituan, Beijing, China
Siru Chen
Meituan, Beijing, China
Jiongyan Zhang
Meituan, Beijing, China
Yuhang Gou
Meituan, Beijing, China
Jian Dong
Shopee
Computer Vision, Machine Learning
Xingxing Wang
Meituan, Beijing, China