Unleashing the Potential of Sparse Attention on Long-term Behaviors for CTR Prediction

📅 2026-01-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational complexity of standard self-attention in modeling long user behavior sequences, which hinders deployment in industrial recommender systems, and the failure of existing sparse attention methods to capture personalized and temporally dynamic user behaviors. To this end, we propose SparseCTR, a novel model featuring a personalized chunking strategy and a three-branch sparse self-attention mechanism that jointly captures global interests, interest transitions, and short-term interests. SparseCTR further incorporates a learnable composite relative time encoding to effectively model temporal and periodic relationships among user actions. Notably, it is the first CTR prediction method to exhibit a scaling law across three orders of magnitude in FLOPs. Experiments demonstrate that SparseCTR outperforms state-of-the-art approaches in offline metrics and achieves a 1.72% increase in click-through rate (CTR) and a 1.41% gain in cost per mille (CPM) in online A/B tests, while significantly reducing computational overhead.

📝 Abstract
In recent years, the success of large language models (LLMs) has driven the exploration of scaling laws in recommender systems. However, models that demonstrate scaling laws are challenging to deploy in industrial settings for modeling long sequences of user behaviors, due to the high computational complexity of the standard self-attention mechanism. Although various sparse self-attention mechanisms have been proposed in other fields, they are not fully suited to recommendation scenarios. This is because user behaviors exhibit personalization and temporal characteristics: different users have distinct behavior patterns, these patterns change over time, and the resulting data differ significantly in distribution from data in other fields. To address these challenges, we propose SparseCTR, an efficient and effective model specifically designed for long-term user behaviors. Specifically, we first segment behavior sequences into chunks in a personalized manner, avoiding the separation of continuous behaviors and enabling parallel processing of sequences. Based on these chunks, we propose a three-branch sparse self-attention mechanism to jointly identify users' global interests, interest transitions, and short-term interests. Furthermore, we design a composite relative temporal encoding via learnable, head-specific bias coefficients, better capturing sequential and periodic relationships among user behaviors. Extensive experimental results show that SparseCTR not only improves efficiency but also outperforms state-of-the-art methods. More importantly, it exhibits a clear scaling-law phenomenon, maintaining performance improvements across three orders of magnitude in FLOPs. In online A/B testing, SparseCTR increased CTR by 1.72% and CPM by 1.41%. Our source code is available at https://github.com/laiweijiang/SparseCTR.
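The abstract's core idea — segmenting a behavior sequence into chunks and restricting attention to within each chunk — can be sketched as below. This is a minimal illustration, not the paper's published algorithm: the boundary rule (split at large time gaps), the threshold, and the function names are all assumptions, and only the "short-term interest" (within-chunk) branch of the three-branch mechanism is shown.

```python
import numpy as np

def personalized_chunks(timestamps, gap_threshold):
    # Assumed chunking rule: start a new chunk whenever the time gap
    # between consecutive behaviors exceeds a per-user threshold.
    # (A stand-in for the paper's personalized chunking strategy.)
    gaps = np.diff(timestamps)
    boundaries = np.where(gaps > gap_threshold)[0] + 1
    return np.split(np.arange(len(timestamps)), boundaries)

def chunked_sparse_attention(q, k, v, chunks):
    # Attend only within each chunk, so the cost drops from O(n^2)
    # to roughly the sum of squared chunk lengths.
    out = np.zeros_like(v)
    for idx in chunks:
        scores = q[idx] @ k[idx].T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        out[idx] = weights @ v[idx]
    return out
```

In the full model, this within-chunk branch would be combined with branches over chunk-level summaries (global interests and interest transitions), and the attention scores would carry the learnable relative-time biases the abstract describes.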
Problem

Research questions and friction points this paper is trying to address.

sparse attention
CTR prediction
long-term user behavior
personalization
temporal dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Attention
Long-term User Behavior
Scaling Law
Personalized Chunking
Relative Temporal Encoding
Weijiang Lai
Institute of Software, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
Beihong Jin
Institute of Software, Chinese Academy of Sciences
Pervasive Computing, Distributed Computing
Di Zhang
Meituan, Beijing, China
Siru Chen
Meituan, Beijing, China
Jiongyan Zhang
Meituan, Beijing, China
Yuhang Gou
Meituan, Beijing, China
Jian Dong
Shopee
Computer Vision, Machine Learning
Xingxing Wang
Meituan, Beijing, China