🤖 AI Summary
This work addresses a critical limitation in industrial-scale click-through rate (CTR) prediction models, where early aggregation of user behavior sequences discards fine-grained signals and thereby constrains scaling gains. To overcome this, the authors propose the Efficiently Scalable Transformer (EST), which, unlike prior approaches, enables end-to-end, lossless unified sequence modeling for CTR prediction, guided by an analysis of the fundamental differences between CTR prediction and large language models. Central to EST are two novel mechanisms: Lightweight Cross-Attention (LCA) and Content Sparse Attention (CSA), which jointly alleviate the information bottleneck while maintaining computational efficiency. Deployed on Taobao's display advertising platform, EST achieves a 3.27% increase in revenue per mille (RPM) and a 1.22% improvement in CTR, while exhibiting stable and efficient power-law scaling behavior.
📝 Abstract
Efficiently scaling industrial Click-Through Rate (CTR) prediction has recently attracted significant research attention. Existing approaches typically employ early aggregation of user behaviors to maintain efficiency. However, such non-unified or partially unified modeling creates an information bottleneck by discarding fine-grained, token-level signals essential for unlocking scaling gains. In this work, we revisit the fundamental distinctions between CTR prediction and Large Language Models (LLMs), identifying two critical properties: the asymmetry in information density between behavioral and non-behavioral features, and the modality-specific priors of content-rich signals. Accordingly, we propose the Efficiently Scalable Transformer (EST), which achieves fully unified modeling by processing all raw inputs in a single sequence without lossy aggregation. EST integrates two modules: Lightweight Cross-Attention (LCA), which prunes redundant self-interactions to focus on high-impact cross-feature dependencies, and Content Sparse Attention (CSA), which utilizes content similarity to dynamically select high-signal behaviors. Extensive experiments show that EST exhibits a stable and efficient power-law scaling relationship, enabling predictable performance gains with model scale. Deployed on Taobao's display advertising platform, EST significantly outperforms production baselines, delivering a 3.27% RPM (Revenue Per Mille) increase and a 1.22% CTR lift, establishing a practical pathway for scalable industrial CTR prediction models.
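To make the two attention variants concrete, here is a minimal numpy sketch of how they could plausibly work, based only on the abstract's description. This is an illustrative assumption, not the paper's implementation: `lightweight_cross_attention` assumes LCA lets non-behavioral feature tokens attend to behavior tokens while skipping the quadratic behavior-to-behavior self-interactions, and `content_sparse_attention` assumes CSA selects the top-k behaviors by content similarity to a query embedding before attending. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lightweight_cross_attention(feature_tokens, behavior_tokens):
    """Sketch of LCA (assumed form): non-behavioral feature tokens (Q, d)
    attend to the behavior sequence (L, d), pruning the O(L^2)
    behavior-behavior self-interactions of full self-attention."""
    d = feature_tokens.shape[-1]
    scores = feature_tokens @ behavior_tokens.T / np.sqrt(d)  # (Q, L)
    return softmax(scores, axis=-1) @ behavior_tokens          # (Q, d)

def content_sparse_attention(query, behavior_tokens, k):
    """Sketch of CSA (assumed form): keep only the k behaviors whose
    content embedding is most similar to the query, then attend over
    that subset instead of the full sequence."""
    sims = behavior_tokens @ query                 # (L,) content similarity
    top_k = np.argsort(-sims)[:k]                  # indices of top-k behaviors
    selected = behavior_tokens[top_k]              # (k, d)
    d = query.shape[-1]
    weights = softmax(selected @ query / np.sqrt(d))
    return weights @ selected                      # (d,)
```

Under this reading, LCA reduces attention cost from O((Q+L)^2) to O(QL), and CSA caps the attended context at k tokens regardless of sequence length, which is consistent with the abstract's emphasis on maintaining efficiency while keeping the full raw sequence available.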