MARS: Modality-Aligned Retrieval for Sequence Augmented CTR Prediction

📅 2025-09-01
🤖 AI Summary
To address degraded CTR prediction performance caused by behavioral sparsity among low-activity users, this paper proposes the MARS framework. First, it uses a Stein kernel-based approach to align text and image features in a unified, unbiased semantic space, from which multimodal user embeddings are constructed. Second, it augments each low-activity user's behavior sequence with the most similar behavior sequences of high-activity users, which are retrieved, filtered, and aggregated via these embeddings. Unlike existing data augmentation methods, which rely heavily on collaborative signals, MARS explicitly uses item-level multimodal features. Extensive offline experiments and online A/B tests on the Kuaishou platform show that MARS consistently outperforms state-of-the-art baselines and delivers substantial gains in core business metrics. The framework has been deployed in production, serving the main traffic for hundreds of millions of users.

📝 Abstract
Click-through rate (CTR) prediction serves as a cornerstone of recommender systems. Despite the strong performance of current CTR models based on user behavior modeling, they are still severely limited by interaction sparsity, especially in low-active user scenarios. To address this issue, data augmentation of user behavior is a promising research direction. However, existing data augmentation methods heavily rely on collaborative signals while overlooking the rich multimodal features of items, leading to insufficient modeling of low-active users. To alleviate this problem, we propose a novel framework MARS (Modality-Aligned Retrieval for Sequence Augmented CTR Prediction). MARS utilizes a Stein kernel-based approach to align text and image features into a unified and unbiased semantic space to construct multimodal user embeddings. Subsequently, each low-active user's behavior sequence is augmented by retrieving, filtering, and concentrating the most similar behavior sequence of high-active users via multimodal user embeddings. Validated by extensive offline experiments and online A/B tests, our framework MARS consistently outperforms state-of-the-art baselines and achieves substantial growth on core business metrics within Kuaishou (https://www.kuaishou.com/). Consequently, MARS has been successfully deployed, serving the main traffic for hundreds of millions of users. To ensure reproducibility, we provide anonymous access to the implementation code (https://github.com/wangshukuan/MARS).
Problem

Research questions and friction points this paper is trying to address.

Addressing interaction sparsity in CTR prediction models
Augmenting user behavior sequences with multimodal features
Improving recommendations for low-active users via retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns text and image features via Stein kernel
Augments user behavior sequences with multimodal embeddings
Retrieves similar behaviors from high-active users
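
The retrieve, filter, and merge idea listed above can be illustrated with a small sketch. This is not the paper's implementation: the page does not say how user embeddings are pooled, which similarity measure is used, or how sequences are combined, so mean pooling, cosine similarity, the `top_k`, `min_sim`, and `max_len` parameters, and all function names below are assumptions made for illustration. The item embeddings are assumed to be already aligned across modalities; the Stein kernel alignment step itself is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def user_embedding(sequence, item_emb):
    """Assumption: a user embedding is the mean of the user's item embeddings."""
    vecs = [item_emb[item] for item in sequence]
    dim = len(vecs[0])
    return [sum(vec[d] for vec in vecs) / len(vecs) for d in range(dim)]


def augment_sequence(target_seq, high_active_seqs, item_emb,
                     top_k=2, min_sim=0.5, max_len=50):
    """Extend a short behavior sequence with items from similar high-active users."""
    target = user_embedding(target_seq, item_emb)
    scored = [(cosine(target, user_embedding(seq, item_emb)), uid)
              for uid, seq in high_active_seqs.items()]
    # Retrieval: rank high-active users by similarity and keep the top_k.
    scored.sort(reverse=True)
    # Filtering: drop retrieved users whose similarity is below the threshold.
    neighbours = [uid for sim, uid in scored[:top_k] if sim >= min_sim]
    borrowed = []
    for uid in neighbours:
        for item in high_active_seqs[uid]:
            if item not in target_seq and item not in borrowed:
                borrowed.append(item)
    # Merge: borrowed behaviors first, the user's own behaviors last,
    # truncated to the most recent max_len entries.
    return (borrowed + list(target_seq))[-max_len:]


# Toy data: two-dimensional item embeddings and two high-active users.
item_emb = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.8, 0.2]}
high_active = {"u1": ["a", "b", "d"], "u2": ["c"]}
augmented = augment_sequence(["a"], high_active, item_emb)
print(augmented)  # ['b', 'd', 'a']
```

In the toy data, user `u1` is close to the target user in embedding space and contributes items `b` and `d`, while `u2` falls below the similarity threshold and is filtered out. The augmented sequence would then be fed to a downstream CTR model in place of the original single-item sequence.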
Authors

Yutian Xiao
Beihang University
Shukuan Wang
Kuaishou Technology Co., Ltd.
Binhao Wang
City University of Hong Kong
Zhao Zhang
Beihang University
Yanze Zhang
University of Illinois at Chicago
Robotics, Multi-Agent Systems, Autonomous Driving, Robot Learning, Machine Vision
Shanqi Liu
Control Science and Engineering, Zhejiang University
Reinforcement learning
Chao Feng
University of Zurich
Network, machine learning, cybersecurity
Xiang Li
Kuaishou Technology Co., Ltd.
Fuzhen Zhuang
Beihang University