MARS: Modality-Aligned Retrieval for Sequence Augmented CTR Prediction

📅 2025-09-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address degraded CTR prediction performance caused by behavioral sparsity among low-activity users, this paper proposes the MARS framework. First, it employs the Stein kernel method to achieve unbiased semantic alignment across image and text modalities, constructing a unified multimodal embedding space. Second, leveraging high-activity users’ behavioral sequences, it enhances low-activity users’ behavioral representations via cross-modal retrieval, similarity-based sequence selection, and aggregation. Unlike conventional collaborative filtering—which heavily relies on explicit interaction signals—MARS explicitly integrates item-level multimodal features. Extensive offline experiments and online A/B tests on the Kuaishou platform demonstrate significant improvements in CTR estimation accuracy and substantial gains in core business metrics. The framework has been fully deployed in production, serving hundreds of millions of users.

📝 Abstract

Click-through rate (CTR) prediction serves as a cornerstone of recommender systems. Despite the strong performance of current CTR models based on user behavior modeling, they are still severely limited by interaction sparsity, especially in low-active user scenarios. To address this issue, data augmentation of user behavior is a promising research direction. However, existing data augmentation methods heavily rely on collaborative signals while overlooking the rich multimodal features of items, leading to insufficient modeling of low-active users. To alleviate this problem, we propose a novel framework MARS (Modality-Aligned Retrieval for Sequence Augmented CTR Prediction). MARS utilizes a Stein kernel-based approach to align text and image features into a unified and unbiased semantic space to construct multimodal user embeddings. Subsequently, each low-active user's behavior sequence is augmented by retrieving, filtering, and concentrating the most similar behavior sequence of high-active users via multimodal user embeddings. Validated by extensive offline experiments and online A/B tests, our framework MARS consistently outperforms state-of-the-art baselines and achieves substantial growth on core business metrics within Kuaishou (https://www.kuaishou.com/). Consequently, MARS has been successfully deployed, serving the main traffic for hundreds of millions of users. To ensure reproducibility, we provide anonymous access to the implementation code (https://github.com/wangshukuan/MARS).
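The retrieve, filter, and combine steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that a user embedding is the mean of already-aligned item embeddings, that similarity is cosine similarity, and that "concentrating" amounts to concatenating the retrieved sequences ahead of the user's own behaviors. The names `mean_pool`, `augment_sequence`, `top_k`, `min_sim`, and `max_len` are hypothetical.

```python
import math

def mean_pool(item_ids, item_emb):
    """Average item embeddings into one user vector (assumes a non-empty sequence)."""
    dim = len(next(iter(item_emb.values())))
    total = [0.0] * dim
    for item in item_ids:
        for d, value in enumerate(item_emb[item]):
            total[d] += value
    return [t / len(item_ids) for t in total]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def augment_sequence(target_seq, high_active_seqs, item_emb,
                     top_k=2, min_sim=0.5, max_len=50):
    """Prepend behaviors of the most similar high-active users to target_seq."""
    target_vec = mean_pool(target_seq, item_emb)
    # Retrieve: score every high-active user against the target user.
    scored = []
    for uid, seq in high_active_seqs.items():
        sim = cosine(target_vec, mean_pool(seq, item_emb))
        if sim >= min_sim:  # Filter: drop weakly similar users (assumed threshold).
            scored.append((sim, uid))
    scored.sort(reverse=True)
    # Combine: retrieved behaviors first, the user's own behaviors last.
    augmented = []
    for _, uid in scored[:top_k]:
        augmented.extend(high_active_seqs[uid])
    augmented.extend(target_seq)
    return augmented[-max_len:]  # Truncation keeps the user's own behaviors.

# Toy data: two-dimensional stand-ins for aligned multimodal item embeddings.
item_emb = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
high_active = {"u1": ["a", "b", "a"], "u2": ["c", "d", "c"]}
print(augment_sequence(["b"], high_active, item_emb))
```

In the toy data, only `u1` passes the similarity threshold, so its sequence alone is prepended to the low-active user's single behavior. A production system would presumably replace the linear scan with an approximate nearest-neighbor index.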

Problem

Research questions and friction points this paper is trying to address.

Addressing interaction sparsity in CTR prediction models

Augmenting user behavior sequences with multimodal features

Improving recommendations for low-active users via retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns text and image features via Stein kernel

Augments user behavior sequences with multimodal embeddings

Retrieves similar behaviors from high-active users
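The Stein kernel formulation is not spelled out on this page, so it cannot be reproduced here. As a stand-in for the general idea of kernel-based modality alignment, the sketch below computes a squared maximum mean discrepancy (MMD) with an RBF kernel between a batch of text embeddings and a batch of image embeddings. This is a different, simpler technique than the one the paper names. The function names and the `gamma` bandwidth are assumptions made for illustration.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd2(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        return sum(rbf(u, v, gamma) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy embeddings: the "near" image batch lies close to the text batch,
# while the "far" batch is shifted away from it.
text_emb = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
image_near = [[0.05, 0.0], [0.1, 0.05], [0.0, 0.05]]
image_far = [[2.0, 2.0], [2.1, 2.0], [2.0, 2.1]]

print(mmd2(text_emb, image_near) < mmd2(text_emb, image_far))
```

Used as a training loss, a discrepancy of this kind would push the two modality distributions toward each other. Whether MARS optimizes anything of this form is not stated in the summary above.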
