🤖 AI Summary
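Problem: Evaluating the alignment quality of multimodal large language models (MLLMs) in recommender systems remains challenging: static benchmarks fail to track dynamic real-world applications, online A/B testing is prohibitively expensive at scale, and conventional metrics offer no actionable guidance when learned representations underperform.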
Method: We propose the Leakage Impact Score (LIS), a representation-quality metric grounded in the theoretical upper bound of preference data leakage, which quantifies how much user preference information an MLLM’s representations inherently encode. LIS enables efficient offline diagnosis of the performance ceiling imposed by MLLM representations, without requiring online deployment.
Contribution/Results: LIS is interpretable, low-overhead, and highly sensitive. Combined with offline analysis and online validation, it drove measurable improvements on Xiaohongshu’s Explore Feed, for both Content Feed and Display Ads, increasing user time spent and advertiser value. These results demonstrate its effectiveness and practical value in large-scale industrial recommender systems.
📝 Abstract
Multimodal recommendation has emerged as a critical technique in modern recommender systems, leveraging content representations from advanced multimodal large language models (MLLMs). To ensure these representations are well adapted, alignment with the recommender system is essential. However, evaluating the alignment of MLLMs for recommendation is difficult for three reasons: (1) static benchmarks are inaccurate because real-world applications are dynamic, (2) evaluations on online systems, while accurate, are prohibitively expensive at scale, and (3) conventional metrics fail to provide actionable insights when learned representations underperform. To address these challenges, we propose the Leakage Impact Score (LIS), a novel metric for multimodal recommendation. Rather than directly assessing MLLMs, LIS efficiently measures the upper bound of preference data. We also share practical insights on deploying MLLMs with LIS in real-world scenarios. Online A/B tests on both Content Feed and Display Ads in production on Xiaohongshu's Explore Feed demonstrate the effectiveness of our proposed method, showing significant improvements in user time spent and advertiser value.