AI Summary
This study addresses the critical issue of benchmark data leakage in the evaluation of large language models (LLMs) for recommender systems, an issue that can inflate or degrade performance estimates. We systematically uncover and validate, for the first time, the memory effect of LLMs on recommendation benchmark data acquired during pretraining or fine-tuning, revealing a dual-impact mechanism: in-domain data leakage substantially inflates performance metrics, whereas out-of-domain leakage impairs recommendation accuracy. To investigate this phenomenon, we construct a hybrid corpus combining in-domain and out-of-domain user-item interactions and conduct experiments using continual pretraining strategies. Our findings demonstrate that data leakage is a pivotal factor undermining the reliability of LLM-based recommendation evaluations, offering methodological cautions and practical guidance for future research aimed at establishing trustworthy assessment protocols.
Abstract
The expanding integration of Large Language Models (LLMs) into recommender systems poses critical challenges to evaluation reliability. This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. This phenomenon occurs when LLMs are exposed to, and potentially memorize, benchmark datasets during pre-training or fine-tuning, leading to artificially inflated performance metrics that fail to reflect true model capability. To validate this phenomenon, we simulate diverse data leakage scenarios by continuing the pre-training of foundation models on strategically blended corpora, which include user-item interactions from both in-domain and out-of-domain sources. Our experiments reveal a dual effect of data leakage: when the leaked data is domain-relevant, it induces substantial but spurious performance gains, misleadingly exaggerating the model's capability; in contrast, domain-irrelevant leakage typically degrades recommendation accuracy, highlighting the complex and contingent nature of this contamination. Our findings establish data leakage as a critical, previously unaccounted-for factor in LLM-based recommendation that can obscure a model's true performance. We release our code at https://github.com/yusba1/LLMRec-Data-Leakage.
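The corpus-blending step described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the function name, the text template for serializing interactions, and the sampling scheme are all illustrative assumptions. It shows how one might mix a controlled fraction of in-domain benchmark interactions (the simulated leakage) into an otherwise out-of-domain corpus before continued pre-training.

```python
import random


def interaction_to_text(user, item):
    # Serialize a user-item interaction as a plain-text line, mimicking how
    # benchmark data might surface in a pretraining corpus.
    # (The exact template here is an illustrative assumption.)
    return f"User {user} interacted with item {item}."


def build_blended_corpus(in_domain, out_domain, leak_ratio, n_total, seed=0):
    """Sample a continued-pretraining corpus of `n_total` text lines in which
    a fraction `leak_ratio` is drawn from the in-domain benchmark
    (the simulated leakage) and the remainder from out-of-domain data."""
    rng = random.Random(seed)
    n_leak = round(n_total * leak_ratio)
    leaked = rng.sample(in_domain, n_leak)            # simulated leaked interactions
    clean = rng.sample(out_domain, n_total - n_leak)  # out-of-domain filler
    corpus = [interaction_to_text(u, i) for (u, i) in leaked + clean]
    rng.shuffle(corpus)  # interleave so leaked lines are not clustered
    return corpus


# Toy usage: a 10-line corpus with 30% in-domain leakage.
# In-domain users get IDs < 100, out-of-domain users IDs >= 100 (for clarity).
in_dom = [(u, i) for u in range(10) for i in range(10)]
out_dom = [(u + 100, i + 100) for u in range(10) for i in range(10)]
corpus = build_blended_corpus(in_dom, out_dom, leak_ratio=0.3, n_total=10)
```

Varying `leak_ratio` (including 0 as a leakage-free control) is what lets the resulting models be compared to isolate the effect of leakage on downstream recommendation metrics.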