Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of systematic evaluation of large language models (LLMs) in recommendation tasks, this paper introduces RecBench—the first multi-task benchmark specifically designed for recommendation—encompassing click-through rate (CTR) prediction and sequential recommendation (SeqRec) across five diverse domain-specific datasets. It uniformly evaluates 17 state-of-the-art LLMs against classical recommendation methods. A key contribution is the first systematic investigation into how different item representations—such as raw text and semantic embeddings—affect LLM-based recommendation performance. Results show up to a 170% improvement in NDCG@10 for SeqRec, albeit with substantially increased inference latency, highlighting a critical performance-efficiency trade-off. The evaluation comprehensively covers zero-shot and few-shot inference, multi-task modeling, and cross-domain generalization. All code, datasets, configurations, and the evaluation platform are publicly released.

📝 Abstract
In recent years, integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs with traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (including unique identifier, text, semantic embedding, and semantic identifier) and evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets from fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in the CTR scenario and up to a 170% NDCG@10 improvement in the SeqRec scenario. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering the LLM-as-RS paradigm impractical for real-time recommendation environments. We aim for our findings to inspire future research, including recommendation-specific model acceleration methods. We will release our code, data, configurations, and platform to enable other researchers to reproduce and build upon our experimental results.
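For reference, NDCG@10 is the ranking metric behind the headline SeqRec result. A minimal sketch of how it is computed is below; the function name and toy data are illustrative assumptions, not taken from the paper's released code:

```python
import math

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one ranked list of relevance scores (highest rank first)."""
    rels = ranked_relevances[:k]
    # DCG: discounted cumulative gain over the top-k ranked items
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
    # IDCG: DCG of the ideal (relevance-sorted) ordering of the same list
    ideal = sorted(ranked_relevances, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single ground-truth item ranked at position 3 of 10
print(round(ndcg_at_k([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]), 4))  # 0.5
```

Because the metric is normalized by the ideal ranking, a "170% NDCG@10 improvement" means the LLM-based recommender places the ground-truth next item far higher in its top-10 list than the conventional baseline does.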
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs vs traditional recommenders in recommendation tasks.
Assess LLM performance in CTR and SeqRec scenarios.
Analyze inference efficiency trade-offs in LLM-based recommenders.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces RecBench for LLM evaluation
Compares LLMs with traditional recommenders
Highlights LLM performance vs. efficiency trade-offs
Qijiong Liu
The HK PolyU, Hong Kong SAR
Jieming Zhu
Huawei Noah's Ark Lab, Hong Kong SAR
Lu Fan
PolyU
graph mining, low-resource language understanding
Kun Wang
Nanyang Technological University, Singapore
Hengchang Hu
National University of Singapore
Recommender System, Graph Neural Network
Wei Guo
Huawei Noah's Ark Lab, Singapore
Yong Liu
Huawei Noah's Ark Lab, Singapore
Xiao-Ming Wu
The HK PolyU, Hong Kong SAR