🤖 AI Summary
In the era of large language models (LLMs), is user-generated review text still indispensable for recommender systems? This paper systematically investigates the question through zero-shot, few-shot, and fine-tuning experiments on eight benchmark datasets. It introduces RAREval, the first comprehensive evaluation framework for review-aware recommendation, supporting multi-dimensional attribution analysis, including review ablation, perturbation, and cold-start scenarios. The methodology pairs LLMs (e.g., LLaMA, Qwen) with traditional sequential recommenders (e.g., BERT4Rec, SASRec), leveraging prompt engineering and instruction tuning. Results show that LLM-based approaches significantly outperform classical models under data sparsity and cold-start conditions. Moreover, removing or perturbing reviews causes negligible performance degradation, indicating that LLMs can implicitly capture user preferences without explicit review signals. The core contribution is empirical evidence of the diminishing reliance of modern recommenders on explicit review text, together with a reproducible, extensible evaluation paradigm for review-aware recommendation research.
📝 Abstract
With the advent of large language models (LLMs), the landscape of recommender systems is undergoing a significant transformation. Traditionally, user reviews have served as a rich source of contextual information for enhancing recommendation quality. However, as LLMs demonstrate an unprecedented ability to understand and generate human-like text, the question arises of whether explicit user reviews remain essential. In this paper, we systematically investigate the evolving role of text reviews in recommendation by comparing deep learning methods with LLM-based approaches. In particular, we conduct extensive experiments on eight public datasets and evaluate LLMs in zero-shot, few-shot, and fine-tuning scenarios. We further introduce RAREval, a benchmarking framework for review-aware recommender systems, to comprehensively assess the contribution of textual reviews to recommendation performance. Our framework examines various scenarios, including the removal of some or all textual reviews, random distortion of reviews, and recommendation performance under data sparsity and cold-start user settings. Our findings show that LLMs can function as effective review-aware recommendation engines, generally outperforming traditional deep learning approaches, particularly under data sparsity and cold-start conditions. In addition, removing some or all textual reviews, or randomly distorting them, does not necessarily degrade recommendation accuracy. These findings motivate a rethinking of how user preferences expressed in text reviews can be more effectively leveraged. All code and supplementary materials are available at: https://github.com/zhytk/RAREval-data-processing.
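The review-removal and random-distortion conditions described above can be sketched as simple dataset transformations applied before evaluation. This is a minimal illustration, not RAREval's actual API; the function names and record layout are assumptions made for the example:

```python
import random

def ablate_reviews(interactions, drop_ratio=1.0, seed=0):
    """Return a copy of the interaction log with a fraction of
    review texts removed (replaced by empty strings)."""
    rng = random.Random(seed)
    out = []
    for rec in interactions:
        rec = dict(rec)  # shallow copy so the original log is untouched
        if rec.get("review") and rng.random() < drop_ratio:
            rec["review"] = ""
        out.append(rec)
    return out

def perturb_reviews(interactions, seed=0):
    """Return a copy in which review texts are randomly shuffled
    across interactions, decoupling each review from its item."""
    rng = random.Random(seed)
    out = [dict(rec) for rec in interactions]
    reviews = [rec["review"] for rec in out]
    rng.shuffle(reviews)
    for rec, rev in zip(out, reviews):
        rec["review"] = rev
    return out

# Toy interaction log (hypothetical data for illustration only).
log = [
    {"user": "u1", "item": "i1", "review": "Great battery life."},
    {"user": "u1", "item": "i2", "review": "Too noisy for my taste."},
    {"user": "u2", "item": "i3", "review": "Perfect fit, would rebuy."},
]

no_reviews = ablate_reviews(log)   # all review text stripped
shuffled = perturb_reviews(log)    # reviews detached from their items
```

Running the same recommender on `log`, `no_reviews`, and `shuffled` and comparing accuracy is the attribution logic the paper uses to measure how much the model actually relies on review text.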