Revisiting Prompt Engineering: A Comprehensive Evaluation for LLM-based Personalized Recommendation

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the impact of prompt engineering on large language models (LLMs) in single-user personalized recommendation—specifically under cold-start, cross-domain, and zero-shot settings. We conduct large-scale experiments across eight public datasets, twelve LLMs, and twenty-three prompt variants, employing statistical hypothesis testing and linear mixed-effects modeling to quantify trade-offs between recommendation accuracy and inference cost induced by prompt design. Results reveal that larger LLMs achieve higher accuracy with concise prompts, whereas smaller- and medium-sized models benefit significantly from instruction rephrasing, background knowledge injection, and enhanced reasoning readability—challenging the common assumption that more complex prompts universally improve performance. To our knowledge, this is the first work to systematically characterize the coupling between model scale and optimal prompting strategies. Our findings provide empirical grounding and practical guidelines for jointly optimizing accuracy and efficiency in LLM-based recommender systems.

📝 Abstract
Large language models (LLMs) can perform recommendation tasks by taking prompts written in natural language as input. Compared to traditional methods such as collaborative filtering, LLM-based recommendation offers advantages in handling cold-start, cross-domain, and zero-shot scenarios, as well as supporting flexible input formats and generating explanations of user behavior. In this paper, we focus on a single-user setting, where no information from other users is used. This setting is practical for privacy-sensitive or data-limited applications. In such cases, prompt engineering becomes especially important for controlling the output generated by the LLM. We conduct a large-scale comparison of 23 prompt types across 8 public datasets and 12 LLMs. We use statistical tests and linear mixed-effects models to evaluate both accuracy and inference cost. Our results show that for cost-efficient LLMs, three types of prompts are especially effective: those that rephrase instructions, incorporate background knowledge, and make the reasoning process easier to follow. For high-performance LLMs, simple prompts often outperform more complex ones while reducing cost. In contrast, prompting styles commonly used in natural language processing, such as step-by-step reasoning or the use of reasoning models, often lead to lower accuracy. Based on these findings, we provide practical suggestions for selecting prompts and LLMs depending on the required balance between accuracy and cost.
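The three prompt families the abstract highlights for cost-efficient LLMs can be illustrated with a small sketch. The template wording and the `build_prompt` helper below are assumptions for illustration only, not the paper's actual prompt texts:

```python
# Hypothetical sketch of the prompt-variant families described in the abstract:
# instruction rephrasing, background-knowledge injection, and reasoning
# readability, plus a concise baseline for high-performance LLMs.

BASE_INSTRUCTION = ("From the candidate items, pick the one this user "
                    "is most likely to enjoy.")

def build_prompt(history, candidates, variant="plain"):
    """Compose a single-user recommendation prompt (illustrative templates)."""
    lines = []
    if variant == "rephrase":
        # Instruction rephrasing: restate the same task in different words.
        lines.append("Task: recommend exactly one item. In other words, "
                     "choose the single best match for this user's tastes.")
    elif variant == "background":
        # Background-knowledge injection: ask the model to recall domain facts.
        lines.append(BASE_INSTRUCTION)
        lines.append("First, briefly note what you know about each candidate.")
    elif variant == "readable_reasoning":
        # Reasoning readability: request clearly structured reasoning.
        lines.append(BASE_INSTRUCTION)
        lines.append("Explain your choice as a short numbered list of reasons.")
    else:
        # Concise prompt: the abstract reports this works best for
        # high-performance LLMs while also reducing inference cost.
        lines.append(BASE_INSTRUCTION)
    lines.append("User history: " + "; ".join(history))
    lines.append("Candidates: " + "; ".join(candidates))
    return "\n".join(lines)
```

In this single-user setting, only the target user's own history appears in the prompt, so no data from other users is needed at inference time.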
Problem

Research questions and friction points this paper is trying to address.

Evaluating prompt engineering for LLM-based personalized recommendations
Comparing 23 prompt types across 8 datasets and 12 LLMs
Identifying cost-effective prompts for accuracy and inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates 23 prompt types across diverse datasets
Focuses on single-user privacy-sensitive recommendation settings
Identifies cost-effective prompts for different LLM performance levels
Genki Kusano
NEC Corporation, Kawasaki, Japan
Kosuke Akimoto
NEC Corporation, Kawasaki, Japan
Kunihiro Takeoka
NEC Corporation
Machine Learning · Natural Language Processing