AI Summary
Existing agent-based recommendation methods struggle to model implicit item relationships (e.g., substitutability or complementarity), tend to hallucinate items, fail to scale efficiently to full-catalog ranking, and inadequately leverage large language models' (LLMs) commonsense reasoning capabilities. To address these limitations, we propose AgenDR, a novel LLM-agent-driven recommendation framework. First, it employs an LLM agent that explicitly infers semantic item relationships from user interaction history. Second, it introduces a personalized tool-selection mechanism that dynamically fuses LLM-generated rankings with collaborative filtering outputs. Third, it defines a new LLM evaluation metric jointly optimizing semantic alignment and ranking accuracy. Extensive experiments on three public grocery datasets demonstrate that AgenDR achieves, on average, a twofold improvement in full-catalog ranking performance over state-of-the-art baselines, while significantly enhancing recommendation relevance, interpretability, and scalability.
Abstract
Recent agent-based recommendation frameworks aim to simulate user behaviors by incorporating memory mechanisms and prompting strategies, but they struggle with hallucinating non-existent items and full-catalog ranking. Moreover, a largely underexplored opportunity lies in leveraging LLMs' commonsense reasoning to capture user intent through substitute and complement relationships between items, which are usually implicit in datasets and difficult for traditional ID-based recommenders to capture. In this work, we propose a novel LLM-agent framework, AgenDR, which bridges LLM reasoning with scalable recommendation tools. Our approach delegates full-ranking tasks to traditional models while utilizing LLMs to (i) integrate multiple recommendation outputs based on personalized tool suitability and (ii) reason over substitute and complement relationships grounded in user history. This design mitigates hallucination, scales to large catalogs, and enhances recommendation relevance through relational reasoning. Through extensive experiments on three public grocery datasets, we show that our framework achieves superior full-ranking performance, yielding on average a twofold improvement over its underlying tools. We also introduce a new LLM-based evaluation metric that jointly measures semantic alignment and ranking correctness.
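To make the tool-fusion idea concrete, the sketch below shows one way a per-user suitability weight could combine ranked lists from a collaborative-filtering tool and an LLM reranker via weighted reciprocal rank. All function names, weights, and item IDs here are illustrative assumptions, not the paper's actual implementation; in AgenDR the suitability weighting is decided by the LLM agent rather than fixed by hand.

```python
# Hypothetical sketch of personalized tool fusion (not the paper's code).

def fuse_rankings(tool_rankings, tool_weights, top_k=10):
    """Fuse ranked item lists from several recommendation tools.

    tool_rankings: dict mapping tool name -> ranked list of item IDs
    tool_weights:  dict mapping tool name -> per-user suitability weight
                   (assumed chosen by the LLM agent; fixed constants here)
    """
    scores = {}
    for tool, ranking in tool_rankings.items():
        w = tool_weights.get(tool, 0.0)
        for rank, item in enumerate(ranking):
            # Reciprocal-rank contribution, scaled by tool suitability.
            scores[item] = scores.get(item, 0.0) + w / (rank + 1)
    # Higher fused score = ranked earlier in the final list.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Usage: fuse a collaborative-filtering list with an LLM-reranked list
# for a user whose history favors the CF tool (weight 0.6 vs 0.4).
cf_list = ["milk", "bread", "eggs", "butter"]
llm_list = ["bread", "jam", "milk"]
fused = fuse_rankings(
    {"cf": cf_list, "llm": llm_list},
    {"cf": 0.6, "llm": 0.4},
)
print(fused[:3])
```

Because the traditional tools supply the candidate items, the fused list can only contain catalog items, which is one way a design like this avoids hallucinated recommendations.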