🤖 AI Summary
Existing evaluations of recommender systems based on large language models (LLMs) focus primarily on semantic plausibility or small-scale reranking, and lack a comprehensive assessment of set-level behavioral utility such as relevance, complementarity, and diversity. This work proposes RecoAtlas, a benchmark and toolkit that measures behavior-grounded, set-level utility separately from semantic plausibility. RecoAtlas combines utility proxy models learned from interaction data, a controllable tool-calling environment, and multidimensional metrics to evaluate shopping recommendation agents holistically. Experiments indicate that RecoAtlas discriminates well between systems: performance scales with model capacity and test-time compute, improves with stronger and better-aligned tools, and degrades under noisy or misaligned signals. The results also show that semantic plausibility does not necessarily track behavior-grounded utility.
📝 Abstract
LLM recommendation agents increasingly produce structured recommendation reports: sets of items accompanied by natural-language justifications. Yet existing evaluations often reduce this setting to reranking small shortlisted candidate sets, or judge reports mainly by semantic plausibility. We introduce Recommendation Atlas (Agentic Tool-Level Assessment for Shopping), or RecoAtlas, a benchmark and toolkit for evaluating shopping agents with behavior-grounded metrics. RecoAtlas complements held-out interaction metrics with learned utility proxies for relevance, complementarity, and diversity derived from interaction data, while separately measuring semantic coherence and explanation quality. Its controlled tool environment exposes agents to semantic, behavior-aligned, or faulty tools, enabling diagnosis of whether performance gains arise from stronger reasoning, better signals, or more effective tool-use policies. Across controlled experiments, we show that RecoAtlas exhibits key properties of a meaningful benchmark for agentic systems: performance scales with model capacity and test-time compute, improves with stronger and better-aligned tools, and degrades under noisy or misaligned signals, and the results reveal that semantic plausibility does not necessarily capture behavior-grounded utility. RecoAtlas provides a foundation for developing and evaluating shopping assistants that optimize not only for plausible recommendations, but also for coherent, behaviorally grounded recommendation sets.