🤖 AI Summary
This study addresses the challenge of assessing the reliability and scalability of large language models (LLMs) as automated evaluators in recommendation systems. Methodologically, we propose the first fully human-free, large-scale multi-agent evaluation framework, featuring a consensus-driven protocol that integrates pattern auditing and issue coding. Thirty-six diverse LLMs collaboratively assess recommendation outputs, with ground-truth labels generated via majority voting, enabling reproducible and scalable automated benchmarking. Our key contribution is the first LLM-as-judge evaluation paradigm that eliminates manual annotation. Experimental results show that Gemini 1.5 Pro achieves the best overall performance; Claude 3.5 Sonnet exhibits the highest decision confidence; GPT-4o offers the best cost-performance trade-off; the open-source GPT-OSS 20B leads among open models; and consensus is consistently and significantly stronger in structured domains than in unstructured ones.
📄 Abstract
Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama, across multiple product categories using a consensus-driven evaluation protocol. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting, enabling reproducible comparison of LLM evaluators without human annotation. Applied to large-scale complementary-item recommendation, the benchmark yields four key findings: (i) Anthropic Claude 3.5 Sonnet achieves the highest decision confidence; (ii) Gemini 1.5 Pro offers the best overall performance across categories; (iii) GPT-4o provides the most favorable latency-accuracy-cost trade-off; and (iv) GPT-OSS 20B leads among open-source models. Category-level analysis shows strong consensus in structured domains (Electronics, Sports) but persistent disagreement in lifestyle categories (Clothing, Food). These results establish ScalingEval as a reproducible benchmark and evaluation protocol for LLMs as judges, with actionable guidance on scaling, reliability, and model-family trade-offs.
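The core aggregation step, turning many per-judge verdicts into a single ground-truth label, can be sketched as a simple majority vote. The sketch below is illustrative only: the verdict labels, judge names, and the agreement-ratio confidence measure are assumptions, not the paper's exact protocol.

```python
from collections import Counter

def consensus_label(judgments: dict[str, str]) -> tuple[str, float]:
    """Aggregate per-judge verdicts into a consensus label by majority vote.

    `judgments` maps a judge-model name to its verdict for one
    recommendation pair. Returns the winning label and the fraction
    of judges that agreed with it (a simple confidence proxy).
    """
    counts = Counter(judgments.values())
    label, votes = counts.most_common(1)[0]  # most frequent verdict
    return label, votes / len(judgments)

# Hypothetical example: three judges evaluating one complementary-item pair.
votes = {
    "gpt-4o": "relevant",
    "gemini-1.5-pro": "relevant",
    "claude-3.5-sonnet": "irrelevant",
}
label, agreement = consensus_label(votes)
print(label, round(agreement, 2))  # relevant 0.67
```

In this framing, category-level disagreement (e.g. in Clothing or Food) would surface as a low agreement fraction across many items, while structured domains would cluster near 1.0.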