🤖 AI Summary
This study addresses the challenge of assessing the reliability and scalability of large language models (LLMs) as automated evaluators in recommendation systems. Methodologically, we propose the first fully human-free, large-scale multi-agent evaluation framework, featuring a consensus-driven protocol that integrates pattern auditing and issue coding. Thirty-six diverse LLMs collaboratively assess recommendation outputs, with ground-truth labels generated via majority voting, enabling reproducible and scalable automated benchmarking. Our key contribution is the first LLM-as-judge evaluation paradigm that eliminates manual annotation. Experimental results show that Gemini 1.5 Pro achieves the best overall performance; Claude 3.5 Sonnet exhibits the highest decision confidence; GPT-4o offers the best cost-performance trade-off; the open-source GPT-OSS 20B leads among open models; and consensus is consistently and significantly stronger in structured domains than in unstructured ones.
📄 Abstract
Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama, across multiple product categories using a consensus-driven evaluation protocol. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting, enabling reproducible comparison of LLM evaluators without human annotation. Applied to large-scale complementary-item recommendation, the benchmark yields four key findings: (i) Anthropic Claude 3.5 Sonnet achieves the highest decision confidence; (ii) Gemini 1.5 Pro offers the best overall performance across categories; (iii) GPT-4o provides the most favorable latency-accuracy-cost trade-off; and (iv) GPT-OSS 20B leads among open-source models. Category-level analysis shows strong consensus in structured domains (Electronics, Sports) but persistent disagreement in lifestyle categories (Clothing, Food). These results establish ScalingEval as a reproducible benchmark and evaluation protocol for LLMs as judges, with actionable guidance on scaling, reliability, and model-family trade-offs.
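The core aggregation step, turning many per-judge verdicts into a single ground-truth label, can be sketched as a simple majority vote. The sketch below is illustrative only: the verdict labels, judge names, and the agreement-ratio confidence measure are assumptions, not the paper's exact protocol.

```python
from collections import Counter

def consensus_label(judgments: dict[str, str]) -> tuple[str, float]:
    """Aggregate per-judge verdicts into a consensus label by majority vote.

    `judgments` maps a judge-model name to its verdict for one
    recommendation pair. Returns the winning label and the fraction
    of judges that agreed with it (a simple confidence proxy).
    """
    counts = Counter(judgments.values())
    label, votes = counts.most_common(1)[0]  # most frequent verdict
    return label, votes / len(judgments)

# Hypothetical example: three judges evaluating one complementary-item pair.
votes = {
    "gpt-4o": "relevant",
    "gemini-1.5-pro": "relevant",
    "claude-3.5-sonnet": "irrelevant",
}
label, agreement = consensus_label(votes)
print(label, round(agreement, 2))  # relevant 0.67
```

In this framing, category-level disagreement (e.g. in Clothing or Food) would surface as a low agreement fraction across many items, while structured domains would cluster near 1.0.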