LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional evaluation metrics for legal-domain recommendation systems in the generative AI era fail to capture fine-grained quality aspects, especially for retrieval-augmented generation (RAG) systems. Method: This paper proposes an LLM-as-a-Judge paradigm, replacing simplistic agreement-based metrics with Gwet’s AC2 to quantify inter-annotator reliability. It integrates Spearman and Kendall rank correlation coefficients with Wilcoxon signed-rank tests and Benjamini–Hochberg correction to establish a robust statistical comparison framework for RAG output quality. Contribution/Results: Evaluated on legal document recommendation tasks, the approach achieves near-expert human-level discrimination accuracy (AC2 > 0.85), significantly improving assessment efficiency and scalability. It provides a reproducible, interpretable, and high-fidelity evaluation benchmark tailored for domain-specific generative AI systems.
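The judge-selection criterion above rests on rank correlation between LLM and human scores rather than raw agreement. A minimal pure-Python sketch of Spearman's rho (an illustration of the statistic, not code from the paper; function and variable names are ours):

```python
def rankdata(xs):
    """Assign 1-based ranks, averaging ranks for tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with xs[order[i]].
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(human, judge):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rh, rj = rankdata(human), rankdata(judge)
    n = len(rh)
    mh, mj = sum(rh) / n, sum(rj) / n
    cov = sum((a - mh) * (b - mj) for a, b in zip(rh, rj))
    sd_h = sum((a - mh) ** 2 for a in rh) ** 0.5
    sd_j = sum((b - mj) ** 2 for b in rj) ** 0.5
    return cov / (sd_h * sd_j)
```

A rho near 1 indicates the LLM judge orders candidate outputs the same way human annotators do, which is the property that matters when the judge is used to rank systems.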

📝 Abstract
The evaluation bottleneck in recommendation systems has become particularly acute with the rise of Generative AI, where traditional metrics fall short of capturing nuanced quality dimensions that matter in specialized domains like legal research. Can we trust Large Language Models to serve as reliable judges of their own kind? This paper investigates LLM-as-a-Judge as a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts, where the stakes of recommendation quality are exceptionally high. We tackle two fundamental questions that determine practical viability: which inter-rater reliability metrics best capture the alignment between LLM and human assessments, and how do we conduct statistically sound comparisons between competing systems? Through systematic experimentation, we discover that traditional agreement metrics like Krippendorff's alpha can be misleading in the skewed distributions typical of AI system evaluations. Instead, Gwet's AC2 and rank correlation coefficients emerge as more robust indicators for judge selection, while the Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections provides the statistical rigor needed for reliable system comparisons. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications, transforming what was once a human-intensive bottleneck into an automated, yet statistically principled, evaluation framework.
Problem

Research questions and friction points this paper is trying to address.

Evaluating legal document recommendation quality in AI systems
Determining reliable metrics for LLM-human assessment alignment
Establishing statistically sound comparisons between competing systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-Judge evaluation method
Gwet's AC2 reliability metric for judge selection
Wilcoxon signed-rank tests with Benjamini–Hochberg correction for system comparisons
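The comparison protocol listed above pairs per-comparison Wilcoxon tests with a false-discovery-rate correction. A minimal pure-Python sketch of the Benjamini–Hochberg step (illustrative only; in practice the p-values would come from paired Wilcoxon signed-rank tests, e.g. `scipy.stats.wilcoxon`):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a reject flag per hypothesis, controlling FDR at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= k * alpha / m.
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            max_k = rank
    # Reject all hypotheses up to and including rank max_k.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            reject[i] = True
    return reject
```

For example, with p-values [0.001, 0.04, 0.03, 0.2] at alpha = 0.05, only the first hypothesis survives the correction, which is stricter than a naive per-test threshold when many system pairs are compared.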
Anu Pradhan
Bloomberg, New York, NY, USA
Alexandra Ortan
Bloomberg, New York, NY, USA
Apurv Verma
Bloomberg, New York, NY, USA
Madhavan Seshadri
Columbia University
Robotics · Multi-Agent Systems · Machine Learning · High Performance Computing