LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating LLM outputs in legal domains faces challenges including strong reliance on reference answers, poor adaptability of standardized metrics, and limited reliability of LLM-as-a-Judge approaches. To address these, we propose a reference-free evaluation framework: first, decomposing long legal responses into self-contained “Legal Data Points” (LDPs) that emulate lawyers’ case-reasoning logic; second, integrating legal text chunking, semantic completeness validation, and multi-dimensional reasoning alignment to enable interpretable, high-consistency automated assessment under the LLM-as-a-Judge paradigm. Evaluated on a proprietary legal dataset and LegalBench, our method significantly improves correlation with human expert judgments (+12.3% Spearman ρ) and inter-annotator agreement (Fleiss’ κ +0.28). We further release the first open-source collection of Legal Data Points to advance research in legal AI evaluation.
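To make the LDP pipeline concrete, the following is a minimal Python sketch of a reference-free evaluation loop in this style: decompose an answer into self-contained statements, judge each one, then aggregate. The `call_llm` helper, the prompt wording, and the three scoring dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import json


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around any chat-completion API."""
    raise NotImplementedError("plug in your LLM client here")


def decompose_into_ldps(answer: str) -> list[str]:
    """Step 1: split a long legal answer into self-contained Legal Data Points."""
    prompt = (
        "Split the following legal answer into self-contained statements, each "
        "understandable on its own. Return a JSON list of strings.\n\n" + answer
    )
    return json.loads(call_llm(prompt))


def judge_ldp(question: str, ldp: str) -> dict:
    """Step 2: score one LDP without a reference answer (illustrative dimensions)."""
    prompt = (
        "You are a legal expert. Rate the statement below from 1-5 on "
        "legal_accuracy, relevance and completeness, given the question. "
        "Return JSON with those three keys.\n\n"
        f"Question: {question}\nStatement: {ldp}"
    )
    return json.loads(call_llm(prompt))


def evaluate_answer(question: str, answer: str) -> float:
    """Step 3: aggregate per-LDP scores into one answer-level score."""
    ldps = decompose_into_ldps(answer)
    per_ldp = [sum(s.values()) / len(s) for s in (judge_ldp(question, l) for l in ldps)]
    return sum(per_ldp) / len(per_ldp) if per_ldp else 0.0
```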

📝 Abstract
Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into 'Legal Data Points' (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.
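The abstract reports stronger correlation with human expert evaluations and improved inter-annotator agreement. For readers who want to reproduce that kind of analysis, here is how the two statistics are commonly computed with off-the-shelf libraries; the toy scores and the choice of scipy/statsmodels are assumptions for illustration, not the authors' setup.

```python
from scipy.stats import spearmanr
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy per-answer scores: automatic judge vs. pooled human expert ratings.
judge_scores = [4.2, 3.1, 4.8, 2.5, 3.9]
expert_scores = [4.0, 3.0, 5.0, 2.0, 4.0]
rho, p_value = spearmanr(judge_scores, expert_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")

# Toy 1-5 ratings from three annotators on five answers (rows = items).
ratings = [
    [4, 4, 5],
    [3, 3, 3],
    [5, 4, 5],
    [2, 2, 3],
    [4, 4, 4],
]
table, _ = aggregate_raters(ratings)  # counts per rating category per item
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")
```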
Problem

Research questions and friction points this paper is trying to address.

Evaluating legal LLM outputs is difficult: reference data is costly to produce and standardized metrics adapt poorly to legal analysis
Existing LLM-as-a-Judge methods lack the reliability and consistency that legal reasoning tasks demand
Need for a reference-free evaluation, built on Legal Data Points, that reflects how lawyers assess answers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes legal responses into self-contained Legal Data Points
Introduces reference-free evaluation mimicking lawyer assessments
Outperforms baselines on proprietary and open-source legal datasets
Joseph Enguehard
Robin AI
Morgane Van Ermengem
Robin AI
Kate Atkinson
Robin AI
Sujeong Cha
Amazon Web Services
Arijit Ghosh Chowdhury
Amazon Web Services
NLP, Artificial Intelligence, Data Science
Prashanth Kallur Ramaswamy
Amazon Web Services
Jeremy Roghair
Amazon Web Services
Hannah R Marlowe
Amazon Web Services
Carina Suzana Negreanu
Robin AI
Kitty Boxall
Robin AI
Diana Mincu
Robin AI, ex-Google DeepMind
Machine Learning, Fairness, Explainability