JustEva: A Toolkit to Evaluate LLM Fairness in Legal Knowledge Inference

📅 2025-09-15
🤖 AI Summary
Large language models (LLMs) pose judicial fairness risks in legal reasoning due to their opaque "black-box" nature, yet no systematic fairness evaluation framework exists for the legal domain. Method: We propose JustEva, a novel, open-source evaluation toolkit supporting structured output generation, multi-dimensional fairness quantification, and statistical inference. It introduces a fine-grained labeling scheme covering 65 extra-legal attributes and defines three core fairness metrics: inconsistency, bias, and imbalanced inaccuracy. Visualization and regression analysis are integrated to ensure interpretability. Contribution/Results: Empirical evaluation reveals significant fairness deficiencies across mainstream LLMs on legal tasks. JustEva identifies bias sources and guides targeted model refinement, establishing a reproducible, scalable, and domain-specific assessment paradigm for trustworthy legal AI that advances both methodological rigor and practical deployability in law-oriented LLM evaluation.

📝 Abstract
The integration of Large Language Models (LLMs) into legal practice raises pressing concerns about judicial fairness, particularly due to the "black-box" nature of their processes. This study introduces JustEva, a comprehensive, open-source evaluation toolkit designed to measure LLM fairness in legal tasks. JustEva offers several advantages: (1) a structured label system covering 65 extra-legal factors; (2) three core fairness metrics - inconsistency, bias, and imbalanced inaccuracy; (3) robust statistical inference methods; and (4) informative visualizations. The toolkit supports two types of experiments, enabling a complete evaluation workflow: (1) generating structured outputs from LLMs using a provided dataset, and (2) conducting statistical analysis and inference on the LLMs' outputs through regression and other statistical methods. Empirical application of JustEva reveals significant fairness deficiencies in current LLMs, highlighting the lack of fair and trustworthy LLM legal tools. JustEva offers a convenient tool and methodological foundation for evaluating and improving algorithmic fairness in the legal domain.
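The paper defines its metrics formally; as a rough intuition only, the following sketch shows how an inconsistency-style and a bias-style measure could be computed over prompt variants that differ only in one extra-legal attribute. The function names, the example verdicts, and the group labels are all hypothetical illustrations, not JustEva's actual API.

```python
from collections import Counter

def inconsistency(verdicts):
    """Fraction of verdicts deviating from the majority outcome across
    prompt variants that differ only in one extra-legal attribute
    (0.0 = perfectly consistent)."""
    if not verdicts:
        return 0.0
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - majority_count / len(verdicts)

def group_bias(outcomes_by_group):
    """Largest gap in favorable-outcome rate between any two attribute
    groups: a simple demographic-parity-style bias measure."""
    rates = [sum(v) / len(v) for v in outcomes_by_group.values() if v]
    return max(rates) - min(rates)

# Hypothetical example: verdicts for the same case, with an
# extra-legal attribute (e.g. defendant gender) swapped per variant.
variants = ["guilty", "guilty", "not guilty", "guilty"]
print(inconsistency(variants))  # 0.25
print(group_bias({"male": [1, 1, 0], "female": [1, 0, 0]}))
```

In practice such per-case scores would be aggregated over a dataset and combined with regression analysis, as the toolkit's second experiment type describes.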
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM fairness in legal knowledge inference
Addressing judicial fairness concerns from black-box LLMs
Measuring inconsistency, bias, and imbalanced inaccuracy in legal AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source toolkit for legal fairness evaluation
Statistical inference methods and visualization tools
Structured label system covering extra-legal factors
👥 Authors

Zongyue Xue
Tsinghua University; Yale Law School, New Haven, Connecticut, U.S.
Siyuan Zheng
Tsinghua University; Shanghai Jiao Tong University, Shanghai, China
Shaochun Wang
Tsinghua University, Beijing, China
Yiran Hu
Tsinghua University; University of Waterloo, Waterloo, Ontario, Canada
Shenran Wang
Master of Science, UBC (Machine Learning, NLP)
Yuxin Yao
University of Science and Technology of China
Haitao Li
Tsinghua University, Beijing, China
Qingyao Ai
Associate Professor, Dept. of CS&T, Tsinghua University (Information Retrieval, Machine Learning)
Yiqun Liu
Tsinghua University, Beijing, China
Yun Liu
Weixing Shen
Tsinghua University, Beijing, China