JustEva: A Toolkit to Evaluate LLM Fairness in Legal Knowledge Inference

📅 2025-09-15
🤖 AI Summary
Large language models (LLMs) pose judicial fairness risks in legal reasoning due to their opaque "black-box" nature, yet no systematic fairness evaluation framework exists for the legal domain. Method: We propose JustEva, a novel, open-source evaluation toolkit supporting structured output generation, multi-dimensional fairness quantification, and statistical inference. It introduces a fine-grained labeling scheme covering 65 extra-legal attributes and defines three core fairness metrics: inconsistency, bias, and imbalanced inaccuracy. Visualization and regression analysis are integrated to ensure interpretability. Contribution/Results: Empirical evaluation reveals significant fairness deficiencies across mainstream LLMs on legal tasks. JustEva identifies bias sources and guides targeted model refinement, establishing a reproducible, scalable, and domain-specific assessment paradigm for trustworthy legal AI that advances both methodological rigor and practical deployability in law-oriented LLM evaluation.

📝 Abstract
The integration of Large Language Models (LLMs) into legal practice raises pressing concerns about judicial fairness, particularly due to the "black-box" nature of their processes. This study introduces JustEva, a comprehensive, open-source evaluation toolkit designed to measure LLM fairness in legal tasks. JustEva offers several advantages: (1) a structured label system covering 65 extra-legal factors; (2) three core fairness metrics - inconsistency, bias, and imbalanced inaccuracy; (3) robust statistical inference methods; and (4) informative visualizations. The toolkit supports two types of experiments, enabling a complete evaluation workflow: (1) generating structured outputs from LLMs using a provided dataset, and (2) conducting statistical analysis and inference on the LLMs' outputs through regression and other statistical methods. Empirical application of JustEva reveals significant fairness deficiencies in current LLMs, highlighting the lack of fair and trustworthy LLM legal tools. JustEva offers a convenient tool and methodological foundation for evaluating and improving algorithmic fairness in the legal domain.
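The paper defines its metrics formally; as a rough intuition only, the following sketch shows how an inconsistency-style and a bias-style measure could be computed over prompt variants that differ only in one extra-legal attribute. The function names, the example verdicts, and the group labels are all hypothetical illustrations, not JustEva's actual API.

```python
from collections import Counter

def inconsistency(verdicts):
    """Fraction of verdicts deviating from the majority outcome across
    prompt variants that differ only in one extra-legal attribute
    (0.0 = perfectly consistent)."""
    if not verdicts:
        return 0.0
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - majority_count / len(verdicts)

def group_bias(outcomes_by_group):
    """Largest gap in favorable-outcome rate between any two attribute
    groups: a simple demographic-parity-style bias measure."""
    rates = [sum(v) / len(v) for v in outcomes_by_group.values() if v]
    return max(rates) - min(rates)

# Hypothetical example: verdicts for the same case, with an
# extra-legal attribute (e.g. defendant gender) swapped per variant.
variants = ["guilty", "guilty", "not guilty", "guilty"]
print(inconsistency(variants))  # 0.25
print(group_bias({"male": [1, 1, 0], "female": [1, 0, 0]}))
```

In practice such per-case scores would be aggregated over a dataset and combined with regression analysis, as the toolkit's second experiment type describes.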
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM fairness in legal knowledge inference
Addressing judicial fairness concerns from black-box LLMs
Measuring inconsistency, bias, and imbalanced inaccuracy in legal AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source toolkit for legal fairness evaluation
Statistical inference methods and visualization tools
Structured label system covering extra-legal factors
👥 Authors

Zongyue Xue
Tsinghua University; Yale Law School, New Haven, Connecticut, U.S.
Siyuan Zheng
Tsinghua University; Shanghai Jiao Tong University, Shanghai, China
Shaochun Wang
Tsinghua University, Beijing, China
Yiran Hu
Tsinghua University; University of Waterloo, Waterloo, Ontario, Canada
Shenran Wang
Master of Science, UBC (Machine Learning, NLP)
Yuxin Yao
University of Science and Technology of China
Haitao Li
Tsinghua University, Beijing, China
Qingyao Ai
Associate Professor, Dept. of CS&T, Tsinghua University (Information Retrieval, Machine Learning)
Yiqun Liu
Tsinghua University, Beijing, China
Yun Liu
Weixing Shen
Tsinghua University, Beijing, China