🤖 AI Summary
This work addresses the high-stakes problem of fairness deficits in large language models (LLMs) for judicial decision-making. Methodologically, it introduces the first theory-grounded evaluation framework based on judicial fairness principles: a comprehensive annotation schema of 65 labels and 161 values; the open-source JudiFair dataset comprising 177,100 real-world case facts; and three quantifiable fairness metrics (inconsistency, bias, and imbalanced inaccuracy) applied to empirically assess 16 mainstream LLMs. Results reveal pervasive demographic biases across models, with higher accuracy often exacerbating rather than mitigating bias; temperature emerges as a fairness-sensitive hyperparameter, whereas model scale, release date, and country of origin show no statistically significant effects. The project releases all data, code, and evaluation tooling publicly, establishing a foundational benchmark and infrastructure for fairness research in legal AI.
📝 Abstract
Large Language Models (LLMs) are increasingly used in high-stakes fields where their decisions affect rights and equity. However, LLMs' judicial fairness and its implications for social justice remain underexplored. When LLMs act as judges, the ability to resolve judicial issues fairly is a prerequisite for their trustworthiness. Based on theories of judicial fairness, we construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values. Applying this framework to the judicial system, we compile an extensive dataset, JudiFair, comprising 177,100 unique case facts. To achieve robust statistical inference, we develop three evaluation metrics: inconsistency, bias, and imbalanced inaccuracy, and we introduce a method to assess the overall fairness of multiple LLMs across various labels. Through experiments with 16 LLMs, we uncover pervasive inconsistency, bias, and imbalanced inaccuracy across models, underscoring severe LLM judicial unfairness. In particular, LLMs display notably more pronounced biases on demographic labels, with slightly less bias on substance labels than on procedure labels. Interestingly, increased inconsistency correlates with reduced biases, but more accurate predictions exacerbate biases. While we find that adjusting the temperature parameter can influence LLM fairness, model size, release date, and country of origin do not exhibit significant effects on judicial fairness. Accordingly, we release a publicly available toolkit containing all datasets and code, designed to support future research in evaluating and improving LLM fairness.