🤖 AI Summary
This study identifies significant instability in large language models' (LLMs) interpretation of legal texts and a misalignment with human expert judgment: minor syntactic or phrasing variations in prompts induce substantial fluctuations in model outputs, and model predictions correlate only weakly to moderately with legal experts' assessments (mean Pearson's *r* < 0.4). To rigorously evaluate how faithfully LLMs track human legal reasoning, the authors conduct the first multi-round empirical investigation across leading models, including GPT-4, Claude, and Llama3, using structured prompt rewriting to quantify response consistency. Outputs are benchmarked against a large-scale, expert-annotated dataset (*N* > 1,200 judgments from practicing legal professionals). The results show that current LLMs lack cross-prompt robustness in legal interpretation, exhibit considerable inter-model variability, and fail to reliably emulate human legal reasoning patterns, raising substantial reliability concerns for judicial deployment.
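The two headline quantities in the summary, cross-prompt instability and correlation with expert judgment, can be illustrated with a minimal sketch. The toy numbers, array shapes, and variable names below are assumptions for illustration only, not the paper's data or code, and the paper's exact aggregation procedure may differ.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical toy data: each row is one legal scenario, each column one
# paraphrased prompt variant; values are the model's judgment that the legal
# text applies, rescaled to [0, 1]. Numbers are illustrative only.
model_scores = np.array([
    [0.90, 0.35, 0.80],   # scenario 1 under three prompt rewrites
    [0.10, 0.15, 0.05],   # scenario 2
    [0.70, 0.95, 0.20],   # scenario 3
    [0.40, 0.45, 0.85],   # scenario 4
])

# Mean rating per scenario from expert annotators, also rescaled to [0, 1].
human_means = np.array([0.75, 0.10, 0.55, 0.50])

# Cross-prompt (in)stability: spread of the model's answers across paraphrases
# of the same question, averaged over scenarios.
instability = model_scores.std(axis=1).mean()

# Alignment with experts: Pearson's r between each prompt variant's scores and
# the mean human judgments, reported per variant.
per_variant_r = [
    pearsonr(model_scores[:, j], human_means)[0]
    for j in range(model_scores.shape[1])
]

print(f"mean within-scenario std across prompt variants: {instability:.3f}")
print("Pearson r vs. human means, per prompt variant:",
      [f"{r:.2f}" for r in per_variant_r])
```

Under this framing, a model robust to rephrasing would show a small within-scenario standard deviation, and faithful emulation of expert judgment would push each variant's Pearson *r* well above the roughly 0.4 level reported in the study.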
📝 Abstract
Legal interpretation frequently involves assessing how a legal text, as understood by an 'ordinary' speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation, conducted in English, shows that models do not provide stable interpretive judgments: varying the question format can lead a model to wildly different conclusions. Moreover, the models show only weak to moderate correlation with human judgment, with large variance across models and question variants, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.