🤖 AI Summary
This study identifies significant instability in large language models' (LLMs) interpretation of legal texts and a misalignment with human expert judgment: minor syntactic or phrasing variations in prompts induce substantial fluctuations in model outputs, and model predictions correlate only weakly to moderately with legal experts' assessments (mean Pearson's *r* < 0.4). To rigorously evaluate how faithfully LLMs track human legal reasoning, the authors conduct the first multi-round empirical investigation across leading models, including GPT-4, Claude, and Llama3, using structured prompt rewriting to quantify response consistency. Outputs are benchmarked against a large-scale, expert-annotated dataset (*N* > 1,200 judgments from practicing legal professionals). The results show that current LLMs lack cross-prompt robustness in legal interpretation, exhibit considerable inter-model variability, and fail to reliably emulate human legal reasoning patterns, raising substantial reliability concerns for judicial deployment.
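The two headline quantities in the summary, cross-prompt instability and correlation with expert judgment, can be illustrated with a minimal sketch. The toy numbers, array shapes, and variable names below are assumptions for illustration only, not the paper's data or code, and the paper's exact aggregation procedure may differ.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical toy data: each row is one legal scenario, each column one
# paraphrased prompt variant; values are the model's judgment that the legal
# text applies, rescaled to [0, 1]. Numbers are illustrative only.
model_scores = np.array([
    [0.90, 0.35, 0.80],   # scenario 1 under three prompt rewrites
    [0.10, 0.15, 0.05],   # scenario 2
    [0.70, 0.95, 0.20],   # scenario 3
    [0.40, 0.45, 0.85],   # scenario 4
])

# Mean rating per scenario from expert annotators, also rescaled to [0, 1].
human_means = np.array([0.75, 0.10, 0.55, 0.50])

# Cross-prompt (in)stability: spread of the model's answers across paraphrases
# of the same question, averaged over scenarios.
instability = model_scores.std(axis=1).mean()

# Alignment with experts: Pearson's r between each prompt variant's scores and
# the mean human judgments, reported per variant.
per_variant_r = [
    pearsonr(model_scores[:, j], human_means)[0]
    for j in range(model_scores.shape[1])
]

print(f"mean within-scenario std across prompt variants: {instability:.3f}")
print("Pearson r vs. human means, per prompt variant:",
      [f"{r:.2f}" for r in per_variant_r])
```

Under this framing, a model robust to rephrasing would show a small within-scenario standard deviation, and faithful emulation of expert judgment would push each variant's Pearson *r* well above the roughly 0.4 level reported in the study.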
📝 Abstract
Legal interpretation frequently involves assessing how a legal text, as understood by an 'ordinary' speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation, conducted in English, shows that models do not provide stable interpretive judgments: varying the question format can lead a model to wildly different conclusions. Moreover, the models show only weak to moderate correlation with human judgment, with large variance across models and question variants, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.