Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning

📅 2025-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies systematic deficiencies in large language models’ (LLMs) stepwise legal reasoning for civil litigation multiple-choice questions—particularly in rule application, precedent analogy, and deductive-analogical synergistic reasoning. To address this, we propose the first fine-grained error taxonomy tailored to legal reasoning chains, encompassing core error types such as rule misapplication, precedent mismatch, and conflict imbalance. We further develop a reusable automated evaluation framework that quantifies both soundness (logical coherence) and correctness (factual accuracy) of reasoning via LLM-based assessment. An attribution-enhanced, prompt-engineering-driven error feedback mechanism is integrated to localize and interpret failures. Experiments demonstrate that our framework effectively pinpoints LLM weaknesses in analogical reasoning and rule-conflict resolution, yielding marginal yet consistent accuracy improvements. All components—including the diagnostic benchmark—are open-sourced to support scalable, reproducible legal reasoning evaluation.
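To make the evaluation pipeline concrete, here is a minimal sketch of an LLM-as-judge scorer in the spirit described above. All names, the prompt format, and the parsing scheme are illustrative assumptions, not the authors' actual implementation; the judge model call itself is left out, and only the reply-parsing and score-aggregation steps are shown.

```python
# Hypothetical sketch: taxonomy labels and prompt layout are assumptions,
# not the paper's actual code.

ERROR_TAXONOMY = [
    "rule_misapplication",
    "precedent_mismatch",
    "conflict_imbalance",
]

JUDGE_PROMPT = """You are grading one step of a legal reasoning chain.
Step: {step}
Gold rationale: {gold}
Reply with three lines:
SOUND: yes|no   (is the step logically coherent?)
CORRECT: yes|no (is the step factually accurate?)
ERROR: one of {taxonomy}, or none"""


def parse_judgement(reply: str) -> dict:
    """Parse a judge model's reply into soundness/correctness flags
    plus an error-taxonomy label."""
    out = {"sound": False, "correct": False, "error": "none"}
    for line in reply.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip().lower()
        if key == "SOUND":
            out["sound"] = value == "yes"
        elif key == "CORRECT":
            out["correct"] = value == "yes"
        elif key == "ERROR":
            out["error"] = value
    return out


def chain_scores(judgements: list) -> tuple:
    """Soundness / correctness score of a reasoning chain, taken here as
    the fraction of steps judged sound / correct."""
    n = len(judgements) or 1
    sound = sum(j["sound"] for j in judgements) / n
    correct = sum(j["correct"] for j in judgements) / n
    return sound, correct
```

In this sketch a chain's two scores are simple per-step averages; the paper's actual aggregation may differ.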

📝 Abstract
Reasoning abilities of LLMs have been a key focus in recent years. One challenging reasoning domain with interesting nuances is legal reasoning, which requires careful application of rules and precedents while balancing deductive and analogical reasoning and resolving conflicts between rules. Although there have been a few works on using LLMs for legal reasoning, their focus has been on overall accuracy. In this paper, we dig deeper with a step-by-step analysis to determine where they commit errors. We use the college-level Multiple Choice Question-Answering (MCQA) task from the Civil Procedure dataset and propose a new error taxonomy derived from initial manual analysis of reasoning chains with respect to several LLMs, including two objective measures: soundness and correctness scores. We then develop an LLM-based automated evaluation framework to identify reasoning errors and evaluate the performance of LLMs. The computation of soundness and correctness on the dataset using the auto-evaluator framework reveals several interesting insights. Furthermore, we show that incorporating the error taxonomy as feedback in popular prompting techniques marginally increases LLM performance. Our work will also serve as an evaluation framework that can be used in detailed error analysis of reasoning chains for logic-intensive complex tasks.
Problem

Research questions and friction points this paper is trying to address.

Analyzing LLMs' step-by-step legal reasoning errors
Developing an automated evaluation framework for LLMs
Improving LLM performance using error taxonomy feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated LLM evaluation framework
Error taxonomy for legal reasoning
Prompting techniques with feedback
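The feedback-prompting idea above can be sketched as a re-prompt that folds detected error-taxonomy labels back into the query. The function name and prompt wording here are illustrative assumptions, not the authors' actual prompts.

```python
# Illustrative sketch (assumed prompt format) of taxonomy-based error
# feedback, the mechanism the paper reports marginal gains from.

def feedback_prompt(question: str, prior_answer: str, errors: list) -> str:
    """Build a revision prompt that names the reasoning errors detected
    in the model's previous attempt."""
    hints = "; ".join(errors) if errors else "none detected"
    return (
        f"Question: {question}\n"
        f"Your previous answer: {prior_answer}\n"
        f"Detected reasoning errors: {hints}\n"
        "Revise your step-by-step reasoning, avoiding these error types, "
        "then give a final answer."
    )
```

The revised prompt would then be sent back to the model under whichever prompting technique (e.g. chain-of-thought) is being evaluated.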