Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the capability boundaries of large language models (LLMs) in hierarchical legal reasoning, with a focus on case-based reasoning. It proposes a three-stage hierarchical reasoning framework, introduces a verifiable legal knowledge hierarchy with a factor-based rule system, and contributes formal case-modeling techniques, factor extraction methods, and multi-stage controllable evaluation protocols. Experiments show that while LLMs achieve high accuracy on surface-level tasks, performance degrades markedly with reasoning depth: accuracy falls to 64.82%–92.09% on Task 2 and collapses to 11.46%–33.99% on Task 3. Critically, models consistently spend more computational resources on incorrect responses than on correct ones, revealing a counterintuitive "thinking longer does not mean thinking smarter" phenomenon. The study establishes an evaluation paradigm for complex-domain reasoning and delivers a reproducible, rigorously benchmarked framework for assessing hierarchical legal inference.

📝 Abstract
Case-based reasoning is a cornerstone of U.S. legal practice, requiring professionals to argue about a current case by drawing analogies to and distinguishing from past precedents. While Large Language Models (LLMs) have shown remarkable capabilities, their proficiency in this complex, nuanced form of reasoning needs further investigation. We propose a formal framework that decomposes the process of identifying significant distinctions between cases into three-stage reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions, analyzing their argumentative support, and evaluating their significance. Through comprehensive evaluation of modern reasoning LLMs, we reveal a paradox: while models achieve high accuracy on surface-level reasoning (Task 1), performance degrades on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Most strikingly, we find that models consistently expend more computational resources on incorrect responses than correct ones, suggesting that "thinking longer" does not always mean "thinking smarter." Our work provides a methodology for fine-grained analysis of LLM reasoning capabilities in complex domains and reveals fundamental limitations that must be addressed for robust and trustworthy legal AI.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM capabilities in hierarchical legal reasoning
Assessing proficiency in nuanced case-based legal analogies
Analyzing computational resource expenditure versus reasoning accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposing legal reasoning into three-stage tasks
Modeling cases using factors in legal hierarchy
Defining verifiable rules for case distinction analysis