Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics

📅 2025-11-30
🤖 AI Summary
Evaluating the trustworthiness and interpretability of reasoning trajectories generated by large language models (LLMs) in legal domains remains challenging due to the lack of structured, domain-specific benchmarks. Method: We introduce LEGIT, a large-scale, structured legal reasoning dataset that models court judgments as hierarchical legal issue trees comprising opposing parties' arguments and judicial conclusions. We propose a fine-grained, rubric-based evaluation framework grounded in these issue trees, quantifying reasoning coverage and correctness. Further, we combine retrieval-augmented generation (RAG) with rubric-guided reinforcement learning (RL): RAG enhances reasoning completeness, while RL improves logical correctness, yielding complementary performance gains. Contribution/Results: This work jointly integrates structured argument modeling, expert annotation, and rubric-driven RL for legal reasoning evaluation. It establishes a reproducible methodological framework and a benchmark resource for trustworthy AI in specialized domains.

📝 Abstract
Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. We verify the reliability of these rubrics via human expert annotations and comparison with coarse, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs' legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and RL with rubrics bring complementary benefits for legal reasoning abilities, where RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.
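The abstract's rubric metrics can be illustrated with a minimal sketch. Assumptions: `IssueNode`, `flatten`, and `score_trace` are hypothetical names not from the paper; coverage is taken as the fraction of rubric issues the trace addresses, and correctness as the fraction of addressed issues whose conclusion matches the court's. The paper's actual scoring may differ.

```python
from dataclasses import dataclass, field

@dataclass
class IssueNode:
    """One node of a legal issue tree: an issue, opposing parties'
    positions resolved into the court's conclusion, and sub-issues."""
    issue: str
    conclusion: str  # e.g. "plaintiff" or "defendant"
    children: list["IssueNode"] = field(default_factory=list)

def flatten(node: IssueNode):
    """Yield every node in the tree, depth-first."""
    yield node
    for child in node.children:
        yield from flatten(child)

def score_trace(root: IssueNode, trace_findings: dict[str, str]):
    """Score a reasoning trace against the issue-tree rubric.

    trace_findings maps each issue the trace addressed to the
    conclusion it reached. Returns (coverage, correctness)."""
    nodes = list(flatten(root))
    covered = [n for n in nodes if n.issue in trace_findings]
    correct = [n for n in covered if trace_findings[n.issue] == n.conclusion]
    coverage = len(covered) / len(nodes)
    correctness = len(correct) / len(covered) if covered else 0.0
    return coverage, correctness
```

For example, a trace that addresses two of three rubric issues but resolves only one of them as the court did would score coverage 2/3 and correctness 1/2, separating the two failure modes the abstract contrasts (RAG aiding coverage, RL aiding correctness).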
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-generated legal reasoning trace quality
Introducing LEGIT dataset for legal reasoning evaluation
Assessing legal issue coverage and correctness in reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical legal issue trees for evaluation
Large-scale expert-level dataset for reasoning assessment
RAG and RL with rubrics enhance legal reasoning