🤖 AI Summary
Evaluating the trustworthiness and interpretability of reasoning trajectories generated by large language models (LLMs) in legal domains remains challenging due to the lack of structured, domain-specific benchmarks.
Method: We introduce LEGIT, a large-scale, structured legal reasoning dataset that models court judgments as hierarchical problem trees comprising opposing arguments and judicial conclusions. We propose a fine-grained, rubric-based evaluation framework grounded in these legal problem trees, quantifying reasoning coverage and correctness. Further, we integrate retrieval-augmented generation (RAG) with rubric-guided reinforcement learning (RL): RAG enhances reasoning completeness, while RL improves logical correctness, yielding complementary performance gains.
Contribution/Results: This work pioneers the joint integration of structured argument modeling, expert annotation, and rubric-driven RL for legal reasoning evaluation. It establishes a reproducible methodological framework and a foundational benchmark resource for trustworthy AI in specialized domains.
📝 Abstract
Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of the opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of reasoning traces. We verify the reliability of these rubrics via human expert annotations and via comparison with coarser, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs' legal reasoning quality depends heavily on both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and RL with rubrics bring complementary benefits: RAG improves overall reasoning capability, whereas RL improves correctness, albeit with reduced coverage.
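To make the rubric idea concrete, the sketch below shows one plausible way to score a reasoning trace against a hierarchical issue tree for coverage and correctness. This is a minimal illustration under assumed data structures (`IssueNode`, `score_trace`, and the toy tree are all hypothetical), not the paper's actual schema or scoring formula.

```python
# Hypothetical sketch: a legal issue tree whose nodes carry the court's
# conclusion, used as a rubric to score a model's reasoning trace.
from dataclasses import dataclass, field

@dataclass
class IssueNode:
    issue: str                                  # the legal issue at this node
    conclusion: str                             # the court's conclusion on it
    children: list = field(default_factory=list)

def flatten(node):
    """Yield every node in the issue tree (preorder)."""
    yield node
    for child in node.children:
        yield from flatten(child)

def score_trace(root, addressed, correct):
    """Score a reasoning trace against the rubric tree.

    addressed: issues the trace discusses at all (drives coverage)
    correct:   issues where the trace matches the court's conclusion
    """
    issues = [n.issue for n in flatten(root)]
    coverage = sum(i in addressed for i in issues) / len(issues)
    hit = [i for i in issues if i in addressed]
    # correctness is measured only over the issues actually addressed
    correctness = sum(i in correct for i in hit) / len(hit) if hit else 0.0
    return coverage, correctness

# Toy example: three issues, two addressed, one resolved correctly.
tree = IssueNode("liability", "defendant liable", [
    IssueNode("negligence", "established"),
    IssueNode("damages", "partially awarded"),
])
cov, cor = score_trace(tree,
                       addressed={"liability", "negligence"},
                       correct={"negligence"})
# cov ≈ 0.667 (2 of 3 issues covered), cor = 0.5 (1 of 2 addressed correct)
```

Separating the two scores this way mirrors the paper's observation that coverage and correctness can move independently, e.g. RL improving correctness while coverage drops.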