Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models

📅 2025-11-11
🤖 AI Summary
Existing legal benchmarking datasets conflate factual recall with reasoning, neglecting the integrity and quality of the reasoning process. Method: We introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real judicial decisions, and formally operationalize the IRAC (Issue, Rule, Application, Conclusion) framework as a fine-grained, evaluable multi-step reasoning structure. A human-in-the-loop annotation pipeline ensures high-quality step-level labeling. We empirically demonstrate that model-induced chain-of-thought (CoT) significantly outperforms handcrafted prompts in coherence and accuracy. Contribution/Results: Evaluation reveals that state-of-the-art LLMs achieve only moderate performance on MSLR, exposing critical gaps in legal reasoning capability; model-induced CoT yields consistent gains. The dataset, annotation guidelines, and code are fully open-sourced, establishing a new benchmark and foundational infrastructure for legal AI reasoning research.

📝 Abstract
Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable Human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations and provides a reusable methodological framework for multi-step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the challenges of adapting to complex legal reasoning. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts generated by models autonomously improve reasoning coherence and quality, outperforming human-designed prompts. MSLR contributes to advancing LLM reasoning and Chain-of-Thought strategies and offers open resources for future research. The dataset and code are available at https://github.com/yuwenhan07/MSLR-Bench and https://law.sjtu.edu.cn/flszyjzx/index.html.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in existing legal reasoning benchmarks for LLMs
Developing MSLR, the first Chinese multi-step legal reasoning dataset
Evaluating Chain-of-Thought effects on complex legal reasoning quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MSLR, a Chinese multi-step legal reasoning dataset grounded in real judicial decisions
Uses the IRAC (Issue, Rule, Application, Conclusion) framework to model structured expert reasoning
Designs a Human-LLM collaborative pipeline for fine-grained step-level annotation
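The paper's dataset schema is not reproduced on this page; purely as an illustration of what an IRAC-structured, step-level record could look like, here is a minimal sketch. All names (`MSLRCase`, `ReasoningStep`, the demo content) are hypothetical and not taken from the released dataset.

```python
from dataclasses import dataclass, field

# The four IRAC stages, in the order a complete reasoning chain follows.
IRAC_STAGES = ("Issue", "Rule", "Application", "Conclusion")

@dataclass
class ReasoningStep:
    stage: str  # one of IRAC_STAGES
    text: str   # the annotated reasoning content for this step

@dataclass
class MSLRCase:
    case_id: str
    facts: str
    steps: list = field(default_factory=list)

    def add_step(self, stage: str, text: str) -> None:
        """Append one labeled reasoning step, rejecting unknown stages."""
        if stage not in IRAC_STAGES:
            raise ValueError(f"unknown IRAC stage: {stage}")
        self.steps.append(ReasoningStep(stage, text))

    def is_complete(self) -> bool:
        """A record is complete when all four stages appear, in IRAC order."""
        return tuple(s.stage for s in self.steps) == IRAC_STAGES

# Minimal demo record (content invented for illustration only).
case = MSLRCase(case_id="demo-001", facts="Seller failed to deliver goods under a sales contract.")
case.add_step("Issue", "Is the seller liable for breach of contract?")
case.add_step("Rule", "Statutory provisions on liability for breach of contract.")
case.add_step("Application", "The seller did not deliver the goods as agreed, constituting breach.")
case.add_step("Conclusion", "The seller is liable for damages.")
```

A structure like this makes step-level evaluation straightforward: each stage can be scored independently, and chain completeness is a simple ordered check rather than a free-text judgment.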
Wenhan Yu
School of Artificial Intelligence, Beihang University, Beijing, China
Xinbo Lin
KoGuan School of Law, Shanghai Jiao Tong University, Shanghai, China
Lanxin Ni
School of Criminal Justice, China University of Political Science and Law, Beijing, China
Jinhua Cheng
KoGuan School of Law, Shanghai Jiao Tong University, Shanghai, China
Lei Sha
Prof@Beihang University, Prof@ZGC Lab, Oxtium AI, University of Oxford
NLP, ML