Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models

📅 2025-11-11
🤖 AI Summary
Existing legal benchmarking datasets conflate factual recall with reasoning, neglecting the integrity and quality of the reasoning process. Method: We introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real judicial decisions, and formally operationalize the IRAC (Issue, Rule, Application, Conclusion) framework as a fine-grained, evaluable multi-step reasoning structure. A human-in-the-loop annotation pipeline ensures high-quality step-level labeling. We empirically demonstrate that model-induced chain-of-thought (CoT) significantly outperforms handcrafted prompts in coherence and accuracy. Contribution/Results: Evaluation reveals that state-of-the-art LLMs achieve only moderate performance on MSLR, exposing critical gaps in legal reasoning capability; model-induced CoT yields consistent gains. The dataset, annotation guidelines, and code are fully open-sourced, establishing a new benchmark and foundational infrastructure for legal AI reasoning research.

📝 Abstract
Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable Human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations and provides a reusable methodological framework for multi-step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the challenges of adapting to complex legal reasoning. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts generated by models autonomously improve reasoning coherence and quality, outperforming human-designed prompts. MSLR contributes to advancing LLM reasoning and Chain-of-Thought strategies and offers open resources for future research. The dataset and code are available at https://github.com/yuwenhan07/MSLR-Bench and https://law.sjtu.edu.cn/flszyjzx/index.html.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in existing legal reasoning benchmarks for LLMs
Developing MSLR, the first Chinese multi-step legal reasoning dataset
Evaluating Chain-of-Thought effects on complex legal reasoning quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MSLR, a Chinese multi-step legal reasoning dataset grounded in real judicial decisions
Uses the IRAC (Issue, Rule, Application, Conclusion) framework to model structured expert reasoning
Designs a Human-LLM collaborative pipeline for fine-grained step-level annotation
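The paper's dataset schema is not reproduced on this page; purely as an illustration of what an IRAC-structured, step-level record could look like, here is a minimal sketch. All names (`MSLRCase`, `ReasoningStep`, the demo content) are hypothetical and not taken from the released dataset.

```python
from dataclasses import dataclass, field

# The four IRAC stages, in the order a complete reasoning chain follows.
IRAC_STAGES = ("Issue", "Rule", "Application", "Conclusion")

@dataclass
class ReasoningStep:
    stage: str  # one of IRAC_STAGES
    text: str   # the annotated reasoning content for this step

@dataclass
class MSLRCase:
    case_id: str
    facts: str
    steps: list = field(default_factory=list)

    def add_step(self, stage: str, text: str) -> None:
        """Append one labeled reasoning step, rejecting unknown stages."""
        if stage not in IRAC_STAGES:
            raise ValueError(f"unknown IRAC stage: {stage}")
        self.steps.append(ReasoningStep(stage, text))

    def is_complete(self) -> bool:
        """A record is complete when all four stages appear, in IRAC order."""
        return tuple(s.stage for s in self.steps) == IRAC_STAGES

# Minimal demo record (content invented for illustration only).
case = MSLRCase(case_id="demo-001", facts="Seller failed to deliver goods under a sales contract.")
case.add_step("Issue", "Is the seller liable for breach of contract?")
case.add_step("Rule", "Statutory provisions on liability for breach of contract.")
case.add_step("Application", "The seller did not deliver the goods as agreed, constituting breach.")
case.add_step("Conclusion", "The seller is liable for damages.")
```

A structure like this makes step-level evaluation straightforward: each stage can be scored independently, and chain completeness is a simple ordered check rather than a free-text judgment.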
Wenhan Yu
School of Artificial Intelligence, Beihang University, Beijing, China
Xinbo Lin
KoGuan School of Law, Shanghai Jiao Tong University, Shanghai, China
Lanxin Ni
School of Criminal Justice, China University of Political Science and Law, Beijing, China
Jinhua Cheng
KoGuan School of Law, Shanghai Jiao Tong University, Shanghai, China
Lei Sha
Prof@Beihang University, Prof@ZGC Lab, Oxtium AI, University of Oxford
NLP, ML