🤖 AI Summary
Existing legal benchmarking datasets conflate factual recall with reasoning and neglect the integrity and quality of the reasoning process. Method: We introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real judicial decisions, and formally operationalize the IRAC (Issue, Rule, Application, Conclusion) framework as a fine-grained, evaluable multi-step reasoning structure. A human-in-the-loop annotation pipeline ensures high-quality step-level labeling. Contribution/Results: Evaluation reveals that state-of-the-art LLMs achieve only moderate performance on MSLR, exposing critical gaps in legal reasoning capability, while model-induced chain-of-thought (CoT) prompts consistently outperform handcrafted prompts in coherence and accuracy. The dataset, annotation guidelines, and code are fully open-sourced, establishing a new benchmark and foundational infrastructure for legal AI reasoning research.
📝 Abstract
Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations and provides a reusable methodological framework for multi-step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the difficulty these models face in adapting to complex legal reasoning. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts, which models generate autonomously, improve reasoning coherence and quality and outperform human-designed prompts. MSLR contributes to advancing LLM reasoning and Chain-of-Thought strategies and offers open resources for future research. The dataset and code are available at https://github.com/yuwenhan07/MSLR-Bench and https://law.sjtu.edu.cn/flszyjzx/index.html.