CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis

πŸ“… 2025-09-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Large language models (LLMs) exhibit insufficient reliability in Chinese legal knowledge retrieval and reasoning. Method: We introduce CLaw, the first fine-grained Chinese legal evaluation benchmark, comprising an article-level corpus of 306 national statutes and reasoning tasks derived from 254 cases adjudicated by the Supreme People's Court. CLaw incorporates the historical revision timestamps of legal provisions, enabling temporally sensitive assessment, and explicitly decouples knowledge retrieval from logical reasoning so each can be evaluated independently. Results: State-of-the-art LLMs show significant deficiencies in faithfully reproducing statutory text. The findings underscore that robust legal reasoning requires tight integration of accurate knowledge retrieval, which supervised fine-tuning (SFT) and retrieval-augmented generation (RAG) can strengthen, with strong deductive capability. CLaw establishes a critical benchmark and methodological foundation for evaluating and advancing Chinese legal LLMs.
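The temporally sensitive lookup described above, pairing each provision with its revision timestamp, can be illustrated with a minimal sketch. CLaw's actual data schema is not published here; the `Provision` fields and the `provision_as_of` helper below are hypothetical illustrations of the idea, not the benchmark's own format.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Provision:
    """One subparagraph-level corpus entry (field names are illustrative)."""
    statute: str          # e.g. "Civil Code of the People's Republic of China"
    article: str          # e.g. "Article 1043, Paragraph 1"
    text: str             # exact wording of this revision
    effective_from: date  # revision timestamp of this version

def provision_as_of(versions: list[Provision], query_date: date) -> Provision | None:
    """Return the revision of a provision that was in force on query_date."""
    in_force = [v for v in versions if v.effective_from <= query_date]
    return max(in_force, key=lambda v: v.effective_from, default=None)
```

Keeping every historical revision and selecting by date is what makes an evaluation "temporally sensitive": a model cited against a 2015 case must be scored against the provision text in force then, not the current wording.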

πŸ“ Abstract
Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes, yet their reliability is often compromised by general pre-training that ingests legal texts without specialized focus, obscuring the true depth of their legal knowledge. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises two key components: (1) a comprehensive, fine-grained corpus of all 306 Chinese national statutes, segmented to the subparagraph level and incorporating precise historical revision timestamps for rigorous recall evaluation (64,849 entries), and (2) a challenging set of 254 case-based reasoning instances derived from materials curated by the Supreme People's Court of China to assess the practical application of legal knowledge. Our empirical evaluation reveals that most contemporary LLMs significantly struggle to faithfully reproduce legal provisions. As accurate retrieval and citation of legal provisions form the basis of legal reasoning, this deficiency critically undermines the reliability of their responses. We contend that achieving trustworthy legal reasoning in LLMs requires a robust synergy of accurate knowledge retrieval (potentially enhanced through supervised fine-tuning (SFT) or retrieval-augmented generation (RAG)) and strong general reasoning capabilities. This work provides an essential benchmark and critical insights for advancing domain-specific LLM reasoning, particularly within the complex legal sphere.
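As a concrete illustration of the retrieval-augmented generation the abstract mentions, the sketch below ranks statute subparagraphs by character n-gram TF-IDF similarity to a question and prepends the top hits to the prompt. This is a minimal sketch under our own assumptions, not the paper's pipeline; the plain-string corpus format and the scikit-learn retriever are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
    # Character n-grams sidestep the need for a Chinese word segmenter.
    vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
    doc_matrix = vec.fit_transform(corpus)
    sims = cosine_similarity(vec.transform([question]), doc_matrix)[0]
    top = sims.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def build_prompt(question: str, corpus: list[str]) -> str:
    # Ground the model's answer in the retrieved provisions.
    context = "\n".join(retrieve(question, corpus))
    return f"Relevant provisions:\n{context}\n\nQuestion: {question}"
```

A production retriever would likely use dense embeddings rather than TF-IDF, but the structure (retrieve, then condition generation on the retrieved text) is the same.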
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to accurately recall Chinese legal provisions verbatim (a scoring sketch follows this list)
Assessing legal reasoning capabilities on case-based instances drawn from the Supreme People's Court
Addressing reliability issues in legal text analysis and statute citation
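Since verbatim recall of provisions is the first research question, a simple scoring sketch is shown below. The metric choice (exact match plus character-level similarity) is our assumption, not necessarily CLaw's evaluation protocol.

```python
from difflib import SequenceMatcher

def recall_score(model_output: str, gold_provision: str) -> dict:
    """Score how faithfully a model reproduces a provision's wording."""
    pred = "".join(model_output.split())  # strip whitespace and newlines
    gold = "".join(gold_provision.split())
    return {
        "exact_match": pred == gold,
        "char_similarity": SequenceMatcher(None, pred, gold).ratio(),
    }
```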
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained corpus with subparagraph-level segmentation and historical revision timestamps
Case-based reasoning instances from Supreme Court materials
Synergy of knowledge retrieval and reasoning capabilities
Xinzhe Xu
Peking University

Liang Zhao
LLM-Core Xiaomi

Hongshen Xu
Shanghai Jiao Tong University
Natural Language Processing, Large Language Model, LLM Alignment

Chen Chen
LLM-Core Xiaomi