PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice

πŸ“… 2026-01-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing legal large language model (LLM) evaluation benchmarks are overly simplified and fail to assess the nuanced reasoning and ambiguity-handling capabilities required in real-world legal practice. To address this gap, this work proposes PLawBench, a benchmark grounded in authentic legal workflows that encompasses three core tasks: public legal consultation, case analysis, and legal document generation. It introduces, for the first time, a rubric-based fine-grained evaluation framework covering 13 practical scenarios and approximately 12,500 expert-designed scoring criteria. The benchmark integrates structured tasks, expert-annotated standards, and an LLM-based automatic evaluator aligned with human judgment; evaluations of ten state-of-the-art models reveal significant deficiencies in fine-grained legal reasoning, demonstrating the effectiveness and necessity of the proposed benchmark.

πŸ“ Abstract
As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model's ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.
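The abstract describes scoring each question against expert-designed rubric items using an LLM-based evaluator. A minimal sketch of how such rubric-based scoring could be aggregated is shown below; the data shapes, the `RubricItem` structure, and the keyword-matching stub judge are illustrative assumptions, not the authors' actual implementation (which uses an LLM evaluator aligned with human expert judgments).

```python
# Hedged sketch of rubric-based scoring: each question carries expert-designed
# rubric items, a judge decides whether an answer satisfies each item, and the
# score is the weighted fraction of satisfied items.
from dataclasses import dataclass


@dataclass
class RubricItem:
    criterion: str   # expert-designed scoring criterion (assumed shape)
    weight: float    # relative importance of this criterion (assumed shape)


def stub_judge(answer: str, criterion: str) -> bool:
    """Placeholder for the LLM-based evaluator: here, simple keyword
    containment stands in for a model judgment."""
    return criterion.lower() in answer.lower()


def score_answer(answer: str, rubric: list[RubricItem], judge=stub_judge) -> float:
    """Return the weighted fraction of rubric items the answer satisfies (0..1)."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if judge(answer, item.criterion))
    return earned / total if total else 0.0


# Toy example: two rubric items, the answer satisfies only the first.
rubric = [
    RubricItem("statute of limitations", 2.0),
    RubricItem("burden of proof", 1.0),
]
answer = "The claim is barred by the statute of limitations."
print(round(score_answer(answer, rubric), 3))  # prints 0.667
```

In the benchmark itself, `stub_judge` would be replaced by a prompt to the LLM evaluator asking whether the model's answer meets the criterion, with the evaluator's agreement validated against human experts.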
Problem

Research questions and friction points this paper is trying to address.

legal benchmark
large language models
legal reasoning
real-world legal practice
evaluation rubrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric-based evaluation
legal reasoning
real-world legal practice
fine-grained assessment
LLM benchmarking
πŸ”Ž Similar Papers
Y
Yuzhen Shi
Alibaba Group
H
Huanghai Liu
Qwen Team, Alibaba Group
Y
Yiran Hu
University of Waterloo
G
Gaojie Song
Skylenage
X
Xinran Xu
Skylenage, Shanghai Jiao Tong University
Yubo Ma
Yubo Ma
Nanyang Technological University
Event ExtractionInformation ExtractionNatural Language Processing
Tianyi Tang
Tianyi Tang
Qwen Team, Alibaba Group & Renmin University of China
Artificial IntelligenceNatural Language Processing
Li Zhang
Li Zhang
University of Pittsburgh
Artificial Intelligence and Law
Q
Qingjing Chen
University of Bologna
Di Feng
Di Feng
Simons Laufer Mathematical Sciences Institute (Mathematical Sciences Research Institute)
Decision TheoryGame TheoryExperimental EconomicsMarket DesignFinancial Economics
W
Wenbo Lv
Skylenage
W
Weiheng Wu
Skylenage
Kexin Yang
Kexin Yang
Qwen Team
Natural Language ProcessingControllable Text Generation
S
Sen Yang
Qwen Team, Alibaba Group
Wei Wang
Wei Wang
Tongyi Lab, Alibaba Group
Generative Models
R
Rongyao Shi
Skylenage
Y
Yuanyang Qiu
Skylenage
Y
Yuemeng Qi
Skylenage
J
Jingwen Zhang
Skylenage
X
Xiaoyu Sui
Skylenage
Y
Yifan Chen
Alibaba Group
Y
Yi Zhang
Skylenage
An Yang
An Yang
Qwen Team, Peking University
Nature Language Processing (NLP)
Bowen Yu
Bowen Yu
Qwen Team, Alibaba Group
Post-trainingFoundation Model
Da Liu
Da Liu
North China Electric Power University
Energy Supply Chain Management
Junyang Lin
Junyang Lin
Qwen Team, Alibaba Group & Peking University
Natural Language ProcessingCross-Modal Representation LearningPretraining
W
Weixing Shen
Tsinghua University
Bing Zhao
Bing Zhao
SRI International
Natural Language ProcessingMachine LearningOptimizations
C
Charles L.A. Clarke
University of Waterloo
H
Huajie Wei
Alibaba Group