MultiJustice: A Chinese Dataset for Multi-Party, Multi-Charge Legal Prediction

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Legal judgment prediction (LJP) lacks systematic modeling for multi-defendant, multi-charge scenarios. Method: We introduce MPMCP, the first Chinese benchmark dataset for multi-party, multi-charge judgment prediction, covering four representative judicial settings. We propose a unified multi-party–multi-charge joint prediction framework and conduct the first comprehensive evaluation of mainstream legal large language models (e.g., InternLM2, Lawformer) on charge classification and sentence prediction under a consistent benchmark. Results: Model performance degrades significantly with increasing case complexity—particularly in Setting S4 (multi-defendant + multi-charge), where InternLM2 and Lawformer suffer F1 drops of 4.5% and 19.7%, respectively, relative to S1. This exposes critical limitations in current models’ capacity to capture complex judicial structures. Our work establishes a new, reproducible benchmark and evaluation paradigm for judicial AI research.

📝 Abstract
Legal judgment prediction (LJP) offers a compelling way to aid legal practitioners and researchers. However, one research question remains relatively under-explored: should multiple defendants and charges be treated separately in LJP? To address this, we introduce a new dataset, Multi-Person Multi-Charge Prediction (MPMCP), and seek the answer by evaluating several prevailing legal large language models (LLMs) on four practical legal judgment scenarios: (S1) single defendant with a single charge, (S2) single defendant with multiple charges, (S3) multiple defendants with a single charge, and (S4) multiple defendants with multiple charges. We evaluate models on the dataset across two LJP tasks, i.e., charge prediction and penalty term prediction. Extensive experiments show that the scenario involving multiple defendants and multiple charges (S4) poses the greatest challenge, followed by S2, S3, and S1. The impact varies significantly depending on the model. For example, in S4 compared to S1, InternLM2 achieves approximately 4.5% lower F1-score and 2.8% higher LogD, while Lawformer shows around 19.7% lower F1-score and 19.0% higher LogD. Our dataset and code are available at https://github.com/lololo-xiao/MultiJustice-MPMCP.
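The two tasks are scored with complementary metrics: F1 for charge prediction (higher is better) and LogD for penalty term prediction (lower is better). A minimal sketch of how such metrics are commonly computed; the exact LogD definition below, mean absolute distance between log-scaled penalty terms in months with a +1 offset, is an assumption for illustration, not taken from the paper:

```python
import math

def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1 over charge labels (charge prediction)."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def log_distance(true_months, pred_months):
    """Mean absolute log-distance between true and predicted penalty
    terms; the +1 offset handles zero-month (e.g. fine-only) terms."""
    return sum(abs(math.log(t + 1) - math.log(p + 1))
               for t, p in zip(true_months, pred_months)) / len(true_months)
```

Under these definitions, a model's S4-vs-S1 gap is simply the difference in each metric computed separately over the two scenario subsets.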
Problem

Research questions and friction points this paper is trying to address.

Evaluating legal LLMs on multi-party multi-charge prediction scenarios
Assessing performance gaps in complex legal judgment prediction cases
Comparing model accuracy for single vs multiple defendants/charges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces multi-person multi-charge prediction dataset
Evaluates legal LLMs on four judgment scenarios
Identifies S4 as most challenging scenario
Authors

Xiao Wang
Saarland Informatics Campus, Saarland University, Saarland, Germany

Jiahuan Pei
Assistant Professor, Vrije Universiteit Amsterdam (VU Amsterdam)

Diancheng Shui
Wuhan University, Wuhan, China

Zhiguang Han
Nanyang Technological University, Singapore

Xin Sun
University of Amsterdam, Amsterdam, Netherlands

Dawei Zhu
Saarland Informatics Campus, Saarland University, Saarland, Germany

Xiaoyu Shen
Eastern Institute of Technology, Ningbo