ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Identification in Commercial Contracts

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses clause-level legal risk identification in commercial contracts, introducing ContractEval, the first fine-grained evaluation benchmark built on CUAD, to systematically assess 4 proprietary and 15 open-source large language models (LLMs). Methodologically, the authors propose a multidimensional evaluation framework integrating accuracy, output validity, and reasoning-pattern analysis, while examining trade-offs among model scale, inference strategy (e.g., chain-of-thought), and quantization. Key findings: (1) open-source LLMs frequently exhibit "cognitive laziness," answering "no related clause" even when a relevant clause is present, suggesting low confidence in extraction; (2) proprietary models are consistently stronger overall, though several open-source variants are competitive on specific dimensions; and (3) scaling open-source models yields diminishing returns, reasoning ("thinking") mode improves output effectiveness but reduces correctness, and quantization accelerates inference at the cost of precision.

📝 Abstract
The potential of large language models (LLMs) in specialized domains such as legal risk analysis remains underexplored. In response to growing interest in locally deploying open-source LLMs for legal tasks while preserving data confidentiality, this paper introduces ContractEval, the first benchmark to thoroughly evaluate whether open-source LLMs could match proprietary LLMs in identifying clause-level legal risks in commercial contracts. Using the Contract Understanding Atticus Dataset (CUAD), we assess 4 proprietary and 15 open-source LLMs. Our results highlight five key findings: (1) Proprietary models outperform open-source models in both correctness and output effectiveness, though some open-source models are competitive in certain specific dimensions. (2) Larger open-source models generally perform better, though the improvement slows down as models get bigger. (3) Reasoning ("thinking") mode improves output effectiveness but reduces correctness, likely due to over-complicating simpler tasks. (4) Open-source models generate "no related clause" responses more frequently even when relevant clauses are present. This suggests "laziness" in thinking or low confidence in extracting relevant content. (5) Model quantization speeds up inference but at the cost of performance drop, showing the tradeoff between efficiency and accuracy. These findings suggest that while most LLMs perform at a level comparable to junior legal assistants, open-source models require targeted fine-tuning to ensure correctness and effectiveness in high-stakes legal settings. ContractEval offers a solid benchmark to guide future development of legal-domain LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating open-source LLMs for legal risk identification in contracts
Comparing proprietary and open-source LLMs in clause-level legal analysis
Assessing performance tradeoffs in model size, reasoning, and quantization
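The clause-level evaluation the paper describes reduces to checking, per CUAD category, whether a model extracts the relevant clause or abstains with "no related clause." A minimal sketch of such scoring is below; the field names, abstention string, and exact-match-after-normalization rule are illustrative assumptions, not the paper's actual metric code.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for lenient span comparison."""
    return " ".join(text.lower().split())

def score_predictions(items: list[dict]) -> dict:
    """Score clause-extraction outputs.

    Each item has:
      'gold' -- the reference clause text, or None when the contract
                truly has no relevant clause for the category;
      'pred' -- the model output, where the literal string
                "no related clause" signals abstention.
    Returns overall accuracy and the false-abstention rate, i.e. how
    often the model says "no related clause" despite a gold clause
    (the "laziness" behavior the paper reports).
    """
    correct = 0
    false_abstentions = 0
    for item in items:
        abstained = normalize(item["pred"]) == "no related clause"
        if item["gold"] is None:
            # Correct only if the model also abstains.
            correct += abstained
        elif abstained:
            false_abstentions += 1
        elif normalize(item["pred"]) == normalize(item["gold"]):
            correct += 1
    n = len(items)
    return {
        "accuracy": correct / n,
        "false_abstention_rate": false_abstentions / n,
    }
```

In practice a benchmark like this would use fuzzier span matching (token overlap or IoU against annotated spans) rather than exact string equality, but the accuracy/abstention split is the core of the comparison.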
Innovation

Methods, ideas, or system contributions that make the work stand out.

ContractEval, the first clause-level legal risk benchmark built on CUAD
Multidimensional evaluation combining accuracy, output validity, and reasoning-pattern analysis
Systematic study of tradeoffs across model scale, reasoning mode, and quantization for 19 LLMs