🤖 AI Summary
Evaluation of large language models' (LLMs) legal reasoning capabilities in corporate governance, particularly for charter compliance assessment, lacks standardized, domain-specific benchmarks.
Method: We introduce CHANCERY, the first benchmark dedicated to charter compliance judgment, constructed from 24 core governance principles and 79 real-world corporate charters spanning diverse industries. It formalizes compliance verification as a binary classification task: determining whether proposals by executives, boards, or shareholders conform to charter provisions. Evaluation centers on rule-constrained logical reasoning, deploying ReAct- and CodeAct-based reasoning agents for fine-grained, traceable assessment.
Contribution/Results: State-of-the-art LLMs achieve only 64.5%–75.2% accuracy; reasoning agents improve performance to 76.1%–78.1%, yet still reveal persistent bottlenecks in handling cross-referenced clauses and inferring implicit obligations. CHANCERY fills a critical gap in legal reasoning evaluation for corporate governance and provides a foundational benchmark and diagnostic toolkit for trustworthy AI deployment in corporate law.
📝 Abstract
Law has long been a popular domain for natural language processing (NLP) applications. Reasoning (ratiocination and the ability to draw connections to precedent) is a core part of the practice of law in the real world. Nevertheless, while multiple legal datasets exist, none have thus far focused specifically on reasoning tasks. We focus on a specific aspect of the legal landscape by introducing a corporate governance reasoning benchmark (CHANCERY) to test a model's ability to reason about whether actions proposed by executives, boards, or shareholders are consistent with corporate governance charters. The benchmark introduces a first-of-its-kind corporate governance reasoning test for language models, modeled after real-world corporate governance law. Each benchmark instance consists of a corporate charter (a set of governing covenants) and a proposal for executive action. The model's task is one of binary classification: reason about whether the action is consistent with the rules contained within the charter. We create the benchmark following established principles of corporate governance: 24 concrete corporate governance principles and 79 real-life corporate charters, selected to represent diverse industries from a total dataset of 10,000 real-life corporate charters. Evaluations on state-of-the-art (SOTA) reasoning models confirm the difficulty of the benchmark, with models such as Claude 3.7 Sonnet and GPT-4o achieving 64.5% and 75.2% accuracy respectively. Reasoning agents exhibit superior performance, with agents based on the ReAct and CodeAct frameworks scoring 76.1% and 78.1% respectively, further confirming the advanced legal reasoning capabilities required to score highly on the benchmark. We also analyze the types of questions that current reasoning models struggle with, revealing insights into the legal reasoning capabilities of SOTA models.
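To make the task format concrete, here is a minimal sketch of how a charter-compliance evaluation harness might look. The example charter/proposal items, the prompt wording, and the keyword-based stub classifier are all illustrative assumptions for demonstration, not the paper's actual data, prompts, or models; in practice the stub would be replaced by a call to an LLM or a ReAct/CodeAct agent.

```python
def build_prompt(charter: str, proposal: str) -> str:
    """Frame compliance verification as a binary (YES/NO) classification."""
    return (
        "Charter provisions:\n" + charter + "\n\n"
        "Proposed action:\n" + proposal + "\n\n"
        "Is the proposal consistent with the charter? Answer YES or NO."
    )

def stub_model(prompt: str) -> str:
    # Placeholder for a real LLM/agent call; this toy heuristic flags
    # proposals that explicitly bypass shareholder approval.
    return "NO" if "without shareholder approval" in prompt else "YES"

def evaluate(items, model) -> float:
    """Return accuracy of the model's YES/NO compliance judgments."""
    correct = 0
    for charter, proposal, label in items:
        pred = model(build_prompt(charter, proposal)).strip().upper() == "YES"
        correct += (pred == label)
    return correct / len(items)

# Two hypothetical (charter, proposal, is_compliant) items.
toy_items = [
    ("Mergers require shareholder approval.",
     "The board approves a merger without shareholder approval.", False),
    ("The board may issue up to 1,000,000 shares.",
     "The board issues 500,000 new shares.", True),
]

print(evaluate(toy_items, stub_model))  # 1.0 on this toy set
```

The harness mirrors the benchmark's structure (charter + proposal in, binary judgment out); swapping `stub_model` for different models or agent frameworks yields directly comparable accuracy figures.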