🤖 AI Summary
This work addresses the limitations of existing legal reasoning benchmarks, which are costly, static, and ill-suited for pinpointing specific failure modes of models under complex regulatory constraints. The authors propose a dynamic task generation framework grounded in a symbolic representation of U.S. bankruptcy code provisions, enabling on-demand synthesis of natural language questions paired with machine-computable answers. The framework supports fine-grained control over task complexity and scope, facilitating diagnostic evaluation of targeted reasoning capabilities. By integrating an expert-constructed legal symbolic system with a dynamic generation mechanism, the authors construct a new benchmark comprising 9,765 samples. Evaluations across 13 language models reveal significant performance degradation specifically in scenarios involving long reasoning chains or the presence of distractor statements.
📝 Abstract
Reasoning benchmarks have played a crucial role in the progress of language models. Yet rigorous evaluation remains a significant challenge, as static question-answer pairs provide only a snapshot of performance, compressing complex behavior into a single accuracy metric. This limitation is especially acute in complex, rule-bound domains such as law, where existing benchmarks are costly to build and ill-suited for isolating specific failure modes. To address this, we introduce OpenExempt, a framework and benchmark for diagnostic evaluation of legal reasoning. The OpenExempt Framework uses expert-crafted symbolic representations of U.S. Bankruptcy Code statutes to dynamically generate a large space of natural language reasoning tasks and their machine-computable solutions on demand. This gives users fine-grained control over task complexity and scope, allowing individual reasoning skills to be probed in isolation. Using this system, we construct the OpenExempt Benchmark, a diagnostic benchmark for legal reasoning with 9,765 samples across nine evaluation suites designed to carefully probe model capabilities. Experiments on 13 diverse language models reveal sharp performance cliffs that emerge only under longer reasoning paths and in the presence of obfuscating statements. We release the framework and benchmark publicly to support research aimed at understanding and improving the next generation of reasoning systems.
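To make the generation idea concrete, the sketch below shows one way a symbolic statutory rule could be paired with a template to emit a natural language question alongside its machine-computable answer. This is a hypothetical illustration only: the `ExemptionRule` class, category names, and dollar amounts are invented for this example and do not reflect the paper's actual symbolic system or real Bankruptcy Code values.

```python
# Hypothetical sketch of dynamic task generation from a symbolic rule.
# All names and figures here are illustrative assumptions, not the
# OpenExempt Framework's real representation or real statutory values.
import random
from dataclasses import dataclass


@dataclass
class ExemptionRule:
    """A toy symbolic stand-in for one statutory exemption provision."""
    category: str
    cap: float  # maximum exemptible value for this category

    def exempt_amount(self, claimed: float) -> float:
        # The machine-computable ground truth: claims are capped.
        return min(claimed, self.cap)


def generate_task(rule: ExemptionRule, rng: random.Random) -> tuple[str, float]:
    """Render one natural-language question and compute its answer."""
    claimed = float(rng.randrange(1_000, 20_000, 500))
    question = (
        f"A debtor claims ${claimed:,.0f} in {rule.category} as exempt. "
        f"If the statutory cap for {rule.category} is ${rule.cap:,.0f}, "
        f"what amount is exempt?"
    )
    return question, rule.exempt_amount(claimed)


rng = random.Random(0)  # seeding makes generated tasks reproducible
rule = ExemptionRule(category="household goods", cap=7_500.0)
question, answer = generate_task(rule, rng)
print(question)
print(answer)
```

Because the answer is computed from the same symbolic rule that parameterizes the question text, every sampled task is automatically gradable, and varying which rules are composed (or adding distractor rules) gives the kind of complexity control the abstract describes.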