🤖 AI Summary
This paper addresses event coreference resolution in legal contracts, a setting characterized by extreme document length, high event density, and exceptionally long coreference spans (ranging from short- to ultra-long-distance). The authors introduce LegalCore, the first domain-specific benchmark for legal event coreference. Built from contracts averaging around 25K tokens per document, LegalCore systematically defines and manually annotates legal events along with their coreferential relations, exposing the difficulties posed by high mention density and ultra-long-distance coreference. On top of LegalCore, the authors establish a joint evaluation benchmark for event detection and event coreference resolution, assessing mainstream open-source and proprietary large language models (LLMs). Experiments show that state-of-the-art LLMs perform significantly worse than a supervised baseline, underscoring the task's difficulty and the need for domain adaptation. The authors state that the dataset and code will be released.
📝 Abstract
Recognizing events and their coreferential mentions in a document is essential for understanding the semantic meaning of text. Existing research on event coreference resolution is mostly limited to news articles. In this paper, we present LegalCore, the first dataset for the legal domain annotated with comprehensive event and event coreference information. The legal contract documents annotated in this dataset are several times longer than news articles, with an average length of around 25K tokens per document. The annotations show that legal documents have dense event mentions and feature both short-distance and very long-distance coreference links between event mentions. We further benchmark mainstream Large Language Models (LLMs) on this dataset for both event detection and event coreference resolution, and find that it poses significant challenges for state-of-the-art open-source and proprietary LLMs, which perform substantially worse than a supervised baseline. We will release both the dataset and the code.
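Event coreference resolution is typically scored by comparing predicted clusters of event mentions against gold-annotated clusters. As a minimal sketch of one standard clustering metric, B³ (B-cubed), here is a self-contained Python implementation; the toy mention strings and cluster layout in the usage example are invented for illustration and are not taken from LegalCore:

```python
def b_cubed(gold_clusters, pred_clusters):
    """Compute B-cubed precision, recall, and F1 over mention clusters.

    Each argument is an iterable of sets of mention identifiers.
    """
    # Map each mention to the cluster (as a frozenset) it belongs to.
    gold = {m: frozenset(c) for c in gold_clusters for m in c}
    pred = {m: frozenset(c) for c in pred_clusters for m in c}
    mentions = set(gold) & set(pred)

    # Per-mention precision: overlap between the mention's predicted and gold
    # clusters, divided by the predicted cluster size; recall is the mirror.
    precision = sum(len(pred[m] & gold[m]) / len(pred[m]) for m in mentions) / len(mentions)
    recall = sum(len(pred[m] & gold[m]) / len(gold[m]) for m in mentions) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: gold says "signed" and "executed" corefer; a system instead
# merges "executed" with "terminate".
gold = [{"signed", "executed"}, {"terminate"}]
pred = [{"signed"}, {"executed", "terminate"}]
p, r, f = b_cubed(gold, pred)  # each of p, r, f evaluates to 2/3 here
```

In practice, coreference systems report B³ alongside MUC and CEAF and average them into a CoNLL score; the sketch above shows only the B³ component.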