The Material Contracts Corpus

📅 2025-04-01
🤖 AI Summary
This paper addresses the lack of large-scale, open, structured legal contract data for empirical contract research and AI-powered legal tool development. It constructs the first comprehensive, publicly available corpus of contracts filed with the SEC by U.S. public companies (2000–2023; more than one million documents), combining document parsing, metadata modeling, fine-grained agreement-type annotation, and contracting-party entity linking. Contract classification is automated with a fine-tuned LLaMA-2 model, achieving >92% accuracy. Key contributions include: (1) releasing a high-quality, downloadable, and queryable legal contract dataset; (2) documenting trends over two decades in contractual language, length, and complexity; and (3) identifying employment and security agreements as the most prevalent contract types. The corpus provides foundational infrastructure for legal natural language processing and computational law research.

📝 Abstract
This paper introduces the Material Contracts Corpus (MCC), a publicly available dataset comprising over one million contracts filed by public companies with the U.S. Securities and Exchange Commission (SEC) between 2000 and 2023. The MCC facilitates empirical research on contract design and legal language, and supports the development of AI-based legal tools. Contracts in the corpus are categorized by agreement type and linked to specific parties using machine learning and natural language processing techniques, including a fine-tuned LLaMA-2 model for contract classification. The MCC further provides metadata such as filing form, document format, and amendment status. We document trends in contractual language, length, and complexity over time, and highlight the dominance of employment and security agreements in SEC filings. This resource is available for bulk download and online access at https://mcc.law.stanford.edu.

Problem

Research questions and friction points this paper is trying to address.

Creating a public dataset of SEC-filed contracts for research
Enabling AI-based legal tool development with categorized contracts
Analyzing trends in contract language and complexity over time

Innovation

Methods, ideas, or system contributions that make the work stand out.

Public dataset of SEC contracts
Machine learning for contract classification
Metadata provision for legal research