Transformer-Based Extraction of Statutory Definitions from the U.S. Code

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenges of scattered, structurally opaque, and ambiguously scoped statutory definitions across the U.S. Code (exceeding 200,000 pages). We propose an end-to-end, multi-stage NLP pipeline for the joint extraction of defined terms, definition content, and jurisdictional scope. Methodologically, we integrate XML structure parsing with a domain-adapted attention mechanism, design a definition-unit aggregation strategy, and introduce the first high-precision approach to jurisdictional scope identification, leveraging Legal-BERT fine-tuning, paragraph-level classification, attention-guided term–definition alignment, and rule-enhanced scope recognition. Evaluated on multiple titles of the U.S. Code, our system achieves 96.8% precision and 98.9% recall (F1 = 98.2%), significantly outperforming prior methods. This work establishes a novel paradigm for the semantic structuring of large-scale legal texts.

📝 Abstract
Automatic extraction of definitions from legal texts is critical for enhancing the comprehension and clarity of complex legal corpora such as the United States Code (U.S.C.). We present an advanced NLP system leveraging transformer-based architectures to automatically extract defined terms, their definitions, and their scope from the U.S.C. We address the challenges of automatically identifying legal definitions, extracting defined terms, and determining their scope within this complex corpus of over 200,000 pages of federal statutory law. Building upon previous feature-based machine learning methods, our updated model employs domain-specific transformers (Legal-BERT) fine-tuned specifically for statutory texts, significantly improving extraction accuracy. Our work implements a multi-stage pipeline that combines document structure analysis with state-of-the-art language models to process legal text from the XML version of the U.S. Code. Each paragraph is first classified using a fine-tuned legal domain BERT model to determine if it contains a definition. Our system then aggregates related paragraphs into coherent definitional units and applies a combination of attention mechanisms and rule-based patterns to extract defined terms and their jurisdictional scope. The definition extraction system is evaluated on multiple titles of the U.S. Code containing thousands of definitions, demonstrating significant improvements over previous approaches. Our best model achieves 96.8% precision and 98.9% recall (98.2% F1-score), substantially outperforming traditional machine learning classifiers. This work contributes to improving accessibility and understanding of legal information while establishing a foundation for downstream legal reasoning tasks.
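To illustrate the rule-based side of the pipeline described above, here is a minimal sketch of defined-term and scope extraction using regular expressions. The patterns, function name, and example sentence are illustrative assumptions, not the authors' implementation; the paper's actual system combines such rules with attention-based alignment.

```python
import re

# U.S. Code definitions often follow the phrasing:
#   'In this chapter, the term "X" means ...'
# Hypothetical patterns for scope and defined term (not the paper's actual rules).
SCOPE_RE = re.compile(r'^(?:In|For purposes of) (this \w+)[,:]', re.IGNORECASE)
TERM_RE = re.compile(r'[Tt]he term ["\u201c]([^"\u201d]+)["\u201d]\s+(?:means|includes)')

def extract_definition(paragraph: str):
    """Return (defined_term, scope) if the paragraph looks like a statutory
    definition, else None. Scope is None when no scope phrase is present."""
    term = TERM_RE.search(paragraph)
    if not term:
        return None
    scope = SCOPE_RE.match(paragraph)
    return term.group(1), scope.group(1) if scope else None

example = 'In this chapter, the term "employee" means an individual employed by an employer.'
```

In the full system, a pattern match like this would serve only as one signal; the fine-tuned Legal-BERT classifier decides whether a paragraph is definitional in the first place.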
Problem

Research questions and friction points this paper is trying to address.

Automatically extract legal definitions from U.S. Code
Identify defined terms and their jurisdictional scope
Improve accuracy using domain-specific transformer models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based NLP for legal text extraction
Fine-tuned Legal-BERT for definition identification
Multi-stage pipeline combining structure and language models
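The multi-stage pipeline named in the bullets above can be skeletonized as follows. This is a hedged sketch: the keyword-based `is_definition` stub stands in for the fine-tuned Legal-BERT paragraph classifier, and the aggregation logic is an assumed simplification of the paper's definition-unit strategy.

```python
from dataclasses import dataclass, field

@dataclass
class DefinitionUnit:
    """A coherent definitional unit built from one or more paragraphs."""
    paragraphs: list = field(default_factory=list)

def is_definition(paragraph: str) -> bool:
    # Trivial stub standing in for the fine-tuned Legal-BERT classifier.
    return "the term" in paragraph.lower()

def aggregate(paragraphs):
    """Stage 2 sketch: group consecutive definition paragraphs into units."""
    units, current = [], None
    for p in paragraphs:
        if is_definition(p):
            if current is None:
                current = DefinitionUnit()
            current.paragraphs.append(p)
        else:
            if current is not None:
                units.append(current)
            current = None
    if current is not None:
        units.append(current)
    return units
```

A third stage would then run term and scope extraction over each `DefinitionUnit` rather than over isolated paragraphs, which is what lets the system handle definitions that span multiple subparagraphs.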