Large-Language Memorization During the Classification of United States Supreme Court Cases

📅 2025-12-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: This study investigates how large language models (LLMs) rely on memorization when classifying U.S. Supreme Court (SCOTUS) decisions, a challenging legal natural language understanding (NLU) task characterized by lengthy documents, dense domain-specific terminology, and nonstandard structure. Method: The authors experiment with recent LLM fine-tuning and retrieval-based approaches, including parameter-efficient fine-tuning, on two category-based SCOTUS classification tasks, one with 15 topic labels and one with 279. Contribution/Results: Prompt-based models with memories, such as DeepSeek, can be more robust than earlier BERT-based models, scoring about 2 points higher on both the 15-class and the 279-class task.

📝 Abstract
Large-language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is expected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine-tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks, scoring about 2 points better than previous models not based on prompting.

Problem

Research questions and friction points this paper is trying to address.

Investigating LLM memorization strategies in legal text classification
Evaluating LLM performance on complex SCOTUS case categorization tasks
Comparing prompt-based and BERT models for Supreme Court topic classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using prompt-based models with memory for classification
Applying parameter-efficient fine-tuning on legal texts
Testing retrieval-based approaches on SCOTUS corpus
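
To make the idea of a retrieval-based, prompt-driven classifier more concrete, here is a minimal sketch. It is not the authors' pipeline: the summary does not describe their retriever, prompt template, or label names, so everything below (the bag-of-words cosine retriever, the functions `retrieve` and `build_prompt`, the toy excerpts, and the three topic labels) is a hypothetical stand-in. The sketch only builds the prompt text; sending it to a model such as DeepSeek is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, labeled_examples, k=2):
    """Return the k labeled (text, label) pairs most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        labeled_examples,
        key=lambda ex: cosine(q, Counter(tokenize(ex[0]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, labeled_examples, labels, k=2):
    """Assemble a few-shot classification prompt from retrieved examples."""
    lines = [
        "Classify the Supreme Court opinion excerpt into one of these topics: "
        + ", ".join(labels) + ".",
        "",
    ]
    for text, label in retrieve(query, labeled_examples, k):
        lines += [f"Excerpt: {text}", f"Topic: {label}", ""]
    lines += [f"Excerpt: {query}", "Topic:"]
    return "\n".join(lines)


# Hypothetical toy data; the real tasks use 15 or 279 topic labels.
LABELS = ["Criminal Procedure", "First Amendment", "Economic Activity"]
EXAMPLES = [
    ("The police searched the car without a warrant and arrested the defendant",
     "Criminal Procedure"),
    ("The statute restricted speech in public parks", "First Amendment"),
    ("The company challenged the tax on interstate commerce",
     "Economic Activity"),
]

if __name__ == "__main__":
    query = "The defendant challenged the search of his vehicle without a warrant"
    print(build_prompt(query, EXAMPLES, LABELS, k=1))
```

In practice, a real opinion is far longer than these one-line excerpts, so a retriever would more likely work over dense embeddings and truncated or chunked text. The token-count retriever is used here only so the sketch runs with the standard library alone.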