CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions

📅 2024-12-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reliable evaluation of automatic summarization for long legal texts, particularly U.S. Supreme Court opinions, remains an open challenge due to the absence of comprehensive, expert-validated benchmarks. Method: We introduce CaseSumm, a long-context legal summarization benchmark pairing 25.6K Supreme Court opinions with their official syllabus summaries; it is the largest open legal case summarization dataset and the first to include SCOTUS decisions dating back to 1815. We evaluate LLM-generated summaries with automatic metrics (ROUGE, BERTScore) alongside rigorous legal expert annotation. Contribution/Results: Our analysis reveals a significant misalignment between automated scores and human expert judgments: Mistral 7B, a smaller open-source model, scores best on most automatic metrics yet frequently produces hallucinations such as factual inaccuracies and erroneous precedent citations, whereas GPT-4 summaries receive consistently higher expert ratings for clarity, sensitivity, and specificity. We show that standard automatic metrics are unreliable in this high-stakes legal domain and publicly release the CaseSumm dataset to advance trustworthy legal AI research.
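
The metric/human misalignment described above can be probed by scoring candidate summaries against the official syllabi and then correlating those scores with expert ratings. Below is a minimal sketch using the Hugging Face `evaluate` package and SciPy; the toy summaries, reference texts, and expert ratings are illustrative placeholders, not the paper's actual pipeline or data.

```python
# Hedged sketch: score generated summaries with ROUGE and BERTScore, then check
# how well those scores track human expert ratings. All inputs below are toy
# placeholders, not data from CaseSumm.
import evaluate
from scipy.stats import spearmanr

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

generated = [  # model-produced summaries (illustrative)
    "The Court held the statute applies retroactively.",
    "The petition was dismissed for lack of standing.",
    "The judgment below was affirmed in part.",
]
references = [  # official syllabus text (illustrative)
    "The Court ruled the statute reaches earlier conduct.",
    "The case was dismissed because the petitioner lacked standing.",
    "The lower court's judgment was affirmed in part and reversed in part.",
]
expert_ratings = [4, 5, 2]  # hypothetical 1-5 expert quality scores

# Per-example metric scores.
rouge_scores = rouge.compute(predictions=generated, references=references,
                             use_aggregator=False)
bert_scores = bertscore.compute(predictions=generated, references=references,
                                lang="en")

# Correlate each metric with the expert ratings; a weak correlation is the kind
# of metric/human misalignment the paper reports.
rho_rouge, _ = spearmanr(rouge_scores["rougeL"], expert_ratings)
rho_bert, _ = spearmanr(bert_scores["f1"], expert_ratings)
print(f"Spearman rho, ROUGE-L vs. experts:   {rho_rouge:.2f}")
print(f"Spearman rho, BERTScore vs. experts: {rho_bert:.2f}")
```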

📝 Abstract
This paper introduces CaseSumm, a novel dataset for long-context summarization in the legal domain that addresses the need for longer and more complex datasets for summarization evaluation. We collect 25.6K U.S. Supreme Court (SCOTUS) opinions and their official summaries, known as "syllabuses." Our dataset is the largest open legal case summarization dataset, and is the first to include summaries of SCOTUS decisions dating back to 1815. We also present a comprehensive evaluation of LLM-generated summaries using both automatic metrics and expert human evaluation, revealing discrepancies between these assessment methods. Our evaluation shows Mistral 7b, a smaller open-source model, outperforms larger models on most automatic metrics and successfully generates syllabus-like summaries. In contrast, human expert annotators indicate that Mistral summaries contain hallucinations. The annotators consistently rank GPT-4 summaries as clearer and exhibiting greater sensitivity and specificity. Further, we find that LLM-based evaluations are not more correlated with human evaluations than traditional automatic metrics. Furthermore, our analysis identifies specific hallucinations in generated summaries, including precedent citation errors and misrepresentations of case facts. These findings demonstrate the limitations of current automatic evaluation methods for legal summarization and highlight the critical role of human evaluation in assessing summary quality, particularly in complex, high-stakes domains. CaseSumm is available at https://huggingface.co/datasets/ChicagoHAI/CaseSumm
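
For readers who want to work with the data, the Hugging Face link in the abstract can be loaded directly with the `datasets` library. A minimal sketch follows; the split and column names printed here are whatever the dataset card defines, and the preview logic is illustrative only.

```python
# Minimal sketch: load CaseSumm from the Hugging Face Hub and peek at one record.
# Split and column names are taken from the dataset itself; nothing is hard-coded.
from datasets import load_dataset

ds = load_dataset("ChicagoHAI/CaseSumm")  # repository named in the abstract
print(ds)  # shows available splits, columns, and row counts

first_split = next(iter(ds.values()))
example = first_split[0]
for key, value in example.items():
    preview = value[:200] if isinstance(value, str) else value  # truncate long opinion text
    print(f"{key}: {preview}")
```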
Problem

Research questions and friction points this paper is trying to address.

Automatic Summarization
Legal Domain
Evaluation Metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

CaseSumm
Legal Domain Summarization
Automated vs Human Evaluation