🤖 AI Summary
Electronic discovery (eDiscovery) faces significant challenges—including complex legal semantics, dense entity references, and pronounced long-tail distributions—making it difficult for existing methods to achieve high retrieval accuracy and interpretability simultaneously. This paper proposes DISCOG, a two-stage framework: first, it constructs a heterogeneous graph linking legal documents, entities, and clauses, and employs graph neural networks (GNNs) for high-precision relevance ranking; second, it uses prompt engineering to guide large language models (LLMs) in generating legally grounded, interpretable reasoning. By integrating graph learning with LLMs, DISCOG avoids the traditional trade-off among performance, throughput, and transparency. Experiments show average improvements over baselines of 12%, 3%, and 16% in F1-score, precision, and recall, respectively. In enterprise deployment, DISCOG reduces review costs by 99.9% compared to manual review and by 95% compared to pure LLM-based classification.
📝 Abstract
Electronic Discovery (eDiscovery) involves identifying relevant documents from a vast collection based on legal production requests. The integration of artificial intelligence (AI) and natural language processing (NLP) has transformed this process, streamlining document review and enhancing efficiency and cost-effectiveness. Although traditional approaches like BM25 or fine-tuned pre-trained models are common in eDiscovery, they face performance, computational, and interpretability challenges. In contrast, Large Language Model (LLM)-based methods prioritize interpretability but sacrifice performance and throughput. This paper introduces DISCOvery Graph (DISCOG), a hybrid approach that combines the strengths of both worlds: a heterogeneous graph-based method for accurate document relevance prediction and a subsequent LLM-driven approach for reasoning. Graph representation learning generates embeddings and predicts links, ranking the corpus for a given request, and the LLMs provide reasoning for document relevance. Our approach handles datasets with balanced and imbalanced distributions, outperforming baselines in F1-score, precision, and recall by an average of 12%, 3%, and 16%, respectively. In an enterprise context, our approach drastically reduces document review costs by 99.9% compared to manual processes and by 95% compared to LLM-based classification methods.
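The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: a bag-of-words cosine score stands in for the GNN link predictor of stage one, stage two only shows the kind of prompt that would be sent to an LLM, and the corpus and request texts are invented:

```python
# Minimal sketch of a two-stage retrieve-then-reason pipeline.
# Stage 1: rank documents against a production request (here a toy
# bag-of-words cosine score stands in for the learned GNN link predictor).
# Stage 2: build an LLM prompt asking for a relevance justification.
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words counts (placeholder for GNN output)."""
    return Counter(text.lower().split())

def score(req: Counter, doc: Counter) -> float:
    """Cosine similarity, standing in for the link-prediction score."""
    dot = sum(req[w] * doc[w] for w in req)
    norm = sqrt(sum(v * v for v in req.values())) * sqrt(sum(v * v for v in doc.values()))
    return dot / norm if norm else 0.0

def rank_corpus(request: str, corpus: dict) -> list:
    """Stage 1: rank every document in the corpus against the request."""
    req_vec = embed(request)
    scored = [(doc_id, score(req_vec, embed(text))) for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def reasoning_prompt(request: str, doc_id: str, text: str) -> str:
    """Stage 2: prompt an LLM to justify the relevance of a ranked document."""
    return (
        f"Production request: {request}\n"
        f"Document {doc_id}: {text}\n"
        "Explain, citing specific passages, why this document is or is not responsive."
    )

# Invented example data for illustration only.
corpus = {
    "doc1": "email discussing the merger agreement and due diligence",
    "doc2": "cafeteria menu for the week",
}
request = "all documents concerning the merger agreement"
ranking = rank_corpus(request, corpus)
top_id, top_score = ranking[0]
print(top_id)  # doc1 ranks first on this toy corpus
print(reasoning_prompt(request, top_id, corpus[top_id]))
```

In the actual framework, stage one's scores would come from embeddings and predicted links in the heterogeneous graph, and stage two's prompt would be answered by an LLM to produce the interpretable reasoning the paper describes.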