LLM-based Embedders for Prior Case Retrieval

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Precedent case retrieval (PCR) in common law systems faces two key challenges: context loss due to long-text truncation and scarcity of labeled training data. To address these, this paper proposes an unsupervised large language model (LLM)-based embedding method that generates dense vector representations for lengthy legal texts without fine-tuning or human annotation—thereby overcoming the sequence-length limitations inherent in conventional BERT-style models. Experiments on four standard PCR benchmarks demonstrate that our approach significantly outperforms both BM25 and supervised Transformer baselines, confirming its efficacy in preserving legal semantic integrity while enabling accurate, efficient retrieval. Our primary contribution is the first systematic application of unsupervised LLM embeddings to PCR, jointly addressing long-context modeling and low-resource adaptability. This work establishes a scalable, annotation-light paradigm for legal information retrieval.

📝 Abstract
In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. State-of-the-art deep learning IR methods have not been successful in PCR due to two key challenges: (i) lengthy legal texts; powerful BERT-based transformer models impose a limit on input length, which inevitably requires shortening the input via truncation or division, with a loss of legal context information; (ii) lack of legal training data; due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively. In this research, we address these challenges by leveraging LLM-based text embedders in PCR. LLM-based embedders support longer input lengths, and since we use them in an unsupervised manner, they do not require training data, addressing both challenges simultaneously. In this paper, we evaluate state-of-the-art LLM-based text embedders on four PCR benchmark datasets and show that they outperform BM25 and supervised transformer-based models.
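The retrieval setup the abstract describes — embed the query case and every candidate case with an LLM-based embedder, then rank candidates by vector similarity — can be sketched as follows. This is a minimal illustration: the `embed` function below is a hashing-based stand-in, since the summary does not name the actual embedding models the paper uses; a real system would call a long-context LLM embedder at that point.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for an LLM-based text embedder (hypothetical).

    Maps a document to a unit-length dense vector. A real system would
    replace this with a long-context embedding model so that lengthy
    legal texts are encoded without truncation.
    """
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(query: str, candidates: list[str], k: int = 3) -> list[int]:
    """Rank candidate cases by cosine similarity to the query embedding.

    Unsupervised: no labels or fine-tuning are involved; ranking is purely
    by similarity in the embedding space.
    """
    q = embed(query)
    # Vectors are unit-length, so the dot product equals cosine similarity.
    sims = np.array([embed(c) @ q for c in candidates])
    return list(np.argsort(-sims)[:k])
```

Because no training step is involved, the same pipeline applies to any PCR dataset regardless of how few labeled query-candidate pairs exist.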
Problem

Research questions and friction points this paper is trying to address.

Retrieving relevant prior cases efficiently from massive legal databases
Overcoming lengthy legal text limitations in BERT-based models
Addressing lack of legal training data for deep learning models
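The truncation problem in the second bullet can be made concrete: BERT-style encoders typically accept on the order of 512 subword tokens, so a long judgment must either be cut off or split into independent windows, and any reasoning that spans window boundaries is lost. A minimal sketch of the split step, using whitespace tokens as a rough proxy for subwords:

```python
def chunk(text: str, max_tokens: int = 512) -> list[str]:
    """Split a long document into windows that fit a BERT-style input limit.

    Each window is encoded independently, so legal context that crosses a
    window boundary is invisible to the encoder.
    """
    tokens = text.split()  # whitespace tokens as a rough proxy for subwords
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

# A judgment of 1200 tokens needs 3 windows of at most 512 tokens each.
doc = ("word " * 1200).strip()
pieces = chunk(doc)
```

LLM-based embedders with longer context windows avoid this split entirely, which is the paper's motivation for using them.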
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based embedders for lengthy legal texts
Unsupervised approach eliminates training data need
Outperforms BM25 and supervised transformer models
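The benchmark comparison in the last bullet is a retrieval evaluation; a standard IR metric such as recall@k — shown here purely as an illustration, since this summary does not list the paper's exact metrics — is computed as:

```python
def recall_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the relevant prior cases that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for idx in ranked[:k] if idx in relevant)
    return hits / len(relevant)
```

Comparing systems then amounts to averaging such per-query scores over a benchmark for the LLM-embedder ranking, BM25, and the supervised transformer baselines.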