A Reasoning-Focused Legal Retrieval Benchmark

📅 2025-03-25

🏛️ Symposium on Computer Science and Law

📈 Citations: 2

✨ Influential: 1

🤖 AI Summary

Problem: Legal AI lacks high-fidelity, reasoning-driven RAG evaluation benchmarks, which hinders the development of domain-specific models. Method: We construct two real-world legal RAG benchmarks, Bar Exam QA and Housing Statute QA, that emphasize the deep coupling of legal retrieval with causal reasoning and an understanding of how statutes interrelate. We introduce the first legal-reasoning-oriented retrieval evaluation paradigm, featuring practice-informed annotation protocols and task design. Using human-annotated question-answer pairs, we systematically evaluate mainstream retrievers (ColBERT, ANCE, BM25) by measuring top-10 recall. Results: All evaluated models achieve below 45% top-10 recall, exposing critical weaknesses in legal semantic and inferential reasoning. This work fills a key gap by providing a reproducible evaluation framework and concrete optimization directions for legal-domain retrieval model development.
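The top-10 recall metric mentioned in the summary can be illustrated with a short sketch. This is not the paper's evaluation code: the function name `recall_at_k`, the data layout, and the passage IDs are all hypothetical, and the sketch assumes recall is computed per query as the share of annotated gold passages found in the top k results, then averaged over queries. The paper may define the metric differently (for example, as whether any gold passage appears in the top k).

```python
from typing import Dict, List, Set


def recall_at_k(
    rankings: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean per-query recall@k: share of gold passages found in the top k.

    Assumed definition for illustration; not taken from the paper.
    """
    scores = []
    for query_id, relevant in gold.items():
        if not relevant:
            continue  # skip queries with no annotated gold passage
        # A query missing from `rankings` counts as retrieving nothing.
        top_k = set(rankings.get(query_id, [])[:k])
        scores.append(len(top_k & relevant) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data with hypothetical IDs (not drawn from the benchmarks).
rankings = {
    "q1": ["p3", "p7", "p1"],  # retriever output, best match first
    "q2": ["p9", "p2", "p4"],
}
gold = {"q1": {"p1"}, "q2": {"p5", "p2"}}

print(recall_at_k(rankings, gold, k=10))  # 0.75
print(recall_at_k(rankings, gold, k=2))   # 0.25
```

With k=10, q1 recovers its single gold passage (1.0) and q2 recovers one of two (0.5), giving a mean of 0.75. Cutting to k=2 drops q1 to 0.0 because its gold passage is ranked third.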

📝 Abstract

As the legal community increasingly examines the use of large language models (LLMs) for various legal applications, legal AI developers have turned to retrieval-augmented LLMs ("RAG" systems) to improve system performance and robustness. An obstacle to the development of specialized RAG systems is the lack of realistic legal RAG benchmarks that capture the complexity of both legal retrieval and downstream legal question-answering. To address this, we introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA. Our tasks correspond to real-world legal research tasks, and were produced through annotation processes that resemble legal research. We describe the construction of these benchmarks and the performance of existing retriever pipelines. Our results suggest that legal RAG remains a challenging application, thus motivating future research.

Problem

Research questions and friction points this paper is trying to address.

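Realistic legal RAG benchmarks that capture the complexity of legal retrieval and downstream question-answering are lacking

Specialized RAG systems are needed to improve the performance and robustness of legal question-answering

Legal RAG remains a challenging application, motivating future research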

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces two legal RAG benchmarks, Bar Exam QA and Housing Statute QA, for realistic evaluation

Captures the complexity of both legal retrieval and downstream legal question-answering

Evaluates the performance of existing retriever pipelines on both benchmarks
