A Reasoning-Focused Legal Retrieval Benchmark

📅 2025-03-25
🏛️ Symposium on Computer Science and Law
📈 Citations: 2
Influential: 1
📄 PDF
🤖 AI Summary
Legal AI lacks realistic, reasoning-driven benchmarks for retrieval-augmented generation (RAG), hindering the development of domain-specific models. Method: the authors construct two real-world legal RAG benchmarks, Bar Exam QA and Housing Statute QA, which emphasize how tightly legal retrieval is coupled with legal reasoning and with understanding how statutes interrelate. They introduce a legal-reasoning-oriented retrieval evaluation paradigm, with practice-informed annotation protocols and task designs that mirror real legal research. Using human-annotated question-answer pairs, they systematically evaluate mainstream retrievers (BM25, ColBERT, ANCE), measuring top-10 recall. Results: all evaluated models fall below 45% top-10 recall, exposing critical weaknesses in legal semantic and inferential retrieval. The work fills a key gap by establishing a reasoning-centric legal RAG benchmark, providing a reproducible evaluation framework and concrete optimization directions for legal-domain retrieval models.
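To make the evaluation setup concrete, the sketch below shows how top-10 recall can be computed for a lexical retriever. It is a minimal illustration using the rank_bm25 package on a toy statute corpus; the corpus, query, and gold labels are hypothetical assumptions, not the paper's data or code.

from rank_bm25 import BM25Okapi

def recall_at_k(retrieved_ids, gold_ids, k=10):
    # Fraction of gold passages that appear among the top-k retrieved passages.
    top_k = set(retrieved_ids[:k])
    return sum(1 for g in gold_ids if g in top_k) / len(gold_ids)

# Toy statute corpus; ids and text are invented for illustration.
corpus = [
    {"id": "stat-001", "text": "A landlord must give written notice before entering the unit."},
    {"id": "stat-002", "text": "Security deposits must be returned within 30 days of move-out."},
    {"id": "stat-003", "text": "A tenant may withhold rent when the unit is uninhabitable."},
]
bm25 = BM25Okapi([doc["text"].lower().split() for doc in corpus])

query = "How long does a landlord have to return a security deposit?"
scores = bm25.get_scores(query.lower().split())
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
retrieved_ids = [corpus[i]["id"] for i in ranking]

print(recall_at_k(retrieved_ids, gold_ids={"stat-002"}))  # 1.0 on this toy example

Dense retrievers such as ColBERT or ANCE would replace the BM25 scoring step but feed the same recall computation.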

📝 Abstract
As the legal community increasingly examines the use of large language models (LLMs) for various legal applications, legal AI developers have turned to retrieval-augmented LLMs ("RAG" systems) to improve system performance and robustness. An obstacle to the development of specialized RAG systems is the lack of realistic legal RAG benchmarks which capture the complexity of both legal retrieval and downstream legal question-answering. To address this, we introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA. Our tasks correspond to real-world legal research tasks, and were produced through annotation processes which resemble legal research. We describe the construction of these benchmarks and the performance of existing retriever pipelines. Our results suggest that legal RAG remains a challenging application, thus motivating future research.
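For readers unfamiliar with the RAG setup the abstract describes, here is a minimal, hypothetical retrieve-then-answer sketch: rank passages for a legal question, then ground an LLM prompt in the retrieved statutes. The toy overlap scorer, corpus, and prompt wording are assumptions for illustration; the paper instead evaluates real retrievers (e.g., BM25, ColBERT, ANCE) on its annotated benchmarks.

def overlap_score(question, passage):
    # Toy relevance score: number of shared lowercase tokens.
    return len(set(question.lower().split()) & set(passage["text"].lower().split()))

def build_prompt(question, passages):
    # Ground the LLM's answer in the retrieved statutory text.
    context = "\n\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer the legal question using only the statutes below.\n\n"
        f"Statutes:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

corpus = [
    {"id": "stat-001", "text": "A landlord must give written notice before entering the unit."},
    {"id": "stat-002", "text": "Security deposits must be returned within 30 days of move-out."},
]
question = "How long does a landlord have to return a security deposit?"
top_passages = sorted(corpus, key=lambda p: overlap_score(question, p), reverse=True)[:10]
print(build_prompt(question, top_passages))  # this prompt would then be sent to the LLM of choice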
Problem

Research questions and friction points this paper is trying to address.

Lack of realistic legal RAG benchmarks that capture the complexity of legal retrieval and question answering
Need to improve retrieval and downstream question-answering performance in the legal domain
Legal RAG remains a challenging application, motivating future research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces two legal RAG benchmarks (Bar Exam QA, Housing Statute QA) for realistic evaluation
Captures the complexity of both legal retrieval and downstream question answering
Benchmarks the performance of existing retriever pipelines
Lucia Zheng
Stanford University
Neel Guha
Stanford Computer Science, Stanford Law
artificial intelligence, machine learning, torts, civil procedure, regulatory policy
Javokhir Arifov
Stanford University
Sarah Zhang
Stanford University
Michal Skreta
Stanford University
Christopher D. Manning
Stanford University
Peter Henderson
Princeton University
Machine Learning, Law
Daniel E. Ho
Stanford University
Regulatory policy, artificial intelligence, administrative law, antidiscrimination