🤖 AI Summary
Problem: Legal AI lacks high-fidelity, reasoning-driven RAG evaluation benchmarks, which hinders the development of domain-specific models. Method: We construct two real-world legal RAG benchmarks, Bar Exam QA and Housing Statute QA, that emphasize the tight coupling of legal retrieval with causal reasoning and an understanding of how statutes relate to one another. We introduce the first legal-reasoning-oriented retrieval evaluation paradigm, with annotation protocols and task design informed by legal practice. Using human-annotated question-answer pairs, we systematically evaluate mainstream retrievers (ColBERT, ANCE, BM25) by top-10 recall. Results: All evaluated models score below 45% top-10 recall, exposing critical weaknesses in legal semantic and inferential reasoning. This work fills a key gap by establishing the first reasoning-centric legal RAG benchmark, providing a reproducible evaluation framework and concrete optimization directions for legal-domain retrieval models.
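The top-10 recall metric mentioned in the summary can be illustrated with a minimal sketch. The source does not say whether recall is the fraction of gold passages retrieved or a simple hit rate, so the version below assumes the former, averaged over queries. The function name `recall_at_k`, the dictionary layout, and the toy passage IDs are all hypothetical and are not taken from the benchmark's code or data.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean over queries of |gold passages in top-k| / |gold passages|.

    ranked: query ID -> passage IDs ordered best-first by some retriever.
    gold:   query ID -> set of passage IDs annotated as relevant.
    """
    scores = []
    for qid, relevant in gold.items():
        if not relevant:
            continue  # a query with no gold passages has undefined recall
        top_k = set(ranked.get(qid, [])[:k])
        scores.append(len(top_k & relevant) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (assumed, for illustration only).
toy_ranked = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2"]}
toy_gold = {"q1": {"p1", "p4"}, "q2": {"p2"}}

toy_recall_at_10 = recall_at_k(toy_ranked, toy_gold, k=10)
toy_recall_at_1 = recall_at_k(toy_ranked, toy_gold, k=1)
```

On the toy data, `q1` recovers one of its two gold passages and `q2` recovers its single gold passage within the top 10, while neither query has a gold passage at rank 1. A hit-rate variant would instead score each query 1 if any gold passage appears in the top k.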
📝 Abstract
As the legal community increasingly examines the use of large language models (LLMs) for various legal applications, legal AI developers have turned to retrieval-augmented LLMs ("RAG" systems) to improve system performance and robustness. An obstacle to the development of specialized RAG systems is the lack of realistic legal RAG benchmarks that capture the complexity of both legal retrieval and downstream legal question-answering. To address this, we introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA. Our tasks correspond to real-world legal research tasks and were produced through annotation processes that resemble legal research. We describe the construction of these benchmarks and the performance of existing retriever pipelines. Our results suggest that legal RAG remains a challenging application, motivating future research.