Fine-grained Claim-level RAG Benchmark for Law

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of fine-grained, multilingual, and user-diverse evaluation benchmarks for Retrieval-Augmented Generation (RAG) systems in the legal domain. The authors introduce ClaimRAG-LAW, the first claim-level RAG benchmark tailored to legal applications, covering English and French, both expert and non-expert users, and a wide range of real-world legal questions. Using a modular evaluation framework, the study diagnoses performance bottlenecks at the retrieval, generation, and claim verification stages of current legal RAG systems. It thereby addresses two shortcomings of existing benchmarks, their limited granularity and their English-only, expert-centered coverage, and provides the community with empirical insights and a standardized toolkit for future research.

📝 Abstract

The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stakes domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite this need, existing evaluation frameworks for legal RAG systems lack the granularity required to analyze retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework to state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.
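
To make the idea of claim-level evaluation concrete, here is a minimal sketch of how retrieval and generation could be scored separately once a response and a reference answer have been broken into atomic claims. It is an illustration under stated assumptions, not the paper's method: the names `ClaimScores` and `score_claims` are hypothetical, the three metrics are one plausible choice, the example claims are made up, and the normalized string matching stands in for whatever entailment or verification model a real framework would use.

```python
from dataclasses import dataclass


@dataclass
class ClaimScores:
    precision: float           # share of response claims backed by the reference answer
    recall: float              # share of reference claims present in the response
    retrieval_coverage: float  # share of reference claims found in the retrieved passages


def _norm(text: str) -> str:
    # Assumption: lowercase + whitespace collapse is a toy stand-in for a real
    # claim-matching or entailment model.
    return " ".join(text.lower().split())


def score_claims(response_claims, gold_claims, retrieved_passages) -> ClaimScores:
    """Score generation (precision/recall) and retrieval (coverage) separately."""
    response = {_norm(c) for c in response_claims}
    gold = {_norm(c) for c in gold_claims}
    passages = [_norm(p) for p in retrieved_passages]

    matched = response & gold
    precision = len(matched) / len(response) if response else 0.0
    recall = len(matched) / len(gold) if gold else 0.0

    # A reference claim counts as retrievable if it appears verbatim in any passage.
    covered = [g for g in gold if any(g in p for p in passages)]
    coverage = len(covered) / len(gold) if gold else 0.0

    return ClaimScores(precision, recall, coverage)


# Toy, made-up claims for illustration only (not taken from the dataset).
scores = score_claims(
    response_claims=["Claim A holds", "Claim X holds"],
    gold_claims=["Claim A holds", "Claim B holds"],
    retrieved_passages=["The source text states that claim A holds in this case."],
)
print(scores)
```

Keeping `retrieval_coverage` apart from `precision` and `recall` is what lets a low score be attributed to the retriever (the evidence never arrived) rather than to the generator (the evidence arrived but was ignored or contradicted).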

Problem

Research questions and friction points this paper is trying to address.

legal RAG

fine-grained evaluation

hallucination

multilingual benchmark

non-expert queries

Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained evaluation

legal RAG

ClaimRAG-LAW