LARGE: Legal Retrieval Augmented Generation Evaluation Tool

📅 2025-04-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of evaluating how multiple interdependent components jointly affect judicial question-answering performance in legal-domain RAG systems. To this end, we propose LRAGE—the first open-source, multilingual, full-pipeline explainable evaluation framework for legal RAG. LRAGE supports both GUI and CLI interfaces and systematically decouples five core components: retrieval corpus, retrieval algorithm, re-ranker, LLM backbone, and evaluation metrics—enabling cross-jurisdictional benchmarking (Chinese, English, Korean) and component-wise attribution analysis. Leveraging mainstream tools—including Elasticsearch/FAISS for retrieval, BERT-based re-rankers, and Llama/Qwen LLMs—we quantitatively assess each component’s impact on accuracy across three legal benchmarks: LegalBench, LawBench, and KBL. Experiments demonstrate significant improvements in RAG optimization efficiency and deployment reliability for judicial applications. The framework is publicly available under an open-source license.

Technology Category

Application Category

📝 Abstract
Recently, building retrieval-augmented generation (RAG) systems to enhance the capability of large language models (LLMs) has become a common practice. Especially in the legal domain, previous judicial decisions play a significant role under the doctrine of stare decisis which emphasizes the importance of making decisions based on (retrieved) prior documents. However, the overall performance of RAG system depends on many components: (1) retrieval corpora, (2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces to facilitate seamless experiments and investigate how changes in the aforementioned five components affect the overall accuracy. We validated LRAGE using multilingual legal benches including Korean (KBL), English (LegalBench), and Chinese (LawBench) by demonstrating how the overall accuracy changes when varying the five components mentioned above. The source code is available at https://github.com/hoorangyee/LRAGE.
Problem

Research questions and friction points this paper is trying to address.

Evaluates legal RAG systems' performance across components
Assesses impact of retrieval algorithms on legal accuracy
Validates multilingual legal benchmarks for RAG evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source tool for legal RAG evaluation
GUI and CLI interfaces for seamless experiments
Multilingual validation with legal benchmarks
🔎 Similar Papers
No similar papers found.