🤖 AI Summary
Financial RAG systems suffer from low factual accuracy, and existing QA benchmarks evaluate retrieval insufficiently because they rely on predefined contexts that fail to reflect realistic retrieval scenarios. To address this, we introduce FinDER, an expert-annotated, domain-specific benchmark for evaluating financial RAG systems, comprising 5,703 query–evidence–answer triplets derived from real user queries, with evidence annotated by financial experts and authoritative reference answers. It explicitly covers domain-specific challenges such as abbreviations, lexical ambiguity, and elliptical expressions. Crucially, FinDER adopts a collaborative annotation paradigm in which financial domain experts identify relevant evidence directly from large-scale unstructured corpora, eliminating reliance on pre-specified contexts and emphasizing end-to-end retrieval-and-generation capability. Using this benchmark, we systematically evaluate state-of-the-art retrieval models and large language models, uncovering critical weaknesses in financial terminology comprehension, relevant evidence recall, and factual consistency. FinDER thus provides a reproducible, realistic evaluation framework and concrete directions for advancing trustworthy financial RAG research.
📝 Abstract
In the fast-paced financial domain, accurate and up-to-date information is critical to addressing ever-evolving market conditions. Retrieving this information correctly is essential in financial Question-Answering (QA), since many language models struggle with factual accuracy in this domain. We present FinDER, an expert-generated dataset tailored for Retrieval-Augmented Generation (RAG) in finance. Unlike existing QA datasets that provide predefined contexts and rely on relatively clear and straightforward queries, FinDER focuses on search-relevant evidence annotated by domain experts, offering 5,703 query–evidence–answer triplets derived from real-world financial inquiries. These queries frequently include abbreviations, acronyms, and concise expressions, capturing the brevity and ambiguity common in the realistic search behavior of professionals. By challenging models to retrieve relevant information from large corpora rather than relying on readily determined contexts, FinDER offers a more realistic benchmark for evaluating RAG systems. We further present a comprehensive evaluation of multiple state-of-the-art retrieval models and Large Language Models, highlighting the challenges surfaced by this realistic benchmark to drive future research on truthful and precise RAG in the financial domain.
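To make the evaluation setup concrete, the sketch below shows how a query–evidence–answer triplet and a retrieval metric such as Recall@k might look in code. This is a minimal illustration, not the released format: the field names (`query`, `evidence_ids`, `answer`), the example passage IDs, and the sample query are all assumptions for demonstration.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one FinDER-style example; field names
# and contents are illustrative assumptions, not the released schema.
@dataclass
class FinderTriplet:
    query: str                              # terse, abbreviation-heavy user question
    evidence_ids: list = field(default_factory=list)  # expert-annotated passage IDs
    answer: str = ""                        # authoritative reference answer

def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold evidence passages found among the top-k retrieved."""
    top_k = set(retrieved_ids[:k])
    return sum(1 for g in gold_ids if g in top_k) / len(gold_ids)

# Toy example: the retriever must surface expert-annotated evidence for a
# brief, ambiguous query ("FCF" = free cash flow); IDs are made up.
example = FinderTriplet(
    query="AAPL FCF trend FY23?",
    evidence_ids=["10k-p41", "10k-p42"],
    answer="Free cash flow grew year over year.",  # placeholder
)
retrieved = ["10k-p41", "news-113", "10k-p07", "10k-p42"]
print(recall_at_k(retrieved, example.evidence_ids, k=3))  # → 0.5
```

Because the gold evidence is annotated against the full corpus rather than a pre-supplied context, a metric like this measures end-to-end retrieval quality, which is exactly the capability the benchmark stresses.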