Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)

2026-05-14
AI Summary
Current retrieval-augmented generation (RAG) systems lack comprehensive and effective evaluation methodologies due to the inherent randomness in generation and the complex interactions between retrieval and generation components. This work proposes Deepchecks, a novel framework that establishes the first end-to-end evaluation system tailored specifically for RAG. Integrating multidimensional metrics, root-cause analysis, and production-level monitoring, Deepchecks enables continuous quality assurance throughout the entire lifecycle, from development to deployment. The framework supports customizable evaluation pipelines adapted to specific application scenarios and delivers interpretable, actionable diagnostic insights. By doing so, it significantly enhances the trustworthiness and practical utility of RAG systems in high-stakes domains such as healthcare and finance, where reliability is paramount.
๐Ÿ“ Abstract
Large Language Models (LLMs) augmented with Retrieval-Augmented Generation (RAG) techniques are revolutionizing applications across multiple domains, such as healthcare, finance, and customer service. Despite their potential, evaluating RAG systems remains a complex challenge due to the stochastic nature of generated outputs and the intricate interplay between retrieval and generation components. This paper introduces Deepchecks, a comprehensive framework tailored for evaluating RAG applications. The framework addresses RAG evaluation through three components: a multi-faceted evaluation approach, root cause analysis, and production monitoring. By ensuring alignment with application-specific requirements, the Deepchecks framework provides a robust foundation for assessing reliability, relevance, and user satisfaction in RAG systems.
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
RAG evaluation
Large Language Models
stochastic outputs
retrieval-generation interplay
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
RAG evaluation
Deepchecks
root cause analysis
production monitoring
Authors

Assaf Gerner
Deepchecks, Ramat Gan, Israel
Netta Madvil
Deepchecks, Ramat Gan, Israel
Nadav Barak
Deepchecks, Ramat Gan, Israel
Alex Zaikman
Deepchecks, Ramat Gan, Israel
Jonatan Liberman
Deepchecks, Ramat Gan, Israel
Liron Hamra
Deepchecks, Ramat Gan, Israel
Rotem Brazilay
Deepchecks, Ramat Gan, Israel
Shay Tsadok
Deepchecks, Ramat Gan, Israel
Yaron Friedman
Deepchecks, Ramat Gan, Israel
Neal Harow
Deepchecks, Ramat Gan, Israel
Noam Bresler
Deepchecks, Ramat Gan, Israel
Shir Chorev
Deepchecks, Ramat Gan, Israel
Philip Tannor
Deepchecks, Ramat Gan, Israel
Lior Rokach
Ben-Gurion University of the Negev
Big Data Analytics, Machine Learning, Recommender Systems, Cyber Security, Biomedical Data Science