Retrieval Augmented Generation Evaluation for Health Documents

📅 2025-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address information overload, hallucination, and low credibility in LLM applications for medical and scientific documents, this paper introduces RAGEv—the first end-to-end, trustworthiness-oriented RAG evaluation framework tailored to the health domain. Methodologically, it integrates ColBERTv2 for semantic retrieval, Llama-3/Mistral for generation, citation tracing, answer veracity verification, and a multi-granularity traceability assessment protocol. Its contributions are threefold: (1) a standardized evaluation toolkit and benchmark dataset, RAGEv-Bench; (2) the first holistic RAG evaluation paradigm jointly optimizing factual accuracy and provenance traceability, filling a critical gap in systematic validation of healthcare RAG systems; and (3) empirical validation demonstrating high accuracy on both short- and long-answer tasks, substantial hallucination suppression, and suitability for high-stakes applications such as clinical decision support and health policy analysis—though cross-document consistency remains an open challenge.
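The retrieve → generate → verify loop described above can be sketched as follows. This is a minimal illustration of the pattern, not the paper's implementation: the function names are hypothetical, and the paper's actual components (ColBERTv2 for retrieval, Llama-3/Mistral for generation) are stubbed with trivial stand-ins.

```python
def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank documents by token overlap with the query (toy stand-in for ColBERTv2)."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def generate(query: str, evidence: list[str]) -> str:
    """Template 'generator' that cites its evidence (stand-in for Llama-3/Mistral)."""
    return f"Based on [1], {evidence[0]}"

def verify(answer: str, evidence: list[str]) -> bool:
    """Veracity check: the cited passage must appear verbatim in the answer."""
    return all(doc in answer for doc in evidence)

corpus = [
    "aspirin reduces fever in adults",
    "vitamin c supports immune function",
]
docs = retrieve("does aspirin reduce fever", corpus)
answer = generate("does aspirin reduce fever", docs)
print(verify(answer, docs))  # True: the generated answer is grounded in the retrieved text
```

The design point this illustrates is that verification runs against the retrieved evidence, not the model's parametric knowledge, which is what allows hallucinated claims to be caught as provenance failures.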

📝 Abstract
Safe and trustworthy use of Large Language Models (LLMs) in the processing of healthcare documents and scientific papers could substantially help clinicians, scientists, and policymakers overcome information overload and focus on the most relevant information at a given moment. Retrieval Augmented Generation (RAG) is a promising method for leveraging the potential of LLMs while enhancing the accuracy of their outputs. This report assesses the potential and shortcomings of such approaches for the automatic knowledge synthesis of different types of documents in the health domain. To this end, it describes: (1) an internally developed proof-of-concept pipeline, RAGEv (Retrieval Augmented Generation Evaluation), that employs state-of-the-art practices to deliver safe and trustworthy analysis of healthcare documents and scientific papers; (2) a set of evaluation tools for LLM-based document retrieval and generation; (3) a benchmark dataset, RAGEv-Bench, to verify the accuracy and veracity of the results. It concludes that careful implementations of RAG techniques can minimize most of the common problems in the use of LLMs for document processing in the health domain, obtaining very high scores on both short yes/no answers and long answers. There is high potential for incorporating it into day-to-day policy support work, but additional effort is required to obtain a consistent and trustworthy tool.
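The abstract reports scores on two answer formats: short yes/no answers and long free-text answers. A common way to score this split (an assumption for illustration — the report's exact metrics are not stated here) is exact match for the short answers and token-level F1 for the long ones:

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """Score short yes/no answers: 1.0 only on an exact (case-insensitive) match."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """Score long answers by token overlap (SQuAD-style F1)."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())  # multiset intersection
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Yes", "yes"))  # 1.0
print(token_f1("aspirin lowers fever", "aspirin reduces fever in adults"))  # 0.5
```

F1 rather than exact match is the usual choice for long answers because a paraphrased but correct synthesis should still receive partial credit.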
Problem

Research questions and friction points this paper is trying to address.

Evaluating RAG for safe LLM use in healthcare documents
Assessing accuracy of LLM-based health knowledge synthesis
Developing tools to verify RAG performance in medical contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

RAG enhances LLM accuracy for health documents
RAGEv pipeline ensures safe healthcare document analysis
RAGEv-Bench verifies result accuracy and veracity