🤖 AI Summary
Existing RAG evaluation benchmarks underrepresent realistic multi-turn dialogue scenarios and lack fine-grained metrics for factual accuracy. To address this, we propose RAGAPHENE, a human-in-the-loop annotation framework for multi-turn RAG dialogues. It integrates retrieval-augmented generation, dynamic dialogue state tracking, and interactive human editing, enabling annotators to simulate authentic user intent evolution, knowledge refinement, and error correction, thereby producing high-fidelity, factually controllable multi-turn dialogue data. Its core innovation lies in embedding human judgment deeply into the dialogue construction loop, facilitating granular assessment of LLMs along dimensions including factual consistency, retrieval dependency, and response coherence. To date, RAGAPHENE has supported approximately 40 annotators in curating thousands of high-quality multi-turn RAG dialogues, yielding the first open-source benchmark dataset explicitly designed for factual accuracy evaluation.
📝 Abstract
Retrieval-Augmented Generation (RAG) is a key technique for conversing with Large Language Models (LLMs) when factual correctness matters. LLMs may produce answers that appear correct but contain hallucinated information. Building benchmarks that evaluate LLMs on multi-turn RAG conversations has therefore become an increasingly important task, and simulating real-world conversations is vital for producing high-quality evaluation benchmarks. We present RAGAPHENE, a chat-based annotation platform that enables annotators to simulate real-world conversations for benchmarking and evaluating LLMs. RAGAPHENE has been successfully used by approximately 40 annotators to build thousands of real-world conversations.
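To make the human-in-the-loop annotation flow concrete, here is a minimal sketch of one possible turn-by-turn loop: retrieve passages for the user's message, draft a grounded answer, then let the annotator accept or edit the draft before it is recorded. All names (`Turn`, `annotate_turn`, `edit_fn`) and the toy retriever/generator are assumptions for illustration, not RAGAPHENE's actual API.

```python
# Hypothetical sketch of a RAGAPHENE-style annotation loop. The structure
# is an assumption based on the description above, not the real platform.
from dataclasses import dataclass, field

@dataclass
class Turn:
    user: str           # the simulated user's message
    passages: list      # passages retrieved for this message
    draft: str          # model-drafted answer grounded on the passages
    final: str          # annotator-approved (possibly edited) answer

@dataclass
class Conversation:
    turns: list = field(default_factory=list)

def retrieve(query, corpus, k=2):
    """Toy keyword-overlap retriever standing in for a real search index."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q & set(p.lower().split())))
    return scored[:k]

def draft_answer(query, passages):
    """Stub generator: a real system would call an LLM grounded on passages."""
    return f"Based on {len(passages)} passage(s): answer to '{query}'"

def annotate_turn(conv, user_msg, corpus, edit_fn=None):
    passages = retrieve(user_msg, corpus)
    draft = draft_answer(user_msg, passages)
    # Human-in-the-loop step: the annotator can accept or rewrite the draft,
    # which is where factual errors and hallucinations get corrected.
    final = edit_fn(draft) if edit_fn else draft
    conv.turns.append(Turn(user_msg, passages, draft, final))
    return conv

corpus = [
    "RAG grounds answers in retrieved text",
    "Hallucinations are unsupported claims",
    "Benchmarks measure factual accuracy",
]
conv = Conversation()
annotate_turn(conv, "what is RAG", corpus)  # annotator accepts the draft
annotate_turn(conv, "why do benchmarks matter", corpus,
              edit_fn=lambda d: d + " [edited by annotator]")
```

Each recorded `Turn` keeps both the model draft and the annotator's final answer, so a benchmark built from these conversations can score models on factual consistency against human-verified responses.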