🤖 AI Summary
Existing RAG benchmarks rely on public datasets, limiting their ability to evaluate retrieval and generation performance in private, personalized document settings. To address this gap, we introduce EnronQA, the first benchmark for personalized RAG evaluation on private documents, constructed from 103,638 authentic corporate emails spanning 150 user inboxes and paired with 528,304 high-quality question-answer pairs. The benchmark introduces three components: (i) user-level annotation, (ii) context-aware retrieval evaluation, and (iii) an end-to-end RAG reasoning analysis framework, enabling systematic investigation of the memorization-retrieval trade-off in private-document settings. Experimental results show that personalized retrieval significantly improves QA accuracy, whereas excessive reliance on parametric memory degrades robustness. EnronQA provides a reproducible, fine-grained empirical foundation for designing, evaluating, and optimizing private RAG systems.
📝 Abstract
Retrieval Augmented Generation (RAG) has become one of the most popular methods for bringing knowledge-intensive context to large language models (LLMs) because of its ability to inject local context at inference time without the cost or data-leakage risks associated with fine-tuning. This clear separation of private information from LLM training has made RAG the basis for many enterprise LLM workloads, as it allows a company to augment an LLM's understanding using customers' private documents. Despite this popularity in enterprise deployments over private documents, current benchmarks for validating and optimizing RAG pipelines draw their corpora from public data such as Wikipedia or generic web pages and offer little to no personal context. Seeking to empower more personal and private RAG, we release the EnronQA benchmark, a dataset of 103,638 emails with 528,304 question-answer pairs across 150 different user inboxes. EnronQA enables better benchmarking of RAG pipelines over private data and allows for experimentation with personalized retrieval settings over realistic data. Finally, we use EnronQA to explore the trade-off between memorization and retrieval when reasoning over private documents.
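To make the retrieve-then-read pattern the abstract describes concrete, here is a minimal, illustrative sketch of RAG over a private inbox. This is not the EnronQA pipeline: the toy emails, the tokenizer, and the bag-of-words cosine scoring are all simplified placeholders standing in for a real retriever, and the final LLM call is omitted.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and strip basic punctuation; a stand-in for a real tokenizer."""
    return [t.strip(".,:;!?") for t in text.lower().split()]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, inbox, k=1):
    """Return the k emails most lexically similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        inbox,
        key=lambda doc: cosine(q, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical private "inbox" of two emails (illustrative data only).
inbox = [
    "Reminder: the Q3 budget review meeting moves to Friday at 10am.",
    "Lunch order for the offsite: please reply with your sandwich choice.",
]

# In a full RAG pipeline, the retrieved email(s) would be prepended to the
# LLM prompt at inference time; here we only show the retrieval step.
context = retrieve("When is the budget review meeting?", inbox, k=1)
print(context[0])
```

The key property the abstract highlights is that the private documents never enter model training; they are only surfaced at inference time through a retrieval step like the one sketched here.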