RAGSynth: Synthetic Data for Robust and Faithful RAG Component Optimization

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses two critical bottlenecks in RAG systems: (1) poor robustness of retrievers to logically complex or under-specified queries, and (2) low factual consistency of generators. To this end, we propose the first joint synthetic data modeling framework that simultaneously enhances retrieval robustness and generation faithfulness. Methodologically, we introduce a domain-aware, controllable rule–LLM collaborative synthesis paradigm enabling multi-granularity citation injection and fine-grained annotation; we further design SynthBench—a comprehensive benchmark covering diverse domains, single- and multi-hop reasoning, and varied logical structures. Our contributions are threefold: (1) significant improvements in retrieval recall robustness for complex queries and generation factual accuracy; (2) consistent end-to-end performance gains across multiple RAG paradigms (e.g., dense, sparse, hybrid); and (3) empirical validation of strong cross-domain generalization capability.

Technology Category

Application Category

📝 Abstract
RAG can enhance the performance of LLMs on knowledge-intensive tasks. Various RAG paradigms, including vanilla, planning-based, and iterative RAG, are built upon 2 cores: the retriever, which should robustly select relevant documents across complex queries, and the generator, which should faithfully synthesize responses. However, existing retrievers rely heavily on public knowledge and struggle with queries of varying logical complexity and clue completeness, while generators frequently face fidelity problems. In this work, we introduce RAGSynth, a framework that includes a data construction modeling and a corresponding synthetic data generation implementation, designed to optimize retriever robustness and generator fidelity. Additionally, we present SynthBench, a benchmark encompassing 8 domain-specific documents across 4 domains, featuring diverse query complexities, clue completeness, and fine-grained citation granularity. Leveraging RAGSynth, we generate a large-scale synthetic dataset, including single and multi-hop. Extensive experiments demonstrate that the synthetic data significantly improves the robustness of the retrievers and the fidelity of the generators. Additional evaluations confirm that RAGSynth can also generalize well across different domains. By integrating the optimized retrievers into various RAG paradigms, we consistently observe enhanced RAG system performance. We have open-sourced the implementation on https://github.com/EachSheep/RAGSynth.
Problem

Research questions and friction points this paper is trying to address.

Optimizing retriever robustness for complex queries
Improving generator fidelity in RAG systems
Addressing domain-specific knowledge gaps in RAG
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic data generation for RAG optimization
Benchmark with diverse query complexities
Improves retriever robustness and generator fidelity
🔎 Similar Papers
No similar papers found.
H
Haiyang Shen
Institute for Artificial Intelligence, Peking University; School of Computer Science, Peking University
H
Hang Yan
The Chinese University of Hong Kong
Z
Zhongshi Xing
School of Computer Science, Sun Yat-sen University
Mugeng Liu
Mugeng Liu
Peking University
WebAssemblyAI for SEAI for System
Y
Yue Li
School of Software & Microelectronics, Peking University
Z
Zhiyang Chen
Institute for Artificial Intelligence, Peking University; School of Computer Science, Peking University
Y
Yuxiang Wang
School of Computer Science, Peking University
J
Jiuzheng Wang
School of Computer Science, Peking University
Yun Ma
Yun Ma
Assistant Professor, Peking University
WebMobile ComputingSoftware EngineeringService