Domain-Specific Data Generation Framework for RAG Adaptation

📅 2025-10-13
🤖 AI Summary
To address the scarcity of high-quality, domain-specific training data for Retrieval-Augmented Generation (RAG) systems in dynamic domains such as scientific research and enterprise knowledge bases, this paper proposes RAGen, a modular framework for automatically generating domain-adapted Question-Answer-Context (QAC) triples. RAGen introduces: (1) a hierarchical question-generation mechanism grounded in Bloom's Taxonomy; (2) a robustness-enhancing strategy that combines semantic chunking, hierarchical concept extraction, and curated distractor contexts; and (3) a pipeline that supports adapting the large language model, the retriever, and the embedding model, using multi-chunk retrieval and precise answer extraction. The framework processes large, evolving corpora without redundant reprocessing and adapts flexibly to different generation strategies. Experimental results demonstrate that RAGen significantly improves the response accuracy and domain adaptability of RAG systems in specialized professional settings.
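The Bloom's Taxonomy-guided question generation mentioned above can be illustrated with a minimal sketch. The six level names follow the revised Bloom's Taxonomy; the instruction wording, the function names, and the prompt layout are hypothetical and are not taken from the paper.

```python
# Cognitive levels of the revised Bloom's Taxonomy, ordered from lowest to highest.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

# Hypothetical per-level instructions; the paper's actual prompts are not shown here.
LEVEL_INSTRUCTIONS = {
    "remember": "Ask for a fact that is stated directly in the context.",
    "understand": "Ask the reader to explain the concept in their own words.",
    "apply": "Ask how the concept would be used in a concrete situation.",
    "analyze": "Ask how the concept relates to or differs from another idea in the context.",
    "evaluate": "Ask for a justified judgment about a claim or design choice.",
    "create": "Ask the reader to propose something new that builds on the concept.",
}


def build_question_prompt(concept: str, context: str, level: str) -> str:
    """Build one LLM prompt asking for a question at the given cognitive level."""
    if level not in LEVEL_INSTRUCTIONS:
        raise ValueError(f"unknown Bloom level: {level!r}")
    return (
        f"Context:\n{context}\n\n"
        f"Key concept: {concept}\n"
        f"Cognitive level: {level}\n"
        f"Instruction: {LEVEL_INSTRUCTIONS[level]}\n"
        "Write one question that can be answered from the context alone."
    )


def prompts_for_concept(concept: str, context: str, levels=BLOOM_LEVELS) -> list[str]:
    """Return one prompt per requested level, so each concept yields varied questions."""
    return [build_question_prompt(concept, context, level) for level in levels]
```

Generating one prompt per level is one simple way to obtain question diversity for a single concept; a real pipeline would pass each prompt to an LLM and filter the results.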

📝 Abstract
Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning power of large language models (LLMs) with external retrieval to enable domain-grounded responses. Effectively adapting RAG systems to domain-specific settings requires specialized, context-rich training data beyond general-purpose question-answering. Here, we propose RAGen, a scalable and modular framework for generating domain-grounded question-answer-context (QAC) triples tailored to diverse RAG adaptation approaches. RAGen produces these QAC triples by identifying key concepts in documents, generating diverse questions guided by Bloom's Taxonomy-inspired principles, and pairing them with precise answers extracted from relevant contexts. RAGen supports multiple RAG adaptation strategies, including the optimization of key components such as the LLM, the retriever, and the embedding model. Its modular pipeline features semantic chunking, hierarchical concept extraction, and multi-chunk retrieval, along with curated distractor contexts that promote robust reasoning. Designed for scalability, RAGen efficiently handles large and evolving document corpora without redundant processing, making it especially suitable for dynamically evolving domains such as scientific research and enterprise knowledge bases.
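To make the QAC triple and the role of distractor contexts concrete, here is a minimal sketch of one possible record layout. The class name, field names, and helper functions are hypothetical. The abstract describes curated distractors; the random sampling below is only a stand-in for whatever curation the framework actually performs.

```python
import random
from dataclasses import dataclass, field


@dataclass
class QACTriple:
    question: str
    answer: str
    contexts: list[str]                                    # chunks that support the answer
    distractors: list[str] = field(default_factory=list)   # chunks that do not


def add_distractors(triple: QACTriple, chunk_pool: list[str], k: int = 2, seed: int = 0) -> QACTriple:
    """Return a copy of the triple with up to k distractor chunks attached.

    Assumption: distractors are drawn at random from chunks that are not gold contexts.
    """
    rng = random.Random(seed)
    candidates = [c for c in chunk_pool if c not in triple.contexts]
    picked = rng.sample(candidates, min(k, len(candidates)))
    return QACTriple(triple.question, triple.answer, list(triple.contexts), picked)


def to_training_record(triple: QACTriple, seed: int = 0) -> dict:
    """Shuffle gold and distractor passages together and record where the gold ones land."""
    rng = random.Random(seed)
    passages = triple.contexts + triple.distractors
    rng.shuffle(passages)
    gold = [i for i, p in enumerate(passages) if p in triple.contexts]
    return {
        "question": triple.question,
        "passages": passages,
        "answer": triple.answer,
        "gold_indices": gold,
    }
```

Mixing gold and distractor passages in one shuffled list means a model trained on such records has to pick out the supporting passage itself, because position no longer gives it away.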

Problem

Research questions and friction points this paper is trying to address.

Generating domain-specific training data for RAG adaptation
Creating scalable question-answer-context triples from documents
Supporting the optimization of multiple RAG components with a modular framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates domain-specific question-answer-context triples
Uses a modular pipeline with semantic chunking and concept extraction
Supports multiple RAG adaptation strategies for optimization
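The abstract states that the framework handles evolving corpora without redundant processing. One common way to get that property is to track a content hash per document and reprocess only new or changed documents. The sketch below illustrates that general technique; the class and method names are hypothetical, and it is not a description of how RAGen itself is implemented.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class IncrementalTracker:
    """Remembers which document versions were already processed."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # doc_id -> hash of the last processed version

    def pending(self, corpus: dict[str, str]) -> list[str]:
        """Return ids of documents that are new or whose text has changed."""
        return [
            doc_id
            for doc_id, text in corpus.items()
            if self._seen.get(doc_id) != content_hash(text)
        ]

    def mark_done(self, doc_id: str, text: str) -> None:
        """Record that this version of the document has been processed."""
        self._seen[doc_id] = content_hash(text)
```

A caller would run the chunking and generation steps only for the ids returned by `pending`, then call `mark_done` for each, so unchanged documents are skipped on later runs.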

Authors

Chris Xing Tian, Peng Cheng Laboratory, Shenzhen, China
Weihao Xie, City University of Hong Kong, Hong Kong SAR
Zhen Chen, City University of Hong Kong, Hong Kong SAR
Zhengyuan Yi, City University of Hong Kong, Hong Kong SAR
Hui Liu, City University of Hong Kong, Hong Kong SAR
Haoliang Li, Department of Electrical Engineering, City University of Hong Kong (AI Security; Information Forensics and Security; Machine Learning)
Shiqi Wang, City University of Hong Kong, Hong Kong SAR
Siwei Ma, Peking University, Beijing, China