🤖 AI Summary
To address the challenge of adapting general-purpose RAG systems to specialized domains—such as science, medicine, and engineering—where domain-specific annotated data is scarce, this paper proposes an unsupervised domain adaptation framework. Methodologically, it integrates instruction tuning, retrieval-augmented generation (RAG), and self-supervised question generation. Its core contributions are: (1) a joint question-answering and question-generation self-training paradigm, in which an LLM distills domain knowledge by posing and answering its own questions; and (2) a quality-aware filtering mechanism for synthetic data that enables high-fidelity domain adaptation without human annotation. Evaluated on 11 benchmarks, the method consistently outperforms strong baselines by 1.2–8.6% in answer accuracy, demonstrating robust adaptability across two model scales and three professional domains.
📝 Abstract
Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips the LLM with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes the LLM on instruction-following, question-answering, and search-related data. Then, it prompts the same LLM to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these self-generated synthetic examples, the LLM can improve its performance on domain-specific RAG tasks. Experiments on 11 datasets, spanning two backbone sizes and three domains, demonstrate that SimRAG outperforms baselines by 1.2–8.6%.
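The self-training loop described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `generate_question` and `answer_question` are toy stubs standing in for calls to the fine-tuned LLM, and the round-trip consistency check (keep a pair only if the model's answer to its own question matches the answer it proposed) is one plausible instance of the paper's quality filtering strategy.

```python
# Hypothetical sketch of a SimRAG-style synthetic QA pipeline.
# The two helper functions are toy stubs standing in for LLM calls,
# so the control flow is runnable end to end.

def generate_question(passage: str) -> tuple[str, str]:
    """Stub: the LLM proposes a (question, answer) pair grounded in `passage`."""
    key = passage.split()[0]
    return f"What does the passage say about {key}?", key

def answer_question(question: str, passage: str) -> str:
    """Stub: the same LLM answers the question given the passage as context."""
    return question.split()[-1].rstrip("?")

def build_synthetic_qa(corpus: list[str]) -> list[dict]:
    """Generate QA pairs from unlabeled passages, keeping only pairs that
    pass a round-trip consistency filter: the model's answer to its own
    question must agree with the answer it originally proposed."""
    kept = []
    for passage in corpus:
        question, answer = generate_question(passage)
        if answer_question(question, passage) == answer:  # quality filter
            kept.append({"question": question,
                         "answer": answer,
                         "context": passage})
    return kept
```

The retained `(question, answer, context)` triples would then serve as training examples for a further round of domain-specific fine-tuning.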