🤖 AI Summary
To address the challenge of adapting general-purpose RAG systems to specialized domains—such as science, medicine, and engineering—where domain-specific annotated data is scarce, this paper proposes an unsupervised domain adaptation framework. Methodologically, it integrates instruction tuning, retrieval-augmented generation (RAG), and self-supervised question generation. Its core contributions are: (1) a joint question-answering and question-generation self-training paradigm, in which an LLM distills domain knowledge by posing and answering its own questions; and (2) a quality-aware filtering mechanism for synthetic data that enables high-fidelity domain adaptation without human annotation. Evaluated on 11 benchmarks, the method consistently outperforms strong baselines by 1.2–8.6% in answer accuracy, demonstrating robust adaptability across two model scales and three professional domains.
📝 Abstract
Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips the LLM with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes the LLM on instruction-following, question-answering, and search-related data. Then, it prompts the same LLM to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these self-generated synthetic examples, the LLM can improve its performance on domain-specific RAG tasks. Experiments on 11 datasets, spanning two backbone sizes and three domains, demonstrate that SimRAG outperforms baselines by 1.2–8.6%.
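The self-training loop described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `generate_question` and `answer_question` are toy stubs standing in for calls to the fine-tuned LLM, and the round-trip consistency check (keep a pair only if the model's answer to its own question matches the answer it proposed) is one plausible instance of the paper's quality filtering strategy.

```python
# Hypothetical sketch of a SimRAG-style synthetic QA pipeline.
# The two helper functions are toy stubs standing in for LLM calls,
# so the control flow is runnable end to end.

def generate_question(passage: str) -> tuple[str, str]:
    """Stub: the LLM proposes a (question, answer) pair grounded in `passage`."""
    key = passage.split()[0]
    return f"What does the passage say about {key}?", key

def answer_question(question: str, passage: str) -> str:
    """Stub: the same LLM answers the question given the passage as context."""
    return question.split()[-1].rstrip("?")

def build_synthetic_qa(corpus: list[str]) -> list[dict]:
    """Generate QA pairs from unlabeled passages, keeping only pairs that
    pass a round-trip consistency filter: the model's answer to its own
    question must agree with the answer it originally proposed."""
    kept = []
    for passage in corpus:
        question, answer = generate_question(passage)
        if answer_question(question, passage) == answer:  # quality filter
            kept.append({"question": question,
                         "answer": answer,
                         "context": passage})
    return kept
```

The retained `(question, answer, context)` triples would then serve as training examples for a further round of domain-specific fine-tuning.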